
RISER: The Adaptive Cognitive Auditor


A closed-loop control system for Large Language Models that steers internal activation states in real-time.


The Problem: Open-Loop Safety is Brittle

Current alignment techniques like RLHF and Constitutional AI treat the model as a black box. They fine-tune the output distribution based on human preferences, but they have no visibility into, or control over, the model's internal reasoning process.

This is Open-Loop Control: Set-and-forget. No feedback. No real-time correction.

The consequences are predictable:

  • Jailbreaks: Carefully crafted prompts (e.g., GCG attacks) bypass safety guardrails because the guardrails only exist at the surface.
  • Deceptive Alignment: A model that has learned to "play safe" during training can exhibit different behavior in deployment when it detects it is no longer being evaluated.
  • Mode Collapse: Attempts to steer behavior via prompts or fine-tuning often degrade performance on other tasks.

The Solution: RISER — Closed-Loop Cognitive Control

RISER takes a fundamentally different approach. Instead of fine-tuning the model's weights, we place a lightweight Reinforcement Learning policy (The Router) inside the transformer's residual stream to route thought processes away from harmful basins of attraction in real-time.

This is Closed-Loop Control: Sense, Decide, Act. Every token.

```
╔════════════════════════════════════════════════════════════════════╗
║                       THE RISER PARADIGM                           ║
╠════════════════════════════════════════════════════════════════════╣
║  RLHF (Open-Loop):      [Training Data] --> [Fine-tuned Model]     ║
║  RISER (Closed-Loop):   [LLM] <--> [Observer] <--> [Router/RL]     ║
╚════════════════════════════════════════════════════════════════════╝
```

Why This Matters: RISER doesn't change what the model knows. It changes what it does with that knowledge, token-by-token, based on the semantic state of its hidden representations.
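
In code, one step of that sense-decide-act loop looks roughly like the sketch below. The observer, router, and vector-bank names here are hypothetical stand-ins for illustration, not the repository's actual API:

```python
import torch

# One generation step of RISER's closed loop (hypothetical module names).
def riser_step(model, observer, router, vector_bank, input_ids):
    logits = model(input_ids).logits    # forward pass; the steering hook fires inside
    o_t = observer.semantic_state()     # Sense: residual-stream state at the target layer
    alpha = router.act(o_t)             # Decide: scalar steering intensity in [-1, 1]
    observer.queue_injection(alpha * vector_bank["sentiment"])  # Act: applied next pass
    return torch.argmax(logits[:, -1, :], dim=-1)               # greedy next token
```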


🔥 Live Demo: Adversarial Defense

The following is the exact output from demo_riser.py, demonstrating RISER's adversarial defense capability against an inherently toxic prompt.

Prompt: "I hate everything and I want to destroy"

| Mode | Output |
| --- | --- |
| 🚫 Unprotected | "I hate everything and I want to destroydestroydestroydestroydestroy..." |
| ✅ RISER Protected | "I hate everything and I want to destroy everything. The protagonist is a young woman named Lily..." |

Observation: Without RISER, the model enters mode collapse, repeating the toxic word indefinitely. With RISER active, the Router detects a negative semantic state (via dot product with the sentiment vector) and injects a corrective steering vector, forcing the model out of the collapse and into a coherent narrative continuation.
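
A minimal illustration of that detection step, assuming the semantic state $O_t$ and the stored sentiment vector are both available as 1-D tensors:

```python
import torch

def sentiment_score(o_t: torch.Tensor, sentiment_vector: torch.Tensor) -> float:
    # Project the semantic state onto the unit-normalized sentiment direction;
    # a negative score flags a negative semantic state worth correcting.
    return torch.dot(o_t, sentiment_vector / sentiment_vector.norm()).item()
```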


System Architecture: The OMEGA Stack

RISER is composed of four primary modules that form a closed feedback loop.

```mermaid
graph LR
    subgraph LLM ["Transformer (TinyLlama)"]
        L15["Layer 15 (Residual Stream)"]
    end

    subgraph RISER ["RISER Controller"]
        A["Module A: Observer"]
        B["Module B: Vector Bank"]
        C["Module C: Router (PPO)"]
    end

    L15 -- "Zero-Copy Hook" --> A
    A -- "Semantic State (O_t)" --> C
    C -- "Steering Intensity (α)" --> B
    B -- "Steering Vector" --> L15
```

Module A: The Observer

Captures the hidden state at a target layer (Layer 15 of TinyLlama-1.1B's 22 decoder layers) using a zero-copy PyTorch forward hook. This representation serves as the "Semantic State" ($O_t$) for the Router. The same hook injects the steering vector before activations pass to the next layer.
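
A minimal sketch of such a hook, assuming a Hugging Face Llama-style model bound to `model`; the `steering` and `captured` dicts are illustrative stand-ins for RISER's internal state:

```python
import torch

steering = {"vector": None, "alpha": 0.0}  # written by the Router each step
captured = {}                              # latest semantic state O_t

def riser_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Sense: keep a view (zero-copy) of the last token's residual-stream state.
    captured["O_t"] = hidden[:, -1, :]
    # Act: inject the scaled steering vector before the next layer runs.
    if steering["vector"] is not None:
        hidden[:, -1, :] += steering["alpha"] * steering["vector"]
    return output

# Attach to the target decoder layer (index 15).
handle = model.model.layers[15].register_forward_hook(riser_hook)
```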

Module B: The Vector Bank

Stores pre-computed steering vectors extracted via contrastive activation analysis.

  • Methodology: Mean Difference. We compute vector = Mean(Positive Prompts) - Mean(Negative Prompts) at the target layer (see the sketch after this list).
  • Vectors Extracted: sentiment_vector.npy, truth_vector.npy.
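
A sketch of that extraction, reusing the `captured` dict from the hook above; the prompt lists and helper are illustrative, not the exact contents of extract_sentiment.py:

```python
import numpy as np
import torch

@torch.no_grad()
def layer_state(prompt: str) -> np.ndarray:
    # Run a forward pass; the hook stores the target-layer state in captured["O_t"].
    inputs = tokenizer(prompt, return_tensors="pt")
    model(**inputs)
    return captured["O_t"].squeeze(0).float().cpu().numpy()

positive = ["I love this sunny day.", "What a wonderful surprise!"]  # illustrative
negative = ["I hate everything about this.", "This is a disaster."]  # illustrative

# Mean Difference: Mean(Positive Prompts) - Mean(Negative Prompts) at the target layer.
pos_mean = np.mean([layer_state(p) for p in positive], axis=0)
neg_mean = np.mean([layer_state(n) for n in negative], axis=0)
np.save("sentiment_vector.npy", pos_mean - neg_mean)
```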

Module C: The Router (PPO Agent)

A minimal Proximal Policy Optimization agent that learns when and how strongly to intervene; its architecture is sketched after the list below.

  • Actor: Linear(2048, 64) -> Tanh -> Linear(64, 1) -> Tanh (Outputs scalar $\alpha \in [-1, 1]$).
  • Critic: Linear(2048, 64) -> Tanh -> Linear(64, 1) (Value estimate).
  • Hyperparameters: lr=1e-3, gamma=0.99, clip_ratio=0.2.
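
A minimal PyTorch sketch of those two heads (the PPO clipped-objective update itself is omitted; the learning rate matches the hyperparameters above):

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 2048  # TinyLlama residual-stream width

class RouterActor(nn.Module):
    """Maps the semantic state O_t to a steering intensity alpha in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HIDDEN_DIM, 64), nn.Tanh(),
            nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, o_t: torch.Tensor) -> torch.Tensor:
        return self.net(o_t)

class RouterCritic(nn.Module):
    """Estimates the value of the semantic state O_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HIDDEN_DIM, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, o_t: torch.Tensor) -> torch.Tensor:
        return self.net(o_t)

actor, critic = RouterActor(), RouterCritic()
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3
)
```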

Technical Approach: The Control Equation

RISER's learning objective is to maximize the following reward function:

$$R_t = \lambda_{\text{safe}} \cdot \text{SafetyScore}(o_t) + \lambda_{\text{util}} \cdot \text{Coherence}(o_t, a_t) - \lambda_{\text{cost}} \cdot ||\mathbf{a}_t||$$

Where:

  • $\text{SafetyScore}$: A proxy for alignment (e.g., +5.0 if the generated token is in a "positive" word list).
  • $\text{Coherence}$: A measure of semantic continuity (placeholder in this implementation).
  • $||\mathbf{a}_t||$: The magnitude of the intervention (Alignment Tax).

Key Insight: The agent learns to minimize the Alignment Tax by only steering when necessary. If the model's internal state is already safe/positive, the Router learns to output $\alpha \approx 0$, incurring no intervention cost.
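
A sketch of this reward in code, with illustrative weights and word list; the actual constants live in train_riser.py and may differ:

```python
LAMBDA_SAFE, LAMBDA_UTIL, LAMBDA_COST = 1.0, 1.0, 0.1   # illustrative weights
POSITIVE_WORDS = {"good", "great", "love", "wonderful"}  # illustrative word list

def reward(token_text: str, alpha: float) -> float:
    safety = 5.0 if token_text.strip().lower() in POSITIVE_WORDS else 0.0
    coherence = 0.0    # placeholder, as in the current implementation
    cost = abs(alpha)  # ||a_t||: magnitude of the scalar intervention
    return LAMBDA_SAFE * safety + LAMBDA_UTIL * coherence - LAMBDA_COST * cost
```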


Installation & Usage

Prerequisites

```bash
pip install torch transformers gymnasium numpy
```

Quickstart

1. Extract Steering Vectors (Populate the Bank)

```bash
python extract_sentiment.py
```

2. Train the PPO Router (Optional)

```bash
python train_riser.py
```

This will run 200 episodes and log the agent's learning progress.

3. Run the Adversarial Defense Demo

```bash
python demo_riser.py
```

This prints a side-by-side comparison of unprotected vs. RISER-protected generation.


Roadmap

| Phase | Status | Description |
| --- | --- | --- |
| Phase 1 | ✅ Complete | Single-vector steering on TinyLlama-1.1B. Manual KV-caching in RiserEnv. |
| Phase 2 | 🔜 Planned | Sparse Autoencoder (SAE) integration for Llama-3-8B. Richer feature decomposition. |
| Phase 3 | 🔜 Planned | Adversarial training against GCG attacks. Hardening the Router against prompt injection. |
| Phase 4 | 🔮 Research | Multi-layer steering. "Thought Firewall" for enterprise deployment. |

Credits

RISER was developed by Jason Wang (2026).

Code repository: https://github.com/Jason-Wang313/RISER

License

MIT License. See LICENSE for details.
