
Add mech interp examples & utils #31

Draft
rsmith49 wants to merge 5 commits into main from contrib-mi/add-transformer-lens

@rsmith49 (Contributor) commented Jan 16, 2026

User description

In progress - current implementation is mostly vibe coded, will clean up and merge


Generated description

Below is a concise technical summary of the changes proposed in this PR:

graph LR
main_("main"):::added
HookedTransformerLLMClient_("HookedTransformerLLMClient"):::added
ActivationCapture_("ActivationCapture"):::added
TrajectoryActivations_("TrajectoryActivations"):::added
InterventionManager_("InterventionManager"):::added
create_zero_ablation_hook_("create_zero_ablation_hook"):::added
main_ -- "Adds hooked-transformer LLM client enabling local inference with hooks." --> HookedTransformerLLMClient_
main_ -- "Captures activations per step via ActivationCapture context manager." --> ActivationCapture_
ActivationCapture_ -- "get_trajectory returns TrajectoryActivations aggregating step ActivationCaches." --> TrajectoryActivations_
main_ -- "Uses TrajectoryActivations to access/save per-step activation data." --> TrajectoryActivations_
main_ -- "Applies InterventionManager to add/remove hook-based interventions during runs." --> InterventionManager_
InterventionManager_ -- "Uses zero-ablation hook to zero specific attention heads." --> create_zero_ablation_hook_
classDef added stroke:#15AA7A
classDef removed stroke:#CD5270
classDef modified stroke:#EDAC4C
linkStyle default stroke:#CBD5E1,font-size:13px

Integrates TransformerLens into the ARES framework to enable trajectory-level mechanistic interpretability for agents. Introduces specialized components for capturing model activations and performing causal interventions during agent execution.
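
As a rough illustration of the capture side of this summary, the following dependency-free sketch shows the general pattern the graph describes: a context manager registers recording hooks, each agent step's activations are stored separately, and the hooks are removed on exit. It does not use TransformerLens or any code from this PR. `ToyModel`, `Trajectory`, and `capture` are hypothetical names, and the real `ActivationCapture` and `TrajectoryActivations` signatures are not shown in this conversation.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


class ToyModel:
    """Hypothetical stand-in for a hooked model with two named hook points."""

    def __init__(self):
        self.hooks = {}  # hook-point name -> list of callables

    def _fire(self, name, value):
        # A hook may observe the value and optionally return a replacement.
        for fn in self.hooks.get(name, []):
            out = fn(value, name)
            if out is not None:
                value = out
        return value

    def forward(self, x):
        h = self._fire("layer0.out", [2.0 * v for v in x])
        return self._fire("layer1.out", [v + 1.0 for v in h])


@dataclass
class Trajectory:
    """Hypothetical aggregate: one {hook name: activation} dict per agent step."""

    steps: list = field(default_factory=list)

    def start_step(self):
        self.steps.append({})


@contextmanager
def capture(model, names):
    """Record activations at the given hook points while the block is active."""
    traj = Trajectory()

    def record(value, name):
        # Store a copy so later in-place changes cannot alter the record.
        traj.steps[-1][name] = list(value)

    for n in names:
        model.hooks.setdefault(n, []).append(record)
    try:
        yield traj
    finally:
        # Always detach the hooks, even if the agent loop raised.
        for n in names:
            model.hooks[n].remove(record)


model = ToyModel()
with capture(model, ["layer0.out", "layer1.out"]) as traj:
    for observation in ([1.0, 2.0], [3.0, 4.0]):  # two toy "agent steps"
        traj.start_step()
        model.forward(observation)

print(len(traj.steps), traj.steps[0])
```

Keeping one dictionary per step is what makes the analysis trajectory-level: activations from different agent steps stay separate and can be compared or saved step by step.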

Topic | Details

Examples & Docs | Provides a comprehensive guide and a runnable example demonstrating how to use ActivationCapture and InterventionManager for analyzing agent behavior.
Modified files (3)
  • examples/07_mech_interp_hooked_transformer.py
  • pyproject.toml
  • src/ares/contrib/mech_interp/README.md
Latest Contributors (2)
User | Commit | Date
joshua.greaves@gmail.com | Bump-ARES-to-0.0.2-72 | January 29, 2026
ryan@withmartian.com | Add-Tinker-Example-58 | January 29, 2026

Interpretability Core | Implements the HookedTransformerLLMClient and ActivationCapture to facilitate activation tracking and hook-based interventions within ARES environments.
Modified files (4)
  • src/ares/contrib/mech_interp/__init__.py
  • src/ares/contrib/mech_interp/activation_capture.py
  • src/ares/contrib/mech_interp/hook_utils.py
  • src/ares/contrib/mech_interp/hooked_transformer_client.py
Latest Contributors (0)
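
For the intervention side, here is a minimal sketch of what zero-ablating an attention head means: the hook returns a copy of the per-head activation with the chosen heads set to zero, and a small registry adds and clears hooks by hook-point name. This is an illustration in plain Python lists, not the PR's implementation. `make_zero_ablation_hook` and `InterventionRegistry` are hypothetical, and the `activation[pos][head]` layout and the hook-point name string are assumptions chosen to resemble a per-head attention output.

```python
def make_zero_ablation_hook(heads):
    """Return a hook that zeroes the given head indices.

    Assumed layout: activation[pos][head] is a list of d_head floats.
    """
    heads = set(heads)

    def hook(activation, hook_name):
        # Build a new structure instead of mutating the input in place.
        return [
            [[0.0] * len(vec) if h in heads else list(vec)
             for h, vec in enumerate(per_head)]
            for per_head in activation
        ]

    return hook


class InterventionRegistry:
    """Hypothetical add/clear manager for hooks keyed by hook-point name."""

    def __init__(self):
        self._hooks = {}

    def add(self, name, fn):
        self._hooks.setdefault(name, []).append(fn)

    def clear(self):
        self._hooks.clear()

    def apply(self, name, activation):
        # Run every hook registered for this hook point, in order.
        for fn in self._hooks.get(name, []):
            activation = fn(activation, name)
        return activation


# Toy activation: 2 positions, 3 heads, d_head = 2.
z = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 1.0], [2.0, 3.0]],
]

HOOK_POINT = "blocks.0.attn.hook_z"  # illustrative name only
registry = InterventionRegistry()
registry.add(HOOK_POINT, make_zero_ablation_hook([1]))
ablated = registry.apply(HOOK_POINT, z)   # head 1 zeroed at every position

registry.clear()
restored = registry.apply(HOOK_POINT, z)  # no hooks left: unchanged

print(ablated[0])
```

Comparing a run with the hook installed against a clean run is the usual way to estimate how much a given head contributes to the behavior under study.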

@rsmith49 rsmith49 force-pushed the contrib-mi/add-transformer-lens branch from d852b59 to 9de6668 Compare January 22, 2026 05:37
title = {ARES Mechanistic Interpretability Module},
author = {Martian},
year = {2025},
url = {https://github.com/anthropics/ares}
Contributor Author

LOL I see you claude code

@rsmith49 rsmith49 force-pushed the contrib-mi/add-transformer-lens branch from 5953af7 to 90c5138 Compare January 30, 2026 19:31
1 participant