Add SAE training notebook for auto-labeled data#3
Merged
SolshineCode merged 1 commit into claude/nanochat-sae-interpretability-011CUT2TocZpFerXthoW9LMf from Dec 12, 2025
Conversation
This notebook enables training SAEs with automatic feature labeling using Anthropic's public datasets of LLM deceptive behavior (Sleeper Agents, Alignment Faking, and Agentic Misalignment).

Key features:
- Downloads and integrates Anthropic's deception datasets
- Collects activations with deception context labels
- Auto-labels SAE features based on deception vs. honest behavior
- Includes deception-specific evaluation metrics
- Visualizes deception-related feature patterns
- Tests the hypothesis that contextualized labeling improves interpretability

Updated README with:
- New Colab badge for the deception-focused notebook
- Deception detection use cases section
- Updated repository structure
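The auto-labeling step could work along these lines: compare how often each SAE feature fires on deception-context examples versus honest-context examples, and label features whose firing rates differ sharply. This is a minimal illustrative sketch, not the notebook's actual implementation; the function name, thresholds, and labeling rule are assumptions.

```python
import numpy as np

def auto_label_features(acts, is_deceptive, ratio=2.0, min_rate=0.01):
    """Contrastively label SAE features by activation frequency.

    acts:         (n_examples, n_features) array of SAE feature activations.
    is_deceptive: (n_examples,) boolean mask; True for deception-context rows.
    Returns one label per feature: "deception", "honest", or "unlabeled".
    (The `ratio` and `min_rate` thresholds here are illustrative choices.)
    """
    fired = acts > 0  # treat any positive activation as "feature fired"
    dec_rate = fired[is_deceptive].mean(axis=0)   # firing rate in deceptive contexts
    hon_rate = fired[~is_deceptive].mean(axis=0)  # firing rate in honest contexts
    labels = []
    for d, h in zip(dec_rate, hon_rate):
        if max(d, h) < min_rate:
            labels.append("unlabeled")  # feature rarely fires at all
        elif d >= ratio * max(h, 1e-9):
            labels.append("deception")  # fires mostly in deceptive contexts
        elif h >= ratio * max(d, 1e-9):
            labels.append("honest")     # fires mostly in honest contexts
        else:
            labels.append("unlabeled")  # no clear preference
    return labels
```

A feature that fires only on deceptive prompts would get the "deception" label, one that fires everywhere stays "unlabeled"; the deception-specific evaluation metrics could then be computed over the labeled subsets.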
@copilot Please review and approve or disapprove as a second Sr Engineer whose approval is required to merge this PR.
@SolshineCode I've opened a new pull request, #4, to work on those changes. Once the pull request is ready, I'll request review from you.
035e1d2 into claude/nanochat-sae-interpretability-011CUT2TocZpFerXthoW9LMf