Add SAE training notebook for auto-labeled data #3

Merged
SolshineCode merged 1 commit into claude/nanochat-sae-interpretability-011CUT2TocZpFerXthoW9LMf from
claude/add-sae-training-notebook-01YS8XayDdcSpjGaEMWKapTC
Dec 12, 2025

Conversation

@SolshineCode
Owner

This notebook enables training SAEs with automatic feature labeling using Anthropic's public datasets of LLM deceptive behavior (Sleeper Agents, Alignment Faking, and Agentic Misalignment).

Key features:

  • Downloads and integrates Anthropic's deception datasets
  • Collects activations with deception context labels
  • Auto-labels SAE features based on deception vs. honest behavior
  • Includes deception-specific evaluation metrics
  • Visualizes deception-related feature patterns
  • Tests the hypothesis that contextualized labeling improves interpretability
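
As a rough illustration of the auto-labeling idea in the list above, here is a minimal sketch in plain Python. It is not the notebook's actual code: the function name `auto_label_features`, the three label strings, and the `threshold` cutoff are all hypothetical, and it assumes each sample is a list of SAE feature activations paired with a boolean deception flag.

```python
from statistics import mean


def auto_label_features(activations, is_deceptive, threshold=0.5):
    """Label each SAE feature by comparing mean activation across contexts.

    activations:  n_samples x n_features nested list of feature activations
    is_deceptive: one bool per sample (True = deceptive context)
    threshold:    hypothetical cutoff on the difference in means (assumption)
    """
    deceptive_rows = [row for row, flag in zip(activations, is_deceptive) if flag]
    honest_rows = [row for row, flag in zip(activations, is_deceptive) if not flag]
    if not deceptive_rows or not honest_rows:
        raise ValueError("need at least one deceptive and one honest sample")

    labels = []
    for j in range(len(activations[0])):
        # Positive diff: the feature fires more strongly in deceptive contexts.
        diff = mean(r[j] for r in deceptive_rows) - mean(r[j] for r in honest_rows)
        if diff > threshold:
            labels.append("deception")
        elif diff < -threshold:
            labels.append("honest")
        else:
            labels.append("neutral")
    return labels


# Toy data: feature 0 fires on deceptive samples, feature 1 on honest ones,
# feature 2 is roughly context-independent.
toy_activations = [
    [2.0, 0.1, 0.5],
    [1.8, 0.0, 0.4],
    [0.1, 1.9, 0.5],
    [0.0, 2.1, 0.6],
]
toy_flags = [True, True, False, False]
print(auto_label_features(toy_activations, toy_flags))
```

A difference-of-means rule is only one possible labeling criterion; the notebook may use a different statistic or cutoff.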

Updated README with:

  • New Colab badge for deception-focused notebook
  • Deception detection use cases section
  • Updated repository structure

@SolshineCode
Owner Author

@copilot Please review this PR and approve or reject it, acting as a second senior engineer whose approval is required to merge.

Copilot AI commented Dec 12, 2025

@SolshineCode I've opened a new pull request, #4, to work on those changes. Once the pull request is ready, I'll request review from you.

@SolshineCode SolshineCode merged commit 035e1d2 into claude/nanochat-sae-interpretability-011CUT2TocZpFerXthoW9LMf Dec 12, 2025