Add SAE training notebook for auto-labeled data#3
Merged
SolshineCode merged 1 commit into claude/nanochat-sae-interpretability-011CUT2TocZpFerXthoW9LMf from Dec 12, 2025
Conversation
This notebook enables training SAEs with automatic feature labeling using Anthropic's public datasets of LLM deceptive behavior (Sleeper Agents, Alignment Faking, and Agentic Misalignment).

Key features:
- Downloads and integrates Anthropic's deception datasets
- Collects activations with deception context labels
- Auto-labels SAE features based on deception vs. honest behavior
- Includes deception-specific evaluation metrics
- Visualizes deception-related feature patterns
- Tests the hypothesis that contextualized labeling improves interpretability

Updated README with:
- New Colab badge for the deception-focused notebook
- Deception detection use cases section
- Updated repository structure
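The auto-labeling step could work along these lines: compare how often each SAE feature fires on deception-context examples versus honest-context examples, and label features whose firing rates differ sharply. This is a minimal illustrative sketch, not the notebook's actual implementation; the function name, thresholds, and labeling rule are assumptions.

```python
import numpy as np

def auto_label_features(acts, is_deceptive, ratio=2.0, min_rate=0.01):
    """Contrastively label SAE features by activation frequency.

    acts:         (n_examples, n_features) array of SAE feature activations.
    is_deceptive: (n_examples,) boolean mask; True for deception-context rows.
    Returns one label per feature: "deception", "honest", or "unlabeled".
    (The `ratio` and `min_rate` thresholds here are illustrative choices.)
    """
    fired = acts > 0  # treat any positive activation as "feature fired"
    dec_rate = fired[is_deceptive].mean(axis=0)   # firing rate in deceptive contexts
    hon_rate = fired[~is_deceptive].mean(axis=0)  # firing rate in honest contexts
    labels = []
    for d, h in zip(dec_rate, hon_rate):
        if max(d, h) < min_rate:
            labels.append("unlabeled")  # feature rarely fires at all
        elif d >= ratio * max(h, 1e-9):
            labels.append("deception")  # fires mostly in deceptive contexts
        elif h >= ratio * max(d, 1e-9):
            labels.append("honest")     # fires mostly in honest contexts
        else:
            labels.append("unlabeled")  # no clear preference
    return labels
```

A feature that fires only on deceptive prompts would get the "deception" label, one that fires everywhere stays "unlabeled"; the deception-specific evaluation metrics could then be computed over the labeled subsets.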
@copilot Please review and approve or disapprove as a second Sr Engineer whose approval is required to merge this PR.
@SolshineCode I've opened a new pull request, #4, to work on those changes. Once the pull request is ready, I'll request review from you.
035e1d2 into claude/nanochat-sae-interpretability-011CUT2TocZpFerXthoW9LMf