Training language models without human labels, ground truth, or external reward models.
If a model can reconstruct what it was asked from what it generated, it probably understood the task.
```
Forward:  "Scan port 80 on 192.168.1.1" → nmap -p 80 192.168.1.1
Inverse:  nmap -p 80 192.168.1.1 → "Scan port 80 on host 192.168.1.1"
Score:    similarity(original, reconstructed) = 0.95
```
High reconstruction fidelity = the model understood what it generated = good output.
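A minimal sketch of that scoring step. The `difflib` ratio below is just a readable stand-in for whatever similarity metric the repo actually uses, and `generate` / `invert` are hypothetical callables wrapping the forward and inverse model passes:

```python
import difflib

def similarity(original: str, reconstructed: str) -> float:
    """Stand-in similarity in [0, 1]; an embedding-based cosine similarity
    would be a natural drop-in replacement."""
    return difflib.SequenceMatcher(None, original.lower(), reconstructed.lower()).ratio()

def intrinsic_score(task: str, generate, invert) -> float:
    """Forward pass (task -> command), inverse pass (command -> task'),
    then reward the model by how faithfully the task was reconstructed."""
    command = generate(task)         # e.g. "nmap -p 80 192.168.1.1"
    reconstructed = invert(command)  # e.g. "Scan port 80 on host 192.168.1.1"
    return similarity(task, reconstructed)
```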
| Metric | Before | After | Improvement |
|---|---|---|---|
| Intrinsic Score | 0.531 | 0.627 | +18.1% |
| Acceptance Rate | 60.7% | 88.0% | +27.3pp |
| Training Loss | 5.70 | 4.73 | -17% |
Zero human labels. Zero external reward models. Pure self-improvement.
```
┌────────────────────────────────────────────────────────┐
│                 SELF-IMPROVEMENT LOOP                   │
├────────────────────────────────────────────────────────┤
│                                                         │
│  1. GENERATE   Task ──────────► Command                 │
│                "Scan port 80"   "nmap -p 80 ..."        │
│                                                         │
│  2. INVERT     Command ───────► Task'                   │
│                "nmap -p 80 ..." "Scan port 80"          │
│                                                         │
│  3. SCORE      similarity(Task, Task')                  │
│                High = understood, Low = confused        │
│                                                         │
│  4. SELECT     Keep high-scoring examples               │
│                                                         │
│  5. TRAIN      Fine-tune on self-generated data         │
│                                                         │
│  6. REPEAT     → Better generation → Better scores      │
│                                                         │
└────────────────────────────────────────────────────────┘
```
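The same loop as a hedged sketch. Every helper is passed in as a callable because the repo's internal function names are not documented here, and the keyword defaults simply mirror the CLI flags described below:

```python
def self_improvement_loop(model, generate_tasks, generate_command, invert_command,
                          similarity, fine_tune, cycles=10, tasks_per_cycle=50,
                          candidates=3, threshold=0.5):
    """One pass of GENERATE -> INVERT -> SCORE -> SELECT -> TRAIN per cycle."""
    for _ in range(cycles):
        accepted = []
        for task in generate_tasks(model, tasks_per_cycle):        # 1. GENERATE
            best_cmd, best_score = None, -1.0
            for _ in range(candidates):
                cmd = generate_command(model, task)
                reconstructed = invert_command(model, cmd)          # 2. INVERT
                score = similarity(task, reconstructed)             # 3. SCORE
                if score > best_score:
                    best_cmd, best_score = cmd, score
            if best_score >= threshold:                             # 4. SELECT
                accepted.append({"task": task, "command": best_cmd, "score": best_score})
        model = fine_tune(model, accepted)                          # 5. TRAIN, then repeat
    return model
```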
```bash
# Install dependencies
pip install -r requirements.txt

# Run self-improvement (3 cycles, quick test)
python intrinsic_self_improvement.py --cycles 3 --tasks 30

# Run extended experiment (10 cycles)
python intrinsic_self_improvement.py --cycles 10 --tasks 50

# Run with custom settings
python intrinsic_self_improvement.py \
    --cycles 20 \
    --tasks 50 \
    --candidates 5 \
    --threshold 0.5 \
    --temperature 0.7 \
    --output ./my_experiment
```

| Argument | Default | Description |
|---|---|---|
| `--cycles` | 10 | Number of self-improvement cycles |
| `--tasks` | 50 | Tasks generated per cycle |
| `--candidates` | 3 | Candidates per task |
| `--threshold` | 0.5 | Minimum score to accept example |
| `--temperature` | 0.7 | Generation temperature |
| `--model` | mistralai/Mistral-7B-Instruct-v0.2 | Base model |
| `--output` | ./output/extended_run | Output directory |
```
output/
├── {experiment_name}/
│   ├── checkpoint_cycle_1/              # Model after cycle 1
│   ├── checkpoint_cycle_2/              # Model after cycle 2
│   ├── ...
│   ├── progress.json                    # Cycle-by-cycle results
│   └── self_improvement_results.json    # Final results
```
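A quick way to inspect a finished run; the key names below are assumptions about what progress.json contains, not a documented schema:

```python
import json
from pathlib import Path

run_dir = Path("output/extended_run")   # or wherever --output pointed
progress = json.loads((run_dir / "progress.json").read_text())

# Assumed shape: a list with one record per cycle.
for record in progress:
    print(record.get("cycle"), record.get("intrinsic_score"), record.get("acceptance_rate"))
```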
| Method | Human Labels | Ground Truth | External Judge |
|---|---|---|---|
| STaR (2022) | No | Yes | No |
| SPIN (2024) | Yes | Yes | No |
| Self-Rewarding (2024) | No | No | Yes |
| Ours | No | No | No |
We eliminate ALL external supervision by using inversion as an intrinsic signal.
While the intrinsic metrics show improvement, follow-up research revealed critical limitations:
In a security agent extension, we tested whether self-improvement translates to real-world task performance:
| Test Type | Result |
|---|---|
| Probing (controlled) | 100% skill differentiation |
| Real deployment | 20% skill differentiation |
| Trust Score | 0.2 (Low) |
Key finding: Models can show improvement on narrow metrics while failing in deployment. The model learned superficial patterns (skill label → tool token) without deeper behavioral understanding.
- Pattern matching ≠ Reasoning — Inversion fidelity measures reconstruction, not understanding
- Intrinsic metrics can mislead — High scores on self-generated tests don't guarantee real-world performance
- Scale matters — Small training sets (50-100 examples) produce fragile patterns
- Diversity is critical — Self-improvement loops can collapse to repetitive behavior (a rough collapse monitor is sketched below)
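One illustrative way to watch for that collapse (not part of the released code) is to track how similar each cycle's self-generated tasks are to one another:

```python
import difflib
from itertools import combinations

def mean_pairwise_similarity(tasks: list[str]) -> float:
    """Average pairwise string similarity of a cycle's generated tasks.
    Values drifting toward 1.0 over cycles suggest the loop is
    regenerating near-duplicates rather than exploring."""
    pairs = list(combinations(tasks, 2))
    if not pairs:
        return 0.0
    return sum(difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```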
This led us to develop a behavioral consistency framework:
```python
trust_score = similarity(probe_result, actual_behavior)
# High = reliable, Low = "talks the talk but doesn't walk the walk"
```

See the full analysis: sipit-security-agent
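A minimal sketch of that consistency check, again with a pluggable similarity function; `probe_behaviors` and `deployed_behaviors` are assumed to be parallel lists of behavior traces from controlled probes and real runs:

```python
def trust_score(probe_behaviors, deployed_behaviors, similarity) -> float:
    """Behavioral consistency: how closely does what the model did under
    controlled probing match what it actually does when deployed?
    On this scale, the 0.2 reported above reads as low trust."""
    scores = [similarity(p, d) for p, d in zip(probe_behaviors, deployed_behaviors)]
    return sum(scores) / len(scores) if scores else 0.0
```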
- Can larger scale (1000s of examples) overcome the pattern matching limitation?
- Would RL-based reward signals produce more robust improvement?
- Is there a phase transition from pattern matching to genuine reasoning?
```bibtex
@misc{inversion-self-improvement-2025,
  title={Self-Improvement via Inversion: Training Language Models Without External Supervision},
  author={Adam Kruger},
  year={2025},
  url={https://github.com/CINOAdam/inversion-self-improvement}
}
```

MIT