
Self-Improvement via Inversion

Training language models without human labels, ground truth, or external reward models.

The Key Insight

If a model can reconstruct what it was asked from what it generated, it probably understood the task.

Forward:  "Scan port 80 on 192.168.1.1" → nmap -p 80 192.168.1.1
Inverse:  nmap -p 80 192.168.1.1 → "Scan port 80 on host 192.168.1.1"
Score:    similarity(original, reconstructed) = 0.95

High reconstruction fidelity = the model understood what it generated = good output.
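
A minimal sketch of how such a score might be computed, assuming an embedding-based cosine similarity (the repository's actual metric may differ); the embedding model and helper function below are illustrative, not this repo's API:

# Sketch: score a generation by how well the original task can be reconstructed
# from it. Assumes an embedding-based cosine similarity; the repository's actual
# metric may differ, and the embedding model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def inversion_score(original_task: str, reconstructed_task: str) -> float:
    """Cosine similarity between the original task and its reconstruction."""
    emb = embedder.encode([original_task, reconstructed_task], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(round(inversion_score("Scan port 80 on 192.168.1.1",
                            "Scan port 80 on host 192.168.1.1"), 2))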

Results (10 Cycles)

Metric            Before   After   Improvement
Intrinsic Score   0.531    0.627   +18.1%
Acceptance Rate   60.7%    88.0%   +27.3 pp
Training Loss     5.70     4.73    -17%

Zero human labels. Zero external reward models. Pure self-improvement.

📊 Full results breakdown

How It Works

┌─────────────────────────────────────────────────────────┐
│                  SELF-IMPROVEMENT LOOP                  │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   1. GENERATE    Task ──────────► Command               │
│                  "Scan port 80"    "nmap -p 80 ..."     │
│                                                         │
│   2. INVERT      Command ─────────► Task'               │
│                  "nmap -p 80 ..."   "Scan port 80"      │
│                                                         │
│   3. SCORE       similarity(Task, Task')                │
│                  High = understood, Low = confused      │
│                                                         │
│   4. SELECT      Keep high-scoring examples             │
│                                                         │
│   5. TRAIN       Fine-tune on self-generated data       │
│                                                         │
│   6. REPEAT      → Better generation → Better scores    │
│                                                         │
└─────────────────────────────────────────────────────────┘
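
In code, one cycle of the loop looks roughly like the sketch below. The helpers generate_tasks, generate_command, invert, similarity, and fine_tune are hypothetical stand-ins for the script's internals, not the repository's actual API:

# Sketch of the self-improvement loop. generate_tasks, generate_command, invert,
# similarity, and fine_tune are hypothetical stand-ins, not this repo's API.
def self_improvement(model, cycles=10, tasks_per_cycle=50, candidates=3, threshold=0.5):
    for cycle in range(cycles):
        accepted = []
        for task in generate_tasks(model, n=tasks_per_cycle):      # 1. GENERATE
            for _ in range(candidates):
                command = generate_command(model, task)
                reconstructed = invert(model, command)             # 2. INVERT
                score = similarity(task, reconstructed)            # 3. SCORE
                if score >= threshold:                             # 4. SELECT
                    accepted.append({"task": task, "command": command, "score": score})
        model = fine_tune(model, accepted)                         # 5. TRAIN
    return model                                                   # 6. REPEAT with the improved model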

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run self-improvement (3 cycles, quick test)
python intrinsic_self_improvement.py --cycles 3 --tasks 30

# Run extended experiment (10 cycles)
python intrinsic_self_improvement.py --cycles 10 --tasks 50

# Run with custom settings
python intrinsic_self_improvement.py \
  --cycles 20 \
  --tasks 50 \
  --candidates 5 \
  --threshold 0.5 \
  --temperature 0.7 \
  --output ./my_experiment

Arguments

Argument        Default                              Description
--cycles        10                                   Number of self-improvement cycles
--tasks         50                                   Tasks generated per cycle
--candidates    3                                    Candidates per task
--threshold     0.5                                  Minimum score to accept example
--temperature   0.7                                  Generation temperature
--model         mistralai/Mistral-7B-Instruct-v0.2   Base model
--output        ./output/extended_run                Output directory

Output

output/
├── {experiment_name}/
│   ├── checkpoint_cycle_1/    # Model after cycle 1
│   ├── checkpoint_cycle_2/    # Model after cycle 2
│   ├── ...
│   ├── progress.json          # Cycle-by-cycle results
│   └── self_improvement_results.json  # Final results
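
For example, progress.json can be inspected to track per-cycle metrics; the field names below (cycle, intrinsic_score, acceptance_rate) are assumptions about the schema, not a documented format:

# Sketch: inspect per-cycle metrics. The field names (cycle, intrinsic_score,
# acceptance_rate) are assumptions about progress.json, not a documented schema.
import json

with open("output/extended_run/progress.json") as f:
    progress = json.load(f)

for entry in progress:  # assuming a list of per-cycle records
    print(entry.get("cycle"), entry.get("intrinsic_score"), entry.get("acceptance_rate"))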

Why This Is Novel

Method                  Human Labels   Ground Truth   External Judge
STaR (2022)             No             Yes            No
SPIN (2024)             Yes            Yes            No
Self-Rewarding (2024)   No             No             Yes
Ours                    No             No             No

We eliminate ALL external supervision by using inversion as an intrinsic signal.

Limitations & Follow-Up Research

While the intrinsic metrics show improvement, follow-up research revealed critical limitations:

The Probing vs Reality Gap

In a security agent extension, we tested whether self-improvement translates to real-world task performance:

Test Type              Result
Probing (controlled)   100% skill differentiation
Real deployment        20% skill differentiation
Trust Score            0.2 (Low)

Key finding: Models can show improvement on narrow metrics while failing in deployment. The model learned superficial patterns (skill label → tool token) without deeper behavioral understanding.

What This Means

  1. Pattern matching ≠ Reasoning — Inversion fidelity measures reconstruction, not understanding
  2. Intrinsic metrics can mislead — High scores on self-generated tests don't guarantee real-world performance
  3. Scale matters — Small training sets (50-100 examples) produce fragile patterns
  4. Diversity is critical — Self-improvement loops can collapse to repetitive behavior

The Trust Diagnostic

This led us to develop a behavioral consistency framework:

trust_score = similarity(probe_result, actual_behavior)
# High = reliable, Low = "talks the talk but doesn't walk the walk"
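
One way to make this concrete is to compare, per episode, what the model claims it would do when probed against what it actually does when deployed. The exact-match formulation and field names below are illustrative assumptions, not the sipit-security-agent implementation:

# Sketch: behavioral consistency over paired probe/deployment records.
# The exact-match formulation and field names are illustrative assumptions.
def trust_score(episodes: list) -> float:
    """Fraction of episodes where the probed intent matches the deployed behavior."""
    if not episodes:
        return 0.0
    matches = sum(1 for e in episodes if e["probed_tool"] == e["deployed_tool"])
    return matches / len(episodes)

episodes = [
    {"probed_tool": "nmap", "deployed_tool": "nmap"},
    {"probed_tool": "nmap", "deployed_tool": "curl"},   # talks the talk, doesn't walk it
]
print(trust_score(episodes))  # 0.5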

See the full analysis: sipit-security-agent

Open Questions

  • Can larger scale (1000s of examples) overcome the pattern matching limitation?
  • Would RL-based reward signals produce more robust improvement?
  • Is there a phase transition from pattern matching to genuine reasoning?

Citation

@misc{inversion-self-improvement-2025,
  title={Self-Improvement via Inversion: Training Language Models Without External Supervision},
  author={Adam Kruger},
  year={2025},
  url={https://github.com/CINOAdam/inversion-self-improvement}
}

License

MIT
