Feature Request: --samples flag for statistical sampling across model runs
Summary
Add an optional --samples <n> (or -n) flag that runs each model N times per focus session, then aggregates results with statistical analysis. This enables users to leverage the law of large numbers to converge on the true expected behavior distribution for each model.
Problem
Currently, each model is queried once per focus session. Single samples are noisy—model responses have inherent variance due to:
- Temperature/sampling randomness
- Prompt sensitivity
- Context window edge effects
A single run may surface an atypical confusion or miss a common failure mode. This makes it hard to distinguish signal (systematic model limitations) from noise (one-off artifacts).
Proposed Solution
CLI Interface
# Run each model 10 times, aggregate results
bun run src/cli.ts test ./skill.md "Generate a podcast" --samples 10
# Short flag
bun run src/cli.ts test ./skill.md "Generate a podcast" -n 10
# Combined with model selection
bun run src/cli.ts test ./skill.md "task" -m openai/gpt-5,anthropic/claude-sonnet-4.5 -n 5

Output Changes
The summary should show aggregated statistics:
Focus Group Test (10 samples per model)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: SKILL.md
Task: Generate a podcast with 3 voices
Models: 2 × 10 samples = 20 runs
qwen/qwen3-coder:free [10/10 OK] 143s $0.00
meta-llama/llama-3.2-3b-instruct [9/10 OK] 248s $0.00
Summary (aggregated across samples)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Completed: 19/20 runs
Confusion Frequency:
• "voice path format unclear" — 18/20 (90%) ±6%
• "no command to list available voices" — 12/20 (60%) ±11%
• "chunk duration not specified" — 4/20 (20%) ±9%
Suggested Improvements (by frequency):
1. Add explicit full path example for --voice flag (95%)
2. Include a command to list available voices (60%)
3. Specify default chunk duration (25%)
Statistical Notes:
• Error bars are ±1 standard error, sqrt(p(1-p)/n): at a 50% base rate, 10 samples gives roughly ±16%, and pooling 20 runs across both models tightens this to about ±11%
• Confusions appearing in <20% of samples may be noise artifacts
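The error bars in the summary above follow from a simple binomial standard error. A minimal sketch (function names are illustrative, not existing CLI code):

```typescript
// Per-confusion frequency with a normal-approximation standard error.
interface ConfusionStats {
  count: number;  // samples that surfaced the confusion
  total: number;  // total completed runs
  rate: number;   // count / total
  stderr: number; // sqrt(p * (1 - p) / n)
}

function confusionStats(count: number, total: number): ConfusionStats {
  const rate = count / total;
  const stderr = Math.sqrt((rate * (1 - rate)) / total);
  return { count, total, rate, stderr };
}

// Render one summary line, e.g. `12/20 (60%) ±11%`.
function formatStats(s: ConfusionStats): string {
  const pct = (x: number) => `${Math.round(x * 100)}%`;
  return `${s.count}/${s.total} (${pct(s.rate)}) ±${pct(s.stderr)}`;
}
```

Note the shrinking error bars at extreme rates: a 90% confusion rate over 20 runs has a tighter bound than a 60% rate, which matches the example output.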
Detailed Results
The show command should support drilling into individual samples:
# Show aggregated view (default)
bun run src/cli.ts show latest
# Show all individual samples
bun run src/cli.ts show latest --all-samples
# Show samples for specific model
bun run src/cli.ts show latest --model qwen/qwen3-coder:free

Implementation Considerations
Database Schema
Extend the run storage to track sample relationships:
- Add sample_group_id (ULID) to group samples from the same test invocation
- Add sample_index (1-N) to identify individual samples
- Preserve backward compatibility: sample_size=1 is the current behavior
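One way the schema change could look, assuming a SQLite `runs` table (table and column names here are assumptions based on this issue, not the project's actual schema):

```typescript
// Hypothetical migration adding the sample-tracking columns.
const MIGRATION = `
ALTER TABLE runs ADD COLUMN sample_group_id TEXT;                     -- ULID shared by one test invocation
ALTER TABLE runs ADD COLUMN sample_index INTEGER NOT NULL DEFAULT 1;  -- 1..N within the group
`;

// Row shape after migration; pre-feature rows keep sample_index = 1,
// preserving single-sample behavior.
interface RunRow {
  id: string;
  model: string;
  sampleGroupId: string | null; // null for legacy rows
  sampleIndex: number;
}
```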
Aggregation Logic
For combining confusions/suggestions across samples:
- Semantic deduplication: Cluster similar confusions (e.g., "voice path unclear" ≈ "voice file path not specified")
- Frequency counting: Report what % of samples surfaced each confusion
- Confidence intervals: Show uncertainty bounds based on sample size
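The three steps above can be sketched together. This uses token-overlap (Jaccard) similarity as a stand-in for real semantic deduplication (an actual implementation might use embedding distance); the tokenizer and threshold are illustrative assumptions:

```typescript
function tokens(s: string): Set<string> {
  return new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((t) => b.has(t)).length;
  return inter / (a.size + b.size - inter);
}

// Cluster similar confusion strings across samples and count how many
// samples surfaced each cluster; the first phrasing seen becomes the label.
function aggregateConfusions(
  perSample: string[][],
  threshold = 0.3,
): Map<string, number> {
  const clusters: { label: string; toks: Set<string>; count: number }[] = [];
  for (const confusions of perSample) {
    for (const c of confusions) {
      const t = tokens(c);
      const hit = clusters.find((cl) => jaccard(cl.toks, t) >= threshold);
      if (hit) hit.count += 1;
      else clusters.push({ label: c, toks: t, count: 1 });
    }
  }
  return new Map(clusters.map((cl) => [cl.label, cl.count]));
}
```

With this sketch, "voice path unclear" and "voice file path not specified" share enough tokens to merge into one cluster, while unrelated confusions stay separate.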
Parallelization
- Run samples in parallel where API rate limits allow
- Add a --parallel <n> flag to control concurrency (default: 3)
- Show progress: [7/20 runs complete]
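A concurrency-limited runner covering both bullets might look like this (a sketch; the job functions stand in for the real per-sample API calls):

```typescript
// Run async jobs with at most `limit` in flight, reporting progress as
// each completes. Results keep their original order.
async function runWithLimit<T>(
  jobs: (() => Promise<T>)[],
  limit = 3,
  onProgress?: (done: number, total: number) => void,
): Promise<T[]> {
  const results: T[] = new Array(jobs.length);
  let next = 0;
  let done = 0;
  async function worker(): Promise<void> {
    while (next < jobs.length) {
      const i = next++; // claim the next job index (safe: single-threaded event loop)
      results[i] = await jobs[i]();
      done += 1;
      onProgress?.(done, jobs.length); // e.g. "[7/20 runs complete]"
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, jobs.length) }, worker),
  );
  return results;
}
```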
Cost Estimation
Update cost display to reflect total:
Estimated cost: 2 models × 10 samples × $0.002 = ~$0.04
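The arithmetic is trivial but worth centralizing; a sketch, where perRunCost would come from whatever pricing data the CLI already tracks (assumed here):

```typescript
// Total estimated cost: one run per (model, sample) pair.
function estimateCost(models: number, samples: number, perRunCost: number): number {
  return models * samples * perRunCost;
}
```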
Use Cases
- Pre-ship validation: Run 20+ samples before releasing critical skill docs to catch rare failure modes
- A/B testing doc changes: Compare confusion rates between doc versions with statistical significance
- Model benchmarking: Characterize which models systematically struggle with your doc style
- Regression detection: Track if confusion rates change across skill doc iterations
Alternatives Considered
- Manual re-runs: user runs the test command multiple times by hand → tedious, no aggregation
- External scripting: wrap the CLI in a bash loop → loses structured aggregation and database coherence
- Always multi-sample: remove single-sample mode → too slow/expensive for iteration
Additional Context
The statistical foundation here is straightforward:
- With n=10 samples, we can detect effects >30% with reasonable confidence
- With n=30 samples, we approach the CLT threshold for normal approximations
- Recommend documenting sample size guidance in README
Labels: enhancement