Feature Request: --samples flag for statistical sampling across model runs #2

@EmZod

Summary

Add an optional --samples <n> (or -n) flag that runs each model N times per focus session, then aggregates the results with statistical analysis. Repeated sampling lets reported confusion rates converge on each model's true behavior distribution (the law of large numbers), rather than relying on a single noisy run.

Problem

Currently, each model is queried once per focus session. Single samples are noisy, since model responses have inherent variance due to:

  • Temperature/sampling randomness
  • Prompt sensitivity
  • Context window edge effects

A single run may surface an atypical confusion or miss a common failure mode. This makes it hard to distinguish signal (systematic model limitations) from noise (one-off artifacts).

Proposed Solution

CLI Interface

# Run each model 10 times, aggregate results
bun run src/cli.ts test ./skill.md "Generate a podcast" --samples 10

# Short flag
bun run src/cli.ts test ./skill.md "Generate a podcast" -n 10

# Combined with model selection
bun run src/cli.ts test ./skill.md "task" -m openai/gpt-5,anthropic/claude-sonnet-4.5 -n 5

Output Changes

The summary should show aggregated statistics:

Focus Group Test (10 samples per model)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: SKILL.md
Task:  Generate a podcast with 3 voices
Models: 2 × 10 samples = 20 runs

  qwen/qwen3-coder:free               [10/10 OK]  143s  $0.00
  meta-llama/llama-3.2-3b-instruct    [9/10 OK]   248s  $0.00

Summary (aggregated across samples)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Completed: 19/20 runs

Confusion Frequency:
  • "voice path format unclear"           — 18/20 (90%) ±6%
  • "no command to list available voices" — 12/20 (60%) ±11%
  • "chunk duration not specified"        —  4/20 (20%) ±9%

Suggested Improvements (by frequency):
  1. Add explicit full path example for --voice flag (95%)
  2. Include a command to list available voices (60%)
  3. Specify default chunk duration (25%)

Statistical Notes:
  • 10 samples gives roughly ±16% standard error (≈ ±31% at 95% CI) at a 50% base rate
  • Confusions appearing <20% may be noise artifacts

Detailed Results

The show command should support drilling into individual samples:

# Show aggregated view (default)
bun run src/cli.ts show latest

# Show all individual samples
bun run src/cli.ts show latest --all-samples

# Show samples for specific model
bun run src/cli.ts show latest --model qwen/qwen3-coder:free

Implementation Considerations

Database Schema

Extend the run storage to track sample relationships:

  • Add sample_group_id (ULID) to group samples from same test invocation
  • Add sample_index (1-N) to identify individual samples
  • Preserve backward compatibility: sample_size=1 is the current behavior
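A possible shape for the stored rows, sketched below. Field and type names here are illustrative, not the project's actual schema:

```typescript
// Hypothetical row shape after the schema change; names are illustrative.
interface RunRow {
  id: string;            // existing per-run identifier
  sampleGroupId: string; // ULID shared by every sample from one `test` invocation
  sampleIndex: number;   // 1-based position within the group (1..N)
  model: string;
}

// Group rows back into their test invocations, e.g. for the aggregated view.
function bySampleGroup(rows: RunRow[]): Map<string, RunRow[]> {
  const groups = new Map<string, RunRow[]>();
  for (const row of rows) {
    const list = groups.get(row.sampleGroupId) ?? [];
    list.push(row);
    groups.set(row.sampleGroupId, list);
  }
  return groups;
}
```

A legacy single-sample run is then just a group of size one with sampleIndex = 1, which keeps backward compatibility trivial.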

Aggregation Logic

For combining confusions/suggestions across samples:

  1. Semantic deduplication: Cluster similar confusions (e.g., "voice path unclear" ≈ "voice file path not specified")
  2. Frequency counting: Report what % of samples surfaced each confusion
  3. Confidence intervals: Show uncertainty bounds based on sample size
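Steps 2 and 3 could look roughly like the sketch below, assuming step 1 (semantic deduplication) has already mapped each raw confusion string to a canonical label. The type and function names are hypothetical:

```typescript
// One per-sample report after deduplication: canonical confusion labels only.
type SampleReport = { confusions: string[] };

function confusionFrequencies(samples: SampleReport[]) {
  const counts = new Map<string, number>();
  for (const s of samples) {
    // Count each canonical label at most once per sample.
    for (const label of new Set(s.confusions)) {
      counts.set(label, (counts.get(label) ?? 0) + 1);
    }
  }
  const n = samples.length;
  return [...counts.entries()]
    .map(([label, k]) => {
      const p = k / n;
      // One binomial standard error, matching the ± figures in the mock output.
      return { label, count: k, rate: p, se: Math.sqrt((p * (1 - p)) / n) };
    })
    .sort((a, b) => b.rate - a.rate);
}
```

Sorting by rate gives the "by frequency" ordering the summary output uses.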

Parallelization

  • Run samples in parallel where API rate limits allow
  • Add --parallel <n> flag to control concurrency (default: 3)
  • Show progress: [7/20 runs complete]
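A minimal worker-pool limiter is enough here; `limit` would come from the proposed --parallel flag. This is a sketch, not the project's runner:

```typescript
// Run async tasks with at most `limit` in flight; results keep task order.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  let done = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // safe: no await between the bounds check and increment
      results[i] = await tasks[i]();
      done++;
      console.log(`[${done}/${tasks.length} runs complete]`); // progress line
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, () => worker()),
  );
  return results;
}
```

Per-model rate limits could be layered on top by giving each model its own task queue.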

Cost Estimation

Update cost display to reflect total:

Estimated cost: 2 models × 10 samples × $0.002 = ~$0.04
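The arithmetic is trivial, but a single helper (hypothetical; the per-run price would come from existing per-model pricing data) keeps the display consistent across flag combinations:

```typescript
// Hypothetical helper: total estimated cost for a multi-sample invocation.
const estimatedCostUSD = (models: number, samples: number, perRunUSD: number) =>
  models * samples * perRunUSD;
```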

Use Cases

  1. Pre-ship validation: Run 20+ samples before releasing critical skill docs to catch rare failure modes
  2. A/B testing doc changes: Compare confusion rates between doc versions with statistical significance
  3. Model benchmarking: Characterize which models systematically struggle with your doc style
  4. Regression detection: Track if confusion rates change across skill doc iterations

Alternatives Considered

  1. Manual re-runs: User runs test command multiple times manually → tedious, no aggregation
  2. External scripting: Wrap CLI in bash loop → loses structured aggregation, database coherence
  3. Always multi-sample: Remove single-sample mode → too slow/expensive for iteration

Additional Context

The statistical foundation here is straightforward:

  • With n=10 samples, we can detect effects >30% with reasonable confidence
  • With n=30 samples, we approach the CLT threshold for normal approximations
  • Recommend documenting sample size guidance in README
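The sample-size guidance above follows from the normal approximation for a binomial proportion; a small helper (hypothetical, not in the codebase) makes the trade-off concrete:

```typescript
// n = z^2 * p * (1 - p) / e^2: normal-approximation sample size for a
// binomial proportion with base rate p and desired CI half-width e.
// z = 1.96 corresponds to 95% confidence.
function samplesNeeded(baseRate: number, halfWidth: number, z = 1.96): number {
  return Math.ceil((z * z * baseRate * (1 - baseRate)) / (halfWidth * halfWidth));
}
```

For a 50% base rate, pinning the rate to within ±30% at 95% confidence needs about 11 runs (hence n=10 targets effects larger than ~30%), while ±10% would need closer to 100.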

Labels: enhancement
