Feature Request: --samples flag for statistical sampling across model runs #2

@EmZod

Summary

Add an optional --samples <n> (or -n) flag that runs each model N times per focus session, then aggregates the results with statistical analysis. Repeated sampling lets reported confusion rates converge on each model's true behavior distribution (the law of large numbers), rather than relying on a single noisy run.

Problem

Currently, each model is queried once per focus session. Single samples are noisy, since model responses have inherent variance due to:

  • Temperature/sampling randomness
  • Prompt sensitivity
  • Context window edge effects

A single run may surface an atypical confusion or miss a common failure mode. This makes it hard to distinguish signal (systematic model limitations) from noise (one-off artifacts).

Proposed Solution

CLI Interface

# Run each model 10 times, aggregate results
bun run src/cli.ts test ./skill.md "Generate a podcast" --samples 10

# Short flag
bun run src/cli.ts test ./skill.md "Generate a podcast" -n 10

# Combined with model selection
bun run src/cli.ts test ./skill.md "task" -m openai/gpt-5,anthropic/claude-sonnet-4.5 -n 5

Output Changes

The summary should show aggregated statistics:

Focus Group Test (10 samples per model)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: SKILL.md
Task:  Generate a podcast with 3 voices
Models: 2 × 10 samples = 20 runs

  qwen/qwen3-coder:free               [10/10 OK]  143s  $0.00
  meta-llama/llama-3.2-3b-instruct    [9/10 OK]   248s  $0.00

Summary (aggregated across samples)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Completed: 19/20 runs

Confusion Frequency:
  • "voice path format unclear"           — 18/20 (90%) ±6%
  • "no command to list available voices" — 12/20 (60%) ±11%
  • "chunk duration not specified"        —  4/20 (20%) ±9%

Suggested Improvements (by frequency):
  1. Add explicit full path example for --voice flag (95%)
  2. Include a command to list available voices (60%)
  3. Specify default chunk duration (25%)

Statistical Notes:
  • 10 samples gives roughly ±16% standard error (≈ ±31% at 95% CI) at a 50% base rate
  • Confusions appearing <20% may be noise artifacts

Detailed Results

The show command should support drilling into individual samples:

# Show aggregated view (default)
bun run src/cli.ts show latest

# Show all individual samples
bun run src/cli.ts show latest --all-samples

# Show samples for specific model
bun run src/cli.ts show latest --model qwen/qwen3-coder:free

Implementation Considerations

Database Schema

Extend the run storage to track sample relationships:

  • Add sample_group_id (ULID) to group samples from same test invocation
  • Add sample_index (1-N) to identify individual samples
  • Preserve backward compatibility: sample_size=1 is the current behavior
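A possible shape for the stored rows, sketched below. Field and type names here are illustrative, not the project's actual schema:

```typescript
// Hypothetical row shape after the schema change; names are illustrative.
interface RunRow {
  id: string;            // existing per-run identifier
  sampleGroupId: string; // ULID shared by every sample from one `test` invocation
  sampleIndex: number;   // 1-based position within the group (1..N)
  model: string;
}

// Group rows back into their test invocations, e.g. for the aggregated view.
function bySampleGroup(rows: RunRow[]): Map<string, RunRow[]> {
  const groups = new Map<string, RunRow[]>();
  for (const row of rows) {
    const list = groups.get(row.sampleGroupId) ?? [];
    list.push(row);
    groups.set(row.sampleGroupId, list);
  }
  return groups;
}
```

A legacy single-sample run is then just a group of size one with sampleIndex = 1, which keeps backward compatibility trivial.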

Aggregation Logic

For combining confusions/suggestions across samples:

  1. Semantic deduplication: Cluster similar confusions (e.g., "voice path unclear" ≈ "voice file path not specified")
  2. Frequency counting: Report what % of samples surfaced each confusion
  3. Confidence intervals: Show uncertainty bounds based on sample size
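Steps 2 and 3 could look roughly like the sketch below, assuming step 1 (semantic deduplication) has already mapped each raw confusion string to a canonical label. The type and function names are hypothetical:

```typescript
// One per-sample report after deduplication: canonical confusion labels only.
type SampleReport = { confusions: string[] };

function confusionFrequencies(samples: SampleReport[]) {
  const counts = new Map<string, number>();
  for (const s of samples) {
    // Count each canonical label at most once per sample.
    for (const label of new Set(s.confusions)) {
      counts.set(label, (counts.get(label) ?? 0) + 1);
    }
  }
  const n = samples.length;
  return [...counts.entries()]
    .map(([label, k]) => {
      const p = k / n;
      // One binomial standard error, matching the ± figures in the mock output.
      return { label, count: k, rate: p, se: Math.sqrt((p * (1 - p)) / n) };
    })
    .sort((a, b) => b.rate - a.rate);
}
```

Sorting by rate gives the "by frequency" ordering the summary output uses.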

Parallelization

  • Run samples in parallel where API rate limits allow
  • Add --parallel <n> flag to control concurrency (default: 3)
  • Show progress: [7/20 runs complete]
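A minimal worker-pool limiter is enough here; `limit` would come from the proposed --parallel flag. This is a sketch, not the project's runner:

```typescript
// Run async tasks with at most `limit` in flight; results keep task order.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  let done = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // safe: no await between the bounds check and increment
      results[i] = await tasks[i]();
      done++;
      console.log(`[${done}/${tasks.length} runs complete]`); // progress line
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, () => worker()),
  );
  return results;
}
```

Per-model rate limits could be layered on top by giving each model its own task queue.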

Cost Estimation

Update cost display to reflect total:

Estimated cost: 2 models × 10 samples × $0.002 = ~$0.04
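The arithmetic is trivial, but a single helper (hypothetical; the per-run price would come from existing per-model pricing data) keeps the display consistent across flag combinations:

```typescript
// Hypothetical helper: total estimated cost for a multi-sample invocation.
const estimatedCostUSD = (models: number, samples: number, perRunUSD: number) =>
  models * samples * perRunUSD;
```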

Use Cases

  1. Pre-ship validation: Run 20+ samples before releasing critical skill docs to catch rare failure modes
  2. A/B testing doc changes: Compare confusion rates between doc versions with statistical significance
  3. Model benchmarking: Characterize which models systematically struggle with your doc style
  4. Regression detection: Track if confusion rates change across skill doc iterations

Alternatives Considered

  1. Manual re-runs: User runs test command multiple times manually → tedious, no aggregation
  2. External scripting: Wrap CLI in bash loop → loses structured aggregation, database coherence
  3. Always multi-sample: Remove single-sample mode → too slow/expensive for iteration

Additional Context

The statistical foundation here is straightforward:

  • With n=10 samples, we can detect effects >30% with reasonable confidence
  • With n=30 samples, we approach the CLT threshold for normal approximations
  • Recommend documenting sample size guidance in README
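The sample-size guidance above follows from the normal approximation for a binomial proportion; a small helper (hypothetical, not in the codebase) makes the trade-off concrete:

```typescript
// n = z^2 * p * (1 - p) / e^2: normal-approximation sample size for a
// binomial proportion with base rate p and desired CI half-width e.
// z = 1.96 corresponds to 95% confidence.
function samplesNeeded(baseRate: number, halfWidth: number, z = 1.96): number {
  return Math.ceil((z * z * baseRate * (1 - baseRate)) / (halfWidth * halfWidth));
}
```

For a 50% base rate, pinning the rate to within ±30% at 95% confidence needs about 11 runs (hence n=10 targets effects larger than ~30%), while ±10% would need closer to 100.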

Labels: enhancement
