Overview and quickstart for pi-evalset-lab.

Extension package for fixed-task-set eval workflows in pi (`/evalset run|compare`), with reproducible JSON reports.
Primary category fit: Model & Prompt Management, Review & Quality Loops, UX & Observability, Safety & Governance.
- Install dependencies (if you add any):

  ```
  npm install
  ```

- Test with pi:

  ```
  pi -e ./extensions/evalset.ts
  ```

- Install the package into pi:

  ```
  pi install /absolute/path/to/pi-evalset-lab
  ```
This extension depends on pi host APIs and declares them as peerDependencies:

- `@mariozechner/pi-coding-agent`
- `@mariozechner/pi-ai`

In normal usage, pi provides these at runtime when loading the package.

The npm package also uses a `files` whitelist so required runtime artifacts are explicitly included:

- `extensions/evalset.ts`
- `prompts/`
- `examples/` (sample datasets + sample report UI)
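To make that packaging setup concrete, a minimal `package.json` fragment might look like the sketch below. This is illustrative only: the version ranges and the exact `files` entries are assumptions, not copied from the real manifest.

```json
{
  "name": "pi-evalset-lab",
  "files": [
    "extensions/evalset.ts",
    "prompts/",
    "examples/"
  ],
  "peerDependencies": {
    "@mariozechner/pi-coding-agent": "*",
    "@mariozechner/pi-ai": "*"
  }
}
```

Declaring the host packages as peerDependencies (rather than dependencies) is what lets pi supply them at runtime instead of bundling duplicates.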
Keyword slugs used for extension categorization:

- `ux-observability` (UX & Observability)
- `safety-governance` (Safety & Governance)
- `context-codebase-mapping` (Context & Codebase Mapping)
- `web-docs-retrieval` (Web & Docs Retrieval)
- `background-processes` (Background / Long-running Processes)
- `review-quality-loops` (Review & Quality Loops)
- `planning-orchestration` (Planning & Orchestration)
- `subagents-parallelization` (Subagents / Parallelization)
- `model-prompt-management` (Model & Prompt Management)
- `interactive-clis-editors` (Interactive CLIs / Editors)
- `skills-rules-packs` (Skills & Rules Packs)
- `paste-code-extraction` (Paste / Code Extraction)
This extension adds `/evalset` for fixed-task-set evaluation runs.

```
/evalset help
/evalset init [dataset-path] [--force]
/evalset run <dataset.json> [--system-file <path>] [--system-text <text>] [--variant <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
/evalset compare <dataset.json> <baseline-system.txt> <candidate-system.txt> [--baseline-name <name>] [--candidate-name <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
```

Note: `/evalset` is a pi slash command, not a shell executable.
Interactive mode:

```
pi -e ./extensions/evalset.ts
# then inside pi:
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
```

Non-interactive mode (scripts/CI):

```
pi -e ./extensions/evalset.ts -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
# or, if the extension is already installed/enabled:
pi -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
```

Interactive sessions use pi UI hooks (`ctx.ui`) for status/notify updates.
In non-interactive `-p` mode, those UI calls are safely skipped (`ctx.hasUI === false`).
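The UI-guard pattern described above can be sketched as follows. The `ctx` shape here is illustrative, not the exact pi host API: only the `hasUI`/`ui` distinction is taken from this document.

```typescript
// Sketch of guarding UI calls on ctx.hasUI (ctx shape is an assumption).
type EvalCtx = {
  hasUI: boolean;
  ui?: { notify: (message: string) => void };
};

// Only touch ctx.ui when a UI is attached; in `pi -p` mode hasUI is false
// and the call becomes a harmless no-op instead of crashing.
function notifyStatus(ctx: EvalCtx, message: string): string | null {
  if (!ctx.hasUI || !ctx.ui) return null; // safely skipped in -p mode
  ctx.ui.notify(message);
  return message;
}

const seen: string[] = [];
const interactive: EvalCtx = { hasUI: true, ui: { notify: (m) => seen.push(m) } };
const headless: EvalCtx = { hasUI: false };

notifyStatus(interactive, "case 1/3 done"); // delivered to the UI
notifyStatus(headless, "case 1/3 done");    // no-op, no crash
```

The same guard works for any optional host capability: check the capability flag once, then bail out early.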
Example invocations:

```
/evalset run examples/fixed-task-set.json --variant baseline
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
```

Bundled examples:

- `examples/fixed-task-set.json` — tiny smoke set (3 cases)
- `examples/fixed-task-set-v2.json` — larger first-pass set
- `examples/fixed-task-set-v3.json` — less brittle checks (recommended)
- `examples/evalset-compare-sample-embedded.html` — self-contained report UI with embedded compare JSON
- `examples/evalset-compare-sample.png` — screenshot preview of that HTML report
Preview: see `examples/evalset-compare-sample.png`.
The command writes JSON reports to:

- an explicit `--out <path>`, when provided
- otherwise `.evalset/reports/*.json` under your current project directory
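That output rule can be sketched as a small helper. The function name and timestamp format are assumptions: only the `--out`-wins-else-`.evalset/reports/` behavior and the `run-…`/`compare-…-YYYYMMDDTHHMMSS.json` naming visible in the sample report paths are taken from this document.

```typescript
// Illustrative sketch: honor --out when given, otherwise fall back to
// .evalset/reports/ with a kind-dataset-timestamp filename.
import { join } from "node:path";

function reportPath(
  kind: "run" | "compare",
  dataset: string,
  outFlag?: string,
  now = new Date(),
): string {
  if (outFlag) return outFlag; // explicit --out <path> wins
  // ISO timestamp compacted to YYYYMMDDTHHMMSS, e.g. 20240101T120000
  const stamp = now.toISOString().replace(/[-:]/g, "").slice(0, 15);
  return join(".evalset", "reports", `${kind}-${dataset}-${stamp}.json`);
}
```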
Each report includes run identity metadata:

- `runId`
- `datasetHash`
- `casesHash`
- `variantHash` (run) or baseline/candidate variant hashes (compare)

Session messages keep only lightweight report metadata (`reportPath`, ids, summary metrics), not full report bodies.
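For orientation, the lightweight metadata kept in session messages might look roughly like this. The field values and the `summary` shape are illustrative assumptions; only the field names listed above come from this document.

```json
{
  "reportPath": ".evalset/reports/run-fixed-task-set-20240101T120000.json",
  "runId": "…",
  "datasetHash": "…",
  "casesHash": "…",
  "variantHash": "…",
  "summary": { "cases": 3, "passed": 2, "failed": 1 }
}
```

Keeping only this stub in the session, and the full report body on disk, is what keeps long eval sessions from bloating the conversation context.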
Use the helper script to create a shareable standalone HTML file from any evalset JSON report:

```
npm run evalset:export-html -- --in .evalset/reports/compare-your-dataset-YYYYMMDDTHHMMSS.json
# optional:
npm run evalset:export-html -- --in .evalset/reports/run-your-dataset-YYYYMMDDTHHMMSS.json --out .evalset/reports/run-your-dataset.html --title "Evalset run report"
```

Script: `scripts/export-evalset-report-html.mjs`
This extension works today without core changes. If we decide to harden further, optional core support could include:

- Stable agent-level lineage IDs (`runId`/`traceId`) across extension events.
- Explicit reproducibility capability metadata in `pi-ai` (e.g. seed support and determinism caveats per provider/model).
- A shared canonical provider payload hash helper in `pi-ai`.
- A headless agent-eval API for tool-heavy/full agent-loop benchmark runs.
Run:

```
npm run check
```

This executes `scripts/validate-structure.sh`.
This scaffold defaults to release-please for a single-package release PR + tag flow (vX.Y.Z) and npm Trusted Publishing via OIDC.
Included files:
- CI workflow
- release-please workflow
- publish workflow
- Dependabot config
- CODEOWNERS
- release-please config
- release-please manifest
- Security policy
Before the first production release:

- Confirm/adjust owners in .github/CODEOWNERS.
- Enable branch protection on `main`.
- Configure npm Trusted Publishing for this repo + publish workflow.
- Merge the release PR from release-please, then publish from the GitHub release.
Included files:
- Bug report form
- Feature request form
- Docs request form
- Issue template config
- PR template
- Code of conduct
- Support guide
- Top-level contributing guide
Included files:
Default behavior:

- The PR workflow runs on `pull_request_target` (`opened`, `reopened`).
- `require-vouch: true` and `auto-close: true` are enabled by default.
- Maintainers can comment `vouch`, `denounce`, or `unvouch` on issues to update trust state.
- Vouch actions are SHA-pinned (`0e11a71bba23218a284d3ecca162e75a110fd7e3`) for reproducibility and supply-chain review.
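A workflow fragment using those defaults might look roughly like this. The action path is a placeholder and the job layout is an assumption; only the trigger events, the two inputs, and the pinned SHA come from this document.

```yaml
# Illustrative fragment only; <owner>/<vouch-action> is a placeholder.
on:
  pull_request_target:
    types: [opened, reopened]

jobs:
  vouch:
    runs-on: ubuntu-latest
    steps:
      - uses: <owner>/<vouch-action>@0e11a71bba23218a284d3ecca162e75a110fd7e3
        with:
          require-vouch: true
          auto-close: true
```

Pinning the action to a full commit SHA (rather than a tag) is what makes the supply-chain review meaningful: the reviewed code cannot be swapped out under a moving tag.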
Bootstrap step:
- Confirm/adjust entries in .github/VOUCHED.md before enforcing the production policy.
Run:

```
npm run docs:list
npm run docs:list:workspace
npm run docs:list:json
```

Wrapper script: `scripts/docs-list.sh`

Resolution order:

1. `DOCS_LIST_SCRIPT`
2. `./scripts/docs-list.mjs` (if vendored)
3. `~/ai-society/core/agent-scripts/scripts/docs-list.mjs`
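The resolution order above can be restated as a small sketch (the real logic lives in `scripts/docs-list.sh`; the function name here is illustrative):

```typescript
// Sketch of the documented three-step resolution order.
import { existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

function resolveDocsList(env: Record<string, string | undefined> = process.env): string {
  // 1. Explicit override via environment variable.
  if (env.DOCS_LIST_SCRIPT) return env.DOCS_LIST_SCRIPT;
  // 2. Vendored copy in the current repo.
  if (existsSync("./scripts/docs-list.mjs")) return "./scripts/docs-list.mjs";
  // 3. Shared default under the user's home directory.
  return join(homedir(), "ai-society/core/agent-scripts/scripts/docs-list.mjs");
}
```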
- Keep `.copier-answers.yml` committed.
- Do not edit `.copier-answers.yml` manually.
- Run from a clean destination repo (commit or stash pending changes first).
- Use `copier update --trust` when `.copier-answers.yml` includes `_commit` and update is supported.
- In non-interactive shells/CI, append `--defaults` to update/recopy.
- Use `copier recopy --trust` when update is unavailable (for example, a local non-VCS source) or cannot reconcile cleanly.
- After recopy, re-apply local deltas intentionally and run `npm run check`.
- Git uses `.githooks/pre-commit` (configured by scripts/install-hooks.sh).
- If `prek` is available, the hook runs `prek` using prek.toml.
- If `prek` is not available, the hook falls back to `scripts/validate-structure.sh`.
Install options for prek:

```
npm add -D @j178/prek
# or
npm install -g @j178/prek
```

- `.pi/extensions/startup-intake-router.ts` watches the first non-command message in a session.
- It converts your startup intent into a prefilled command: `/init-project-docs "<your intent>"`
- `.pi/prompts/init-project-docs.md` then drives the `interview` tool using docs/org/project-docs-intake.questions.json.
Utility commands:

- `/startup-intake-router-status`
- `/startup-intake-router-reset`
Use scripts/sync-to-live.sh to copy the package extension to `~/.pi/agent/extensions/`.

Optional flags:

- `--with-prompts`
- `--with-policy`
- `--all` (prompts + policy)

After sync, run `/reload` in pi.
