Add RunLedger replay gate for agent regressions #24
base: main
Thanks for taking a look! This PR adds a replay-only RunLedger gate. The workflow run is currently waiting on fork approval (action_required). If you are open to it, please approve/authorize the workflow run so CI can complete. Happy to adjust anything.
Hi @ZackMitchell910,
Pull request overview
This PR adds a RunLedger-based CI gate for detecting regressions in tool-using agent behavior. It introduces a deterministic, replay-only evaluation suite that uses pre-recorded tool interactions (cassettes) to validate agent outputs against JSON schemas and baseline metrics without making external calls.
Key changes:
- Adds a complete RunLedger evaluation suite with configuration, test case, and replay cassettes
- Implements a GitHub Actions workflow that runs on pull requests to enforce regression gates
- Includes baseline metrics file for pass rate, tool call budgets, and timing thresholds
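To make the replay-only idea concrete, here is a minimal sketch of how a harness could answer a tool call from a cassette instead of making an external call. This is not RunLedger's implementation: the cassette field names (`tool`, `args`, `result`) and the helpers `load_cassette` and `replay_tool_call` are assumptions made for illustration only.

```python
import json


def load_cassette(lines):
    """Index recorded tool results by (tool name, canonical JSON of args).

    The entry layout ("tool", "args", "result") is hypothetical.
    """
    recorded = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        entry = json.loads(line)
        key = (entry["tool"], json.dumps(entry["args"], sort_keys=True))
        recorded[key] = entry["result"]
    return recorded


def replay_tool_call(recorded, name, args):
    """Return the recorded result, or fail loudly if the call was never recorded."""
    key = (name, json.dumps(args, sort_keys=True))
    if key not in recorded:
        raise KeyError(f"no recorded result for {name} with args {args}")
    return recorded[key]


# One hypothetical cassette line for the password reset case.
CASSETTE_LINES = [
    '{"tool": "search_docs", "args": {"q": "reset password"}, '
    '"result": {"snippet": "Use the reset link."}}',
]
RECORDED = load_cassette(CASSETTE_LINES)
```

Failing on an unrecorded call, rather than falling through to a live request, is what keeps a run deterministic in this sketch.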
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| .github/workflows/runledger.yml | GitHub Actions workflow that runs RunLedger evals on PRs with artifact uploads |
| evals/runledger/suite.yaml | Suite configuration defining agent command, budgets, assertions, and regression thresholds |
| evals/runledger/schema.json | JSON schema for validating agent output structure (category and reply fields) |
| evals/runledger/agent/agent.py | Stub agent implementation that handles task input and produces formatted output |
| evals/runledger/cases/t1.yaml | Test case definition for triaging a password reset ticket |
| evals/runledger/cassettes/t1.jsonl | Pre-recorded tool response for replay mode (search_docs result) |
| baselines/runledger-demo.json | Baseline metrics snapshot for regression comparison (pass rate, tool calls, timing) |
| README.md | Documentation explaining the RunLedger CI gate and how to run it locally |
| .gitignore | Excludes runledger_out/ directory from version control |
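As a rough illustration of what a regression gate against the baseline snapshot might check, the sketch below compares a current run to baseline metrics for pass rate, tool calls, and timing. The metric keys (`pass_rate`, `tool_calls`, `wall_ms`), the threshold parameters, and `check_regression` itself are assumptions; the actual layout of `baselines/runledger-demo.json` and the thresholds in `suite.yaml` are not shown on this page.

```python
def check_regression(baseline, current, min_pass_rate=1.0,
                     max_tool_call_increase=0, max_wall_ms_increase_pct=50.0):
    """Return a list of human-readable failures; an empty list means the gate passes.

    All metric names and default thresholds here are hypothetical.
    """
    failures = []

    if current["pass_rate"] < min_pass_rate:
        failures.append(
            f"pass_rate {current['pass_rate']} is below {min_pass_rate}"
        )

    allowed_calls = baseline["tool_calls"] + max_tool_call_increase
    if current["tool_calls"] > allowed_calls:
        failures.append(
            f"tool_calls {current['tool_calls']} exceeds allowed {allowed_calls}"
        )

    allowed_ms = baseline["wall_ms"] * (1 + max_wall_ms_increase_pct / 100)
    if current["wall_ms"] > allowed_ms:
        failures.append(
            f"wall_ms {current['wall_ms']} exceeds allowed {allowed_ms}"
        )

    return failures


# Hypothetical baseline and a run that makes one extra tool call.
BASELINE = {"pass_rate": 1.0, "tool_calls": 1, "wall_ms": 200}
EXTRA_CALL_RUN = {"pass_rate": 1.0, "tool_calls": 2, "wall_ms": 210}
```

A CI step could exit non-zero whenever the returned list is non-empty, which is one simple way to turn these checks into a gate.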
        ticket = msg.get("input", {}).get("ticket", "")
        send({"type": "tool_call", "name": "search_docs", "call_id": "c1", "args": {"q": ticket}})
    elif msg.get("type") == "tool_result":
        send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
Copilot AI · Dec 23, 2025
The hardcoded output values in the agent ("category": "account", "reply": "Reset password instructions sent.") do not use any information from the tool result. The agent receives a search result with a snippet "Use the reset link." but ignores it completely. This creates a disconnect between the tool call and the output, making the tool call appear unnecessary.
-        send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
+        tool_output = msg.get("output")
+        details = ""
+        if tool_output is not None:
+            try:
+                details = " Details: " + json.dumps(tool_output)
+            except (TypeError, ValueError):
+                details = " Details: " + str(tool_output)
+        send(
+            {
+                "type": "final_output",
+                "output": {
+                    "category": "account",
+                    "reply": "Reset password instructions sent." + details,
+                },
+            }
+        )
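An alternative to appending the raw JSON of the tool result is to fold only the recorded snippet into the reply, which keeps `reply` a readable string. The sketch below is one way this might look, not the PR's code: the `task_start` message type, the `snippet` key inside the tool result, and the `handle` function (which returns the message instead of calling `send`) are assumptions for illustration.

```python
def handle(msg):
    """Map one incoming message to the agent's next outgoing message.

    Assumes a hypothetical "task_start" message carries the ticket and that
    the tool result arrives under "output" with an optional "snippet" key.
    """
    kind = msg.get("type")

    if kind == "task_start":
        ticket = msg.get("input", {}).get("ticket", "")
        return {
            "type": "tool_call",
            "name": "search_docs",
            "call_id": "c1",
            "args": {"q": ticket},
        }

    if kind == "tool_result":
        output = msg.get("output")
        # Only use the snippet when the result has the expected shape.
        snippet = output.get("snippet", "") if isinstance(output, dict) else ""
        reply = "Reset password instructions sent."
        if snippet:
            reply += " " + snippet
        return {
            "type": "final_output",
            "output": {"category": "account", "reply": reply},
        }

    return None  # ignore message types this stub does not know
```

With the recorded snippet "Use the reset link.", the reply would end with that sentence, so the tool call visibly influences the final output while the `category` and `reply` fields stay in place.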
description: triage a login ticket
input:
  ticket: reset password
cassette: cassettes/t1.jsonl
assertions:
  - type: required_fields
Copilot AI · Dec 23, 2025
The description field "triage a login ticket" contains a potential inaccuracy. The test case is about resetting a password, not triaging a login issue. Consider updating this to "triage a password reset ticket" or similar to better match the actual ticket content.
- description: triage a login ticket
+ description: triage a password reset ticket
  input:
    ticket: reset password
  cassette: cassettes/t1.jsonl
  assertions:
    - type: required_fields
Summary
runledger/Runledger@v0.1
runledger_out/
How to run locally
Notes