Add RunLedger replay gate for agent regressions #24

ZackMitchell910 · 2025-12-19T10:55:53Z

Summary

add a replay-only RunLedger eval suite (suite/case/schema/cassette + stub agent)
add a baseline file for regression gating
add a GitHub Actions workflow using runledger/Runledger@v0.1
add a small README note + ignore runledger_out/

How to run locally

runledger run evals/runledger --mode replay --baseline baselines/runledger-demo.json

Notes

no external calls; replay-only cassette
feel free to remove the suite/workflow if it is not desired
GitHub Actions note: workflows from first-time contributors/forks may require a maintainer to click “Approve and run” before checks will execute.

ZackMitchell910 · 2025-12-19T11:42:31Z

Thanks for taking a look! This PR adds a replay-only RunLedger gate. The workflow run is currently waiting on fork approval (action_required). If you are open to it, please approve/authorize the workflow run so CI can complete. Happy to adjust anything.

cryxnet · 2025-12-21T21:39:39Z

Hi @ZackMitchell910,
Thanks for the PR!
I'm going to review it in the coming days.

Copilot

Pull request overview

This PR adds a RunLedger-based CI gate for detecting regressions in tool-using agent behavior. It introduces a deterministic, replay-only evaluation suite that uses pre-recorded tool interactions (cassettes) to validate agent outputs against JSON schemas and baseline metrics without making external calls.
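To make the replay idea concrete, here is a minimal sketch of how a replay-only lookup could work. It is not RunLedger's implementation: the cassette field names (`tool`, `args`, `result`) and the helpers `load_cassette` and `replay` are hypothetical, chosen only to illustrate serving a recorded result instead of making an external call.

```python
import json
from typing import Any


def load_cassette(lines: list[str]) -> list[dict[str, Any]]:
    """Parse JSONL cassette text: one recorded tool interaction per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def replay(cassette: list[dict[str, Any]], name: str, args: dict[str, Any]) -> Any:
    """Return the recorded result for a tool call; fail rather than call out."""
    for entry in cassette:
        if entry["tool"] == name and entry["args"] == args:
            return entry["result"]
    # In replay-only mode an unrecorded call is an error, never a live request.
    raise LookupError(f"no recorded result for {name}({args!r})")


# Hypothetical cassette line; the field names are assumptions for this sketch.
CASSETTE_TEXT = [
    '{"tool": "search_docs", "args": {"q": "reset password"}, '
    '"result": {"snippet": "Use the reset link."}}',
]

cassette = load_cassette(CASSETTE_TEXT)
print(replay(cassette, "search_docs", {"q": "reset password"}))
```

Raising on an unrecorded call is a deliberate choice in this sketch: it keeps the run deterministic and surfaces any new tool call the agent starts making.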

Key changes:

Adds a complete RunLedger evaluation suite with configuration, test case, and replay cassettes
Implements a GitHub Actions workflow that runs on pull requests to enforce regression gates
Includes a baseline metrics file for pass rate, tool call budgets, and timing thresholds
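As a rough illustration of what gating on a baseline can mean, the sketch below compares current metrics with a stored baseline and lists any regressions. The `Metrics` fields, the `regressions` helper, and the default tolerances are assumptions made for this example; they do not reflect the actual layout of `baselines/runledger-demo.json` or RunLedger's thresholds.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    pass_rate: float  # fraction of cases that passed, 0.0 to 1.0
    tool_calls: int   # total tool calls across the suite
    wall_ms: float    # total wall-clock time in milliseconds


def regressions(
    baseline: Metrics,
    current: Metrics,
    max_extra_tool_calls: int = 0,      # assumed tolerance
    max_slowdown_ratio: float = 1.5,    # assumed tolerance
) -> list[str]:
    """Return a human-readable list of regressions; an empty list means the gate passes."""
    failures: list[str] = []
    if current.pass_rate < baseline.pass_rate:
        failures.append(
            f"pass rate dropped: {baseline.pass_rate:.2f} -> {current.pass_rate:.2f}"
        )
    if current.tool_calls > baseline.tool_calls + max_extra_tool_calls:
        failures.append(
            f"tool calls increased: {baseline.tool_calls} -> {current.tool_calls}"
        )
    if current.wall_ms > baseline.wall_ms * max_slowdown_ratio:
        failures.append(
            f"run slowed down: {baseline.wall_ms:.0f} ms -> {current.wall_ms:.0f} ms"
        )
    return failures


baseline = Metrics(pass_rate=1.0, tool_calls=1, wall_ms=100.0)
current = Metrics(pass_rate=1.0, tool_calls=2, wall_ms=120.0)
for failure in regressions(baseline, current):
    print(failure)
```

A CI job built on a check like this would exit non-zero whenever the returned list is non-empty.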

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.


| File | Description |
| --- | --- |
| `.github/workflows/runledger.yml` | GitHub Actions workflow that runs RunLedger evals on PRs with artifact uploads |
| `evals/runledger/suite.yaml` | Suite configuration defining agent command, budgets, assertions, and regression thresholds |
| `evals/runledger/schema.json` | JSON schema for validating agent output structure (category and reply fields) |
| `evals/runledger/agent/agent.py` | Stub agent implementation that handles task input and produces formatted output |
| `evals/runledger/cases/t1.yaml` | Test case definition for triaging a password reset ticket |
| `evals/runledger/cassettes/t1.jsonl` | Pre-recorded tool response for replay mode (search_docs result) |
| `baselines/runledger-demo.json` | Baseline metrics snapshot for regression comparison (pass rate, tool calls, timing) |
| `README.md` | Documentation explaining the RunLedger CI gate and how to run it locally |
| `.gitignore` | Excludes runledger_out/ directory from version control |
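The summary says the schema checks the `category` and `reply` fields of the agent output. A hand-rolled sketch of that kind of check follows. It assumes both fields are required strings, which the PR does not show, and `validate_output` is a hypothetical helper rather than anything RunLedger provides.

```python
from typing import Any

# Assumption: both fields are required and must be strings.
REQUIRED_STRING_FIELDS = ("category", "reply")


def validate_output(output: Any) -> list[str]:
    """Return a list of validation errors; an empty list means the output is valid."""
    if not isinstance(output, dict):
        return ["output must be a JSON object"]
    errors: list[str] = []
    for field in REQUIRED_STRING_FIELDS:
        if field not in output:
            errors.append(f"missing required field: {field}")
        elif not isinstance(output[field], str):
            errors.append(f"field must be a string: {field}")
    return errors


good = {"category": "account", "reply": "Reset password instructions sent."}
bad = {"category": "account"}
print(validate_output(good))
print(validate_output(bad))
```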


Copilot · 2025-12-23T08:59:41Z

evals/runledger/agent/agent.py

+            ticket = msg.get("input", {}).get("ticket", "")
+            send({"type": "tool_call", "name": "search_docs", "call_id": "c1", "args": {"q": ticket}})
+        elif msg.get("type") == "tool_result":
+            send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})


The hardcoded output values in the agent ("category": "account", "reply": "Reset password instructions sent.") do not use any information from the tool result. The agent receives a search result with a snippet "Use the reset link." but ignores it completely. This creates a disconnect between the tool call and the output, making the tool call appear unnecessary.

Suggested change

-            send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
+            tool_output = msg.get("output")
+            details = ""
+            if tool_output is not None:
+                try:
+                    details = " Details: " + json.dumps(tool_output)
+                except (TypeError, ValueError):
+                    details = " Details: " + str(tool_output)
+            send(
+                {
+                    "type": "final_output",
+                    "output": {
+                        "category": "account",
+                        "reply": "Reset password instructions sent." + details,
+                    },
+                }
+            )
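An alternative to appending the raw tool output is to pull out only the snippet text. The sketch below shows that approach as a pure message handler so it can be tested without stdin/stdout plumbing. Several things here are assumptions not confirmed by the diff: the `task_start` message type for the first branch, the `output` key and `snippet` field on the tool result, and the helper names `extract_snippet` and `handle`.

```python
from typing import Any, Optional


def extract_snippet(tool_output: Any) -> str:
    """Pull a text snippet out of a tool result, tolerating a dict or a list of dicts."""
    if isinstance(tool_output, dict):
        return str(tool_output.get("snippet", ""))
    if isinstance(tool_output, list) and tool_output:
        return extract_snippet(tool_output[0])
    return ""


def handle(msg: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Map one incoming message to the agent's next outgoing message, if any."""
    kind = msg.get("type")
    if kind == "task_start":  # assumed name; the diff does not show this branch's type
        ticket = msg.get("input", {}).get("ticket", "")
        return {
            "type": "tool_call",
            "name": "search_docs",
            "call_id": "c1",
            "args": {"q": ticket},
        }
    if kind == "tool_result":
        reply = "Reset password instructions sent."
        snippet = extract_snippet(msg.get("output"))
        if snippet:
            reply += " " + snippet  # fold the search result into the reply
        return {"type": "final_output", "output": {"category": "account", "reply": reply}}
    return None


print(handle({"type": "tool_result", "output": {"snippet": "Use the reset link."}}))
```

Either variant changes the `reply` text, so any assertion or baseline that pins the exact reply would likely need to be refreshed alongside it.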

Copilot · 2025-12-23T08:59:42Z

evals/runledger/cases/t1.yaml

+description: triage a login ticket
+input:
+  ticket: reset password
+cassette: cassettes/t1.jsonl
+assertions:
+- type: required_fields


The description "triage a login ticket" does not match the ticket content: the test case is about resetting a password, not triaging a login issue. Consider updating it to "triage a password reset ticket" or similar.

Suggested change

-description: triage a login ticket
+description: triage a password reset ticket
 input:
   ticket: reset password
 cassette: cassettes/t1.jsonl
 assertions:
 - type: required_fields

ZackMitchell910 added 2 commits December 19, 2025 03:50

Add RunLedger replay gate

d2d1a72

Normalize baseline paths and tidy agent stub

657ca6c

cryxnet requested a review from Copilot December 23, 2025 08:57

cryxnet self-assigned this Dec 23, 2025

Copilot started reviewing on behalf of cryxnet December 23, 2025 08:57
