@ZackMitchell910 ZackMitchell910 commented Dec 19, 2025

Summary

  • add a replay-only RunLedger eval suite (suite/case/schema/cassette + stub agent)
  • add a baseline file for regression gating
  • add a GitHub Actions workflow using runledger/Runledger@v0.1
  • add a small README note + ignore runledger_out/

How to run locally

runledger run evals/runledger --mode replay --baseline baselines/runledger-demo.json

Notes

  • no external calls; replay-only cassette
  • feel free to remove the suite/workflow if it is not desired
  • GitHub Actions note: workflows from first-time contributors/forks may require a maintainer to click “Approve and run” before checks will execute.

@ZackMitchell910
Author

Thanks for taking a look! This PR adds a replay-only RunLedger gate. The workflow run is currently waiting on fork approval (action_required). If you are open to it, please approve/authorize the workflow run so CI can complete. Happy to adjust anything.

@cryxnet
Owner

cryxnet commented Dec 21, 2025

Hi @ZackMitchell910,
Thanks for the PR!
I'm going to review it in the coming days.

Copilot AI left a comment

Pull request overview

This PR adds a RunLedger-based CI gate for detecting regressions in tool-using agent behavior. It introduces a deterministic, replay-only evaluation suite that uses pre-recorded tool interactions (cassettes) to validate agent outputs against JSON schemas and baseline metrics without making external calls.

Key changes:

  • Adds a complete RunLedger evaluation suite with configuration, test case, and replay cassettes
  • Implements a GitHub Actions workflow that runs on pull requests to enforce regression gates
  • Includes baseline metrics file for pass rate, tool call budgets, and timing thresholds
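
The baseline comparison described above can be pictured roughly as follows. This is a minimal sketch, not RunLedger's actual implementation: the metric names (`pass_rate`, `tool_calls`, `wall_ms`), the tolerance values, and the `check_regression` helper are all hypothetical and chosen only to illustrate the idea of gating on a stored snapshot.

```python
from typing import Dict, List


def check_regression(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_pass_rate_drop: float = 0.0,
    max_increase_pct: float = 0.10,
) -> List[str]:
    """Return regression messages; an empty list means the gate passes.

    Metric names and tolerances are assumptions for illustration only.
    """
    failures: List[str] = []

    # The pass rate may not fall below the baseline by more than the allowed drop.
    if current["pass_rate"] < baseline["pass_rate"] - max_pass_rate_drop:
        failures.append(
            f"pass_rate dropped: {baseline['pass_rate']} -> {current['pass_rate']}"
        )

    # Budget-style metrics may not grow beyond the allowed percentage.
    for key in ("tool_calls", "wall_ms"):
        limit = baseline[key] * (1 + max_increase_pct)
        if current[key] > limit:
            failures.append(f"{key} exceeded limit: {current[key]} > {limit}")

    return failures
```

A CI step built on this idea would exit non-zero whenever the returned list is non-empty, which is what turns the baseline file into a gate rather than a report.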

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

File                                Description
.github/workflows/runledger.yml     GitHub Actions workflow that runs RunLedger evals on PRs with artifact uploads
evals/runledger/suite.yaml          Suite configuration defining agent command, budgets, assertions, and regression thresholds
evals/runledger/schema.json         JSON schema for validating agent output structure (category and reply fields)
evals/runledger/agent/agent.py      Stub agent implementation that handles task input and produces formatted output
evals/runledger/cases/t1.yaml       Test case definition for triaging a password reset ticket
evals/runledger/cassettes/t1.jsonl  Pre-recorded tool response for replay mode (search_docs result)
baselines/runledger-demo.json       Baseline metrics snapshot for regression comparison (pass rate, tool calls, timing)
README.md                           Documentation explaining the RunLedger CI gate and how to run it locally
.gitignore                          Excludes runledger_out/ directory from version control
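
To make the replay-only idea concrete: a cassette such as evals/runledger/cassettes/t1.jsonl can be thought of as a lookup from recorded tool calls to recorded results. The sketch below is an assumption-laden illustration, not RunLedger's format or code. The record keys (`tool`, `args`, `result`) and the `Cassette` class are hypothetical; only the `search_docs` tool name and the "Use the reset link." snippet come from this PR.

```python
import json
from typing import Any, Dict, Iterable


class Cassette:
    """Replays recorded tool results keyed by tool name and arguments.

    The JSONL record layout (tool/args/result) is assumed for illustration.
    """

    def __init__(self, lines: Iterable[str]) -> None:
        self._entries: Dict[str, Any] = {}
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            record = json.loads(line)
            key = self._key(record["tool"], record["args"])
            self._entries[key] = record["result"]

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> str:
        # sort_keys makes the key independent of argument ordering.
        return json.dumps([tool, args], sort_keys=True)

    def replay(self, tool: str, args: Dict[str, Any]) -> Any:
        """Return the recorded result, or fail loudly on an unrecorded call."""
        key = self._key(tool, args)
        if key not in self._entries:
            raise KeyError(f"no recorded result for {tool} with args {args}")
        return self._entries[key]
```

Failing on an unrecorded call, rather than falling through to a live request, is what keeps a run deterministic and free of external calls.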

    ticket = msg.get("input", {}).get("ticket", "")
    send({"type": "tool_call", "name": "search_docs", "call_id": "c1", "args": {"q": ticket}})
elif msg.get("type") == "tool_result":
    send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
Copilot AI Dec 23, 2025

The hardcoded output values in the agent ("category": "account", "reply": "Reset password instructions sent.") do not use any information from the tool result. The agent receives a search result with a snippet "Use the reset link." but ignores it completely. This creates a disconnect between the tool call and the output, making the tool call appear unnecessary.

Suggested change
-     send({"type": "final_output", "output": {"category": "account", "reply": "Reset password instructions sent."}})
+     tool_output = msg.get("output")
+     details = ""
+     if tool_output is not None:
+         try:
+             details = " Details: " + json.dumps(tool_output)
+         except (TypeError, ValueError):
+             details = " Details: " + str(tool_output)
+     send(
+         {
+             "type": "final_output",
+             "output": {
+                 "category": "account",
+                 "reply": "Reset password instructions sent." + details,
+             },
+         }
+     )
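
An alternative to appending the raw JSON is to pull only the snippet text into the reply. The following is a minimal sketch under stated assumptions: the tool result is taken to be either a dict with a `snippet` key or a list of such dicts (the actual cassette layout is not shown in this review), and `build_reply` is a hypothetical helper, not part of the PR.

```python
from typing import Any

BASE_REPLY = "Reset password instructions sent."


def build_reply(tool_output: Any) -> str:
    """Append the first non-empty snippet from the tool result, if any.

    Assumes a dict with a "snippet" key, or a list of such dicts.
    Falls back to the fixed reply when no usable snippet is present.
    """
    hits = tool_output if isinstance(tool_output, list) else [tool_output]
    for hit in hits:
        if isinstance(hit, dict):
            snippet = hit.get("snippet")
            if isinstance(snippet, str) and snippet.strip():
                return f"{BASE_REPLY} {snippet.strip()}"
    return BASE_REPLY
```

With the recorded result mentioned in the comment, `build_reply({"snippet": "Use the reset link."})` yields "Reset password instructions sent. Use the reset link.", which keeps the reply readable while still depending on the tool call.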

Comment on lines +2 to +7
description: triage a login ticket
input:
  ticket: reset password
cassette: cassettes/t1.jsonl
assertions:
  - type: required_fields
Copilot AI Dec 23, 2025

The description field "triage a login ticket" does not match the test input: the ticket is a password reset request, not a general login issue. Consider updating it to "triage a password reset ticket" or similar so the description matches the actual ticket content.

Suggested change
- description: triage a login ticket
- input:
-   ticket: reset password
- cassette: cassettes/t1.jsonl
- assertions:
-   - type: required_fields
+ description: triage a password reset ticket
+ input:
+   ticket: reset password
+ cassette: cassettes/t1.jsonl
+ assertions:
+   - type: required_fields
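
For readers unfamiliar with the `required_fields` assertion type used in this case, the check it names can be sketched as below. This is an assumption about its behavior, not RunLedger's implementation: the `missing_required_fields` helper is hypothetical, and the field names `category` and `reply` are taken from the schema.json description in the review summary.

```python
from typing import Any, Dict, Iterable, List


def missing_required_fields(output: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """Return the required field names that are absent from the agent output."""
    return [name for name in required if name not in output]


# Example with the stub agent's final output from this PR.
final_output = {"category": "account", "reply": "Reset password instructions sent."}
missing = missing_required_fields(final_output, ["category", "reply"])
```

An empty `missing` list would mean the assertion passes. Note that a check of this kind only verifies that the keys are present, so it is independent of the description text being corrected here.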

2 participants