Define --full as the composition of --all, --max-steps 100, and --auto-respond. Also fix a few reward parsing bugs. #252

pimpale · 2025-12-23T20:44:09Z

Note

CLI and config updates

Replace full with all in config (hud/cli/eval.py) and add --all flag; --full now sets --all, --auto-respond, and --max-steps 100.
auto_respond is now an explicit boolean flag (no implicit default via --full); max_steps defaults to 10.
Task selection uses all for multi-task runs; display updated to show all and explicit max_steps.

Reward handling changes

Reward is computed in EvalContext.__aexit__ from evaluate tools; runners stop manually setting ctx.reward (hud/datasets/runner.py).
Telemetry exit uses self.reward directly (hud/eval/context.py).

Tool result structure

Environment returns MCPToolResult with structuredContent for both local and remote tool calls (hud/environment/environment.py).
Improve reward parse logging to print result.structuredContent (hud/agents/base.py).

^{Written by Cursor Bugbot for commit 2b60eb4. This will update automatically on new commits. Configure here.}

hud/cli/eval.py

pimpale added 5 commits December 23, 2025 02:41

don't automatically turn on auto respond

1f2c653

more stuff

d29fe88

fix

1fe6a47

remove log errors

4d180a9

fix ruff issue

8d028d9

cursor bot reviewed Dec 23, 2025

View reviewed changes

hud/cli/eval.py Outdated Show resolved Hide resolved

pimpale added 2 commits December 23, 2025 21:05

fix test failures

6b15fdd

all should be independent of max steps or auto respond

2b60eb4

cursor bot reviewed Dec 23, 2025

View reviewed changes

hud/cli/eval.py Show resolved Hide resolved

pimpale merged commit 016db63 into main Dec 23, 2025
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Define --full as the composition of --all, --max-steps 100, and --auto-respond. Also fix a few reward parsing bugs. #252

Define --full as the composition of --all, --max-steps 100, and --auto-respond. Also fix a few reward parsing bugs. #252

Uh oh!

pimpale commented Dec 23, 2025 •

edited by cursor bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Define --full as the composition of --all, --max-steps 100, and --auto-respond. Also fix a few reward parsing bugs. #252

Define --full as the composition of --all, --max-steps 100, and --auto-respond. Also fix a few reward parsing bugs. #252

Uh oh!

Conversation

pimpale commented Dec 23, 2025 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pimpale commented Dec 23, 2025 •

edited by cursor bot

Loading