Implement monitor rubrics #653

mikasenghaas · 2025-12-21T19:15:22Z

Description

This PR implements MonitorRubric, a streamlined abstraction for exporting information from the state for logging, in a declarative style. You specify state keys (plus arbitrary transforms), which get registered as zero-weight reward functions. For example, one may take the length of the trajectory field to log num_turns (implemented in MultiTurnMonitorRubric), or average the command execution times in sandboxes from sandbox_state.command_execution_times as avg_command_execution_time (implemented in SandboxMonitorRubric).
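
A minimal sketch of the idea, not the PR's actual implementation: each (state_key, metric_name, transform) triple, the tuple shape that appears in the MultiTurnMonitorRubric diff, is turned into a named function with weight 0. The helper make_monitor_funcs and the plain-dict state are assumptions made for illustration only.

```python
from statistics import mean
from typing import Any, Callable

State = dict[str, Any]
StateKey = tuple[str, str, Callable[[Any], float]]


def make_monitor_funcs(state_keys: list[StateKey]) -> list[tuple[Callable[[State], float], float]]:
    """Hypothetical helper: turn (key, name, transform) triples into (function, weight) pairs."""
    funcs = []
    for key, name, transform in state_keys:
        # Bind the loop variables as defaults so each closure keeps its own key and transform.
        def metric(state: State, _key: str = key, _transform=transform) -> float:
            return float(_transform(state[_key]))

        metric.__name__ = name  # the value is logged under this name
        funcs.append((metric, 0.0))  # zero weight: logged, never added to the reward
    return funcs


# Toy state; the real state object in the library is richer than a dict.
state = {
    "trajectory": ["turn 1", "turn 2", "turn 3"],
    "sandbox_state": {"command_execution_times": [0.5, 1.5]},
}
monitors = make_monitor_funcs([
    ("trajectory", "num_turns", len),
    ("sandbox_state", "avg_command_execution_time",
     lambda s: mean(s["command_execution_times"])),
])
logged = {func.__name__: func(state) for func, _ in monitors}
reward = sum(weight * func(state) for func, weight in monitors)
```

Because every weight is zero, the monitors contribute to the logged metrics but leave the weighted reward unchanged.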

In addition, this PR implements the add_rubric method on Environment, which composes the currently registered rubric with an additional one using RubricGroup. This can be used to register "default" monitor rubrics (e.g. always export num_turns in MultiTurnEnv) across environment hierarchies.
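
To illustrate how defaults could accumulate along a class hierarchy, here is a hedged sketch using stand-in classes rather than the real verifiers ones. The add_rubric body mirrors the diff quoted in the review below; the metric_names method and the names-only Rubric are hypothetical simplifications.

```python
class Rubric:
    """Stand-in rubric that only remembers the metric names it would report."""

    def __init__(self, names=None):
        self.names = list(names or [])

    def metric_names(self):
        return list(self.names)


class RubricGroup(Rubric):
    """Stand-in group that reports the metrics of all member rubrics in order."""

    def __init__(self, rubrics):
        super().__init__()
        self.rubrics = list(rubrics)

    def metric_names(self):
        return [name for rubric in self.rubrics for name in rubric.metric_names()]


class Environment:
    def __init__(self, rubric=None):
        self.rubric = rubric

    def add_rubric(self, rubric):
        # Same branching as the diff: adopt the rubric, or wrap both in a group.
        if self.rubric is None:
            self.rubric = rubric
        else:
            self.rubric = RubricGroup([self.rubric, rubric])


class MultiTurnEnv(Environment):
    def __init__(self, rubric=None):
        super().__init__(rubric)
        self.add_rubric(Rubric(["num_turns"]))  # default monitor at this level


class ToolEnv(MultiTurnEnv):
    def __init__(self, rubric=None):
        super().__init__(rubric)
        self.add_rubric(Rubric(["total_tool_calls"]))  # extra monitor for tool use


env = ToolEnv(rubric=Rubric(["correct_answer"]))
names = env.rubric.metric_names()
```

Each subclass only declares the monitor it introduces; the user-supplied rubric and the inherited monitors are composed by the base class.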

Example

uv run vf-eval math-python -n1 -r1 -v

We see the following registered monitors from the respective environments:

MultiTurnEnv: num_turns
ToolEnv: total_tool_calls
SandboxEnv: sandbox_ready_wait_time, sandbox_command_execution_time
PythonEnv: python_ready_wait_time

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

willccbb · 2025-12-21T21:45:22Z

verifiers/envs/multiturn_env.py

 logger = logging.getLogger(__name__)


+class MultiTurnMonitorRubric(MonitorRubric):


small nit -- what's the need for a distinct MultiTurnMonitorRubric class? All env classes currently extend MultiTurnEnv already, and build up the trajectory (even if single-turn)

This risks other extensions (e.g. SandboxMonitorRubric) not being multi-turn compatible by default, which seems undesirable.

willccbb · 2025-12-21T21:47:48Z

verifiers/envs/sandbox_env.py

+    command_execution_times: list[float]
+
+
+class SandboxMonitorRubric(vf.MonitorRubric):


Is this serving a different role than MultiTurnMonitorRubric? is the idea to have many monitor rubrics which track new things introduced at different hierarchies?

willccbb · 2025-12-21T21:49:31Z

Semi-related -- #645 adds an option to bypass scoring, which currently would skip all Rubric classes. Do we want to treat MonitorRubric instances as fundamentally separate in the case of a RubricGroup?

willccbb · 2025-12-22T04:37:20Z

verifiers/envs/environment.py

+        if self.rubric is None:
+            self.rubric = rubric
+        else:
+            self.rubric = vf.RubricGroup(rubrics=[self.rubric, rubric])


if self.rubric is already a RubricGroup, maybe we should do self.rubric.add_rubric ?
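
A sketch of what this suggestion could look like, using stand-in classes. It assumes RubricGroup exposes an add_rubric method that appends to its member list, which is taken from the comment above and not verified against the library.

```python
class Rubric:
    """Stand-in rubric identified only by a label."""

    def __init__(self, label=""):
        self.label = label


class RubricGroup(Rubric):
    """Stand-in group; add_rubric is assumed to append to the member list."""

    def __init__(self, rubrics):
        super().__init__("group")
        self.rubrics = list(rubrics)

    def add_rubric(self, rubric):
        self.rubrics.append(rubric)


class Environment:
    def __init__(self, rubric=None):
        self.rubric = rubric

    def add_rubric(self, rubric):
        if self.rubric is None:
            self.rubric = rubric
        elif isinstance(self.rubric, RubricGroup):
            # Extend the existing group instead of nesting a group inside a group.
            self.rubric.add_rubric(rubric)
        else:
            self.rubric = RubricGroup([self.rubric, rubric])


env = Environment(Rubric("task"))
env.add_rubric(Rubric("multi_turn_monitor"))
env.add_rubric(Rubric("sandbox_monitor"))
labels = [r.label for r in env.rubric.rubrics]
```

One trade-off to keep in mind: appending mutates a RubricGroup that the caller may have passed in and still holds a reference to, whereas always wrapping leaves the original object untouched at the cost of nested groups.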

willccbb · 2025-12-22T04:38:29Z

verifiers/envs/multiturn_env.py


+class MultiTurnMonitorRubric(MonitorRubric):
+    def __init__(self):
+        super().__init__(state_keys=[("trajectory", "num_turns", len)])


this feels a bit restrictive/unintuitive + not totally seeing the value of the state_keys approach.

we can already do:

async def num_turns(state) -> float: return len(state['trajectory'])

which is fairly concise + avoids the need for a new pattern

could also be worth adding add_metric to Rubric which behaves just like add_reward_func but with default weight 0?
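
For comparison, a rough sketch of the add_metric alternative on a stand-in Rubric class. Only add_reward_func and the num_turns function come from the comment above; the score method, the exact_match function, and the dict-shaped state are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable

State = dict[str, Any]
RewardFunc = Callable[[State], Awaitable[float]]


class Rubric:
    """Stand-in rubric holding async functions and their weights."""

    def __init__(self) -> None:
        self.funcs: list[RewardFunc] = []
        self.weights: list[float] = []

    def add_reward_func(self, func: RewardFunc, weight: float = 1.0) -> None:
        self.funcs.append(func)
        self.weights.append(weight)

    def add_metric(self, func: RewardFunc, weight: float = 0.0) -> None:
        # Same registration path as a reward function, but the weight defaults to 0.
        self.add_reward_func(func, weight=weight)

    async def score(self, state: State) -> dict[str, float]:
        values = await asyncio.gather(*(func(state) for func in self.funcs))
        metrics = {func.__name__: float(v) for func, v in zip(self.funcs, values)}
        metrics["reward"] = sum(w * v for w, v in zip(self.weights, values))
        return metrics


async def exact_match(state: State) -> float:
    return 1.0 if state["answer"] == state["target"] else 0.0


async def num_turns(state: State) -> float:
    return float(len(state["trajectory"]))


rubric = Rubric()
rubric.add_reward_func(exact_match)
rubric.add_metric(num_turns)
result = asyncio.run(
    rubric.score({"answer": "4", "target": "4", "trajectory": ["user", "assistant"]})
)
```

Here num_turns shows up in the logged metrics but, with weight 0, does not change the reward.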

mikasenghaas added 8 commits December 21, 2025 17:33

implement monitor rubric

4bda90d

add to __init__

dfa1bc0

allow arbitrary transforms

fb5b97f

allow adding rubric

7128ba2

make monitor rubric with 3 diff args

ccdb539

add monitor metrics in common classes

42cba23

integrate with math python

58ba11b

fix tests

1d34438

mikasenghaas requested a review from willccbb December 21, 2025 19:20

