Skip to content

Conversation

@mikasenghaas
Copy link
Member

@mikasenghaas mikasenghaas commented Dec 21, 2025

Description

This PR implements the MonitorRubric a streamlined abstraction to export information from the state for logging purposes using declarative style. This is done by allowing to specify state keys (+arbitrary transforms) which will get registered as (zero-weight) reward functions. For example, one may count the length of the trajectory field to log num_turns (implemented in MultiTurnMonitorRubric) or average the command execution timeouts in sandboxes from the sandbox_state.command_execution_times as avg_command_execution_time (implemented in SandboxMonitorRubric).

In addition, implements the add_rubric method on Environment which composes the currently registered rubrics with an additional one using RubricGroup. This can be used to register "default" monitor rubrics (e.g. always export num_turns in MultiTurnEnv) across environment hierarchies.

Example

uv run vf-eval math-python -n1 -r1 -v

We see the following registered monitors from the respective environments:

  • MultiTurnEnv: num_turns
  • ToolEnv: total_tool_calls
  • SandboxEnv: sandbox_ready_wait_time, sandbox_command_execution_time
  • PythonEnv: python_ready_wait_time
Screenshot 2025-12-21 at 8 07 46 PM

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

logger = logging.getLogger(__name__)


class MultiTurnMonitorRubric(MonitorRubric):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit -- what's the need for a distinct MultiTurnMonitorRubric class? All env classes currently extend MultiTurnEnv already, and build up the trajectory (even if single-turn)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risks other extensions (e.g. SandboxMonitorRubric) not being multi-turn compatible by default, which seems undesirable.

command_execution_times: list[float]


class SandboxMonitorRubric(vf.MonitorRubric):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this serving a different role than MultiTurnMonitorRubric? is the idea to have many monitor rubrics which track new things introduced at different hierarchies?

@willccbb
Copy link
Member

Semi-related -- #645 adds an option to bypass scoring, which currently would skip all Rubric classes. Do we want to treat MonitorRubric instances as fundamentally separate in the case of a RubricGroup?

if self.rubric is None:
self.rubric = rubric
else:
self.rubric = vf.RubricGroup(rubrics=[self.rubric, rubric])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if self.rubric is already a RubricGroup, maybe we should do self.rubric.add_rubric ?


class MultiTurnMonitorRubric(MonitorRubric):
def __init__(self):
super().__init__(state_keys=[("trajectory", "num_turns", len)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels a bit restrictive/unintuitive + not totally seeing the value of the state_keys approach.

we can already do:

async def num_turns(state) -> float:
    return len(state['trajectory'])

which is fairly concise + avoids the need for a new pattern

could also be worth adding add_metric to Rubric which behaves just like add_reward_func but with default weight 0?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants