Description
Problem Statement
Strands Evals currently focuses on evaluating agent behavior through trajectories (sequences of actions/tool calls) and outputs (final responses). However, it cannot verify the actual side effects that agents produce in their environment - such as changes to databases, file systems, API states, or other external systems.
Example
When an agent performs actions that modify external state (e.g., creates a file, updates a database record, sends an email), evaluators can only see:
- What the agent said it did (output)
- What tools it called (trajectory)
But evaluators cannot verify:
- Whether the action actually succeeded
- What the final state of the environment looks like
- Whether side effects match expectations
Proposed Solution
Phase 1: Environment State Callback Pattern
Concept: Add an optional environment_state_callback to the task function that captures environment state after task execution. This requires minimal API changes and is backward compatible.
- Task functions return {"output": ..., "trajectory": ..., "environment_states": {...}}
- The environment_states field contains arbitrary data representing the post-execution environment
- Evaluators receive this state in EvaluationData and can verify it against expectations
- Cases can include expected_environment_state for comparison
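A minimal sketch of how the Phase 1 pattern could look, written without importing Strands Evals so it runs on its own. The names `run_task`, `capture_file_state`, and `state_matches` are hypothetical stand-ins, not existing APIs, and the "agent" is simulated by a direct file write.

```python
import tempfile
from pathlib import Path


def capture_file_state(workdir: Path) -> dict:
    """Hypothetical callback: snapshot file names and contents under workdir."""
    return {
        str(p.relative_to(workdir)): p.read_text()
        for p in sorted(workdir.rglob("*"))
        if p.is_file()
    }


def run_task(case_input: str, workdir: Path, environment_state_callback=None) -> dict:
    """Stand-in task function; a real one would invoke the agent here."""
    # Simulated side effect of the agent's "write_file" tool call.
    (workdir / "notes.txt").write_text(case_input)
    result = {"output": "Created notes.txt", "trajectory": ["write_file"]}
    # The key is only added when a callback is supplied, so existing
    # task functions keep returning the same shape as before.
    if environment_state_callback is not None:
        result["environment_states"] = {"files": environment_state_callback(workdir)}
    return result


def state_matches(actual: dict, expected: dict) -> bool:
    """Hypothetical evaluator check: every expected key must match exactly."""
    return all(actual.get(key) == value for key, value in expected.items())


with tempfile.TemporaryDirectory() as tmp:
    result = run_task("hello", Path(tmp), environment_state_callback=capture_file_state)

expected_environment_state = {"files": {"notes.txt": "hello"}}
passed = state_matches(result["environment_states"], expected_environment_state)
```

Here the evaluator checks the file that actually exists on disk rather than trusting the agent's output string or the tool name in the trajectory.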
Phase 2: Environment State Extractors that retrieve default states from different clients
- Provide built-in extractors for common scenarios
- Allow custom evaluators to access environment states by key
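One possible shape for Phase 2, sketched with SQLite from the standard library as the example client. `Extractor`, `sqlite_table_extractor`, and `collect_environment_states` are hypothetical names chosen for illustration; the issue does not specify an extractor interface.

```python
import sqlite3
from typing import Any, Callable, Dict

# Hypothetical extractor type: takes no arguments, returns a snapshot.
Extractor = Callable[[], Any]


def sqlite_table_extractor(conn: sqlite3.Connection, table: str) -> Extractor:
    """Build an extractor that snapshots all rows of one table."""
    def extract() -> list:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        return conn.execute(f"SELECT * FROM {quoted} ORDER BY 1").fetchall()
    return extract


def collect_environment_states(extractors: Dict[str, Extractor]) -> Dict[str, Any]:
    """Run every extractor and store its snapshot under its key."""
    return {key: extract() for key, extract in extractors.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
# Stands in for a database write performed by the agent.
conn.execute("INSERT INTO users (name) VALUES ('ada')")

environment_states = collect_environment_states(
    {"users_table": sqlite_table_extractor(conn, "users")}
)
users = environment_states["users_table"]
```

A custom evaluator would then look up only the key it cares about, for example `environment_states["users_table"]`, and compare it with the expected rows.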
Use Case
- As a robotics engineer, I want to evaluate whether my agent successfully moved objects to target locations, so I can verify task completion beyond just checking if the agent called the right movement commands.
- As a database application developer, I want to verify that my agent correctly created, updated, or deleted database records, so I can ensure data integrity beyond just checking if SQL commands were executed.
- As a workflow orchestration developer, I want to verify the final state of a multi-step process (e.g., all artifacts present, correct configuration applied), so I can ensure end-to-end success rather than just individual step completion.
...
Exit Criteria
Being able to run a SWE-bench-like test case with this repo. We should be able to check out a repo at a specific commit, run the agent to fix a bug present at that commit, and then run the unit tests to confirm whether the agent actually fixed the bug.
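A rough sketch of the exit-criteria flow, under heavy simplifications: a temporary directory with a deliberately buggy module stands in for "a repo checked out at a specific commit", and `fake_agent_fix` stands in for the agent run. Both are hypothetical placeholders. Only the last step, running the unit tests and recording the result as environment state, is done for real.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_unit_tests(repo_dir: Path) -> dict:
    """Run the repo's unittest suite and return the outcome as environment state."""
    proc = subprocess.run(
        # -B avoids writing .pyc files, so a later run never loads stale bytecode.
        [sys.executable, "-B", "-m", "unittest", "discover", "-s", str(repo_dir)],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return {"tests_passed": proc.returncode == 0, "returncode": proc.returncode}


def fake_agent_fix(repo_dir: Path) -> None:
    """Stand-in for the agent run: rewrites the buggy function."""
    (repo_dir / "calc.py").write_text("def add(a, b):\n    return a + b\n")


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    # Fixture standing in for the repo at the buggy commit.
    (repo / "calc.py").write_text("def add(a, b):\n    return a - b\n")
    (repo / "test_calc.py").write_text(
        "import unittest\n"
        "from calc import add\n\n"
        "class TestAdd(unittest.TestCase):\n"
        "    def test_add(self):\n"
        "        self.assertEqual(add(2, 3), 5)\n"
    )
    before = run_unit_tests(repo)  # suite fails on the buggy code
    fake_agent_fix(repo)
    after = run_unit_tests(repo)   # suite passes once the bug is fixed
```

An evaluator could then assert on `after["tests_passed"]` instead of inferring success from the agent's final message.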
Alternative Solutions
No response
Additional Context
No response