[EPIC] Add Environment_State Validation #110

@poshinchen

Description

Problem Statement

Strands Evals currently evaluates agent behavior through trajectories (sequences of actions and tool calls) and outputs (final responses). It cannot verify the actual side effects that agents produce in their environment, such as changes to databases, file systems, API states, or other external systems.

Example

When an agent performs actions that modify external state (e.g., creates a file, updates a database record, sends an email), evaluators can only see:

  • What the agent said it did (output)
  • What tools it called (trajectory)

But evaluators cannot verify:

  • Whether the action actually succeeded
  • What the final state of the environment looks like
  • Whether side effects match expectations

Proposed Solution

Phase 1: Environment State Callback Pattern

Concept: Add an optional environment_state_callback to the task function that captures environment state after task execution. This requires minimal API changes and is backward compatible.

  • Task functions return {"output": ..., "trajectory": ..., "environment_states": {...}}
  • This field contains arbitrary data representing the post-execution environment
  • Evaluators receive this state in EvaluationData and can verify it against expectations
  • Cases can include expected_environment_state for comparison
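
The Phase 1 bullets above could be sketched roughly as follows. This is plain Python with no Strands Evals imports; `run_task`, `capture_files`, `state_matches`, and the callback signature are hypothetical names chosen for illustration, and the "agent" is a stand-in that writes one file. Only the returned dictionary keys come from the proposal.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable, Dict, Optional

# Hypothetical callback type: receives the working directory, returns a snapshot.
StateCallback = Callable[[Path], Dict[str, Any]]


def capture_files(workdir: Path) -> Dict[str, Any]:
    """Hypothetical callback: map each file name in workdir to its text content."""
    return {p.name: p.read_text() for p in sorted(workdir.iterdir()) if p.is_file()}


def run_task(
    prompt: str,
    workdir: Path,
    environment_state_callback: Optional[StateCallback] = None,
) -> Dict[str, Any]:
    """Stand-in task function; a real one would invoke the agent with the prompt."""
    (workdir / "notes.txt").write_text("hello")  # the side effect to be verified
    result: Dict[str, Any] = {
        "output": "I created notes.txt",
        "trajectory": ["write_file"],
    }
    # The callback is optional, so existing task functions keep working unchanged.
    if environment_state_callback is not None:
        result["environment_states"] = environment_state_callback(workdir)
    return result


def state_matches(actual: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Hypothetical evaluator check: every expected key must match the captured state."""
    return all(actual.get(key) == value for key, value in expected.items())
```

An evaluator built this way compares the captured state with a case's expected_environment_state instead of trusting the agent's own report. Checking only the expected keys is a design choice in this sketch; a stricter evaluator could require exact equality.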

Phase 2: Environment State Extractors that capture default states from different clients

  • Provide built-in extractors for common scenarios
  • Allow custom evaluators to access environment states by key
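
One way the Phase 2 idea might look, as a sketch only: each extractor is a zero-argument callable registered under a key, and an evaluator reads the collected states by that key. The names `sqlite_extractor` and `collect_states` are hypothetical, and SQLite is used here simply because it ships with Python; the issue does not say which clients would be supported.

```python
import sqlite3
from typing import Any, Callable, Dict, List, Tuple


def sqlite_extractor(conn: sqlite3.Connection) -> Dict[str, List[Tuple]]:
    """Hypothetical built-in extractor: return all rows of every table, keyed by table name."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names come from the database catalog itself, so quoting them is sufficient here.
    return {name: conn.execute(f'SELECT * FROM "{name}"').fetchall() for name in names}


def collect_states(extractors: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Run each registered extractor and store its result under the extractor's key."""
    return {key: extract() for key, extract in extractors.items()}
```

A custom evaluator could then look up `states["db"]["users"]` and compare the rows against the expected records, without knowing how the state was captured.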

Use Case

  • As a robotics engineer, I want to evaluate whether my agent successfully moved objects to target locations, so I can verify task completion beyond just checking if the agent called the right movement commands.

  • As a database application developer, I want to verify that my agent correctly created, updated, or deleted database records, so I can ensure data integrity beyond just checking if SQL commands were executed.

  • As a workflow orchestration developer, I want to verify the final state of a multi-step process (e.g., all artifacts present, correct configuration applied), so I can ensure end-to-end success rather than just individual step completion.
    ...

Exit Criteria

We can run a SWEBench-like test case with this repo: check out a repository at a specific commit, run the agent to fix a bug present at that commit, and then run the unit tests to confirm whether the agent actually fixed the bug.

Alternative Solutions

No response

Additional Context

No response

Metadata

Labels

enhancement (New feature or request)
