[EPIC] Add Environment_State Validation

### Problem Statement

Strands Evals currently focuses on evaluating agent behavior through trajectories (sequences of actions/tool calls) and outputs (final responses). However, it cannot verify the actual side effects that agents produce in their environment - such as changes to databases, file systems, API states, or other external systems.

### Example
When an agent performs actions that modify external state (e.g., creates a file, updates a database record, sends an email), evaluators can only see:
- What the agent said it did (output)
- What tools it called (trajectory)

But evaluators **cannot** verify:
- Whether the action actually succeeded
- What the final state of the environment looks like
- Whether side effects match expectations

### Proposed Solution

### Phase 1: Environment State Callback Pattern
Concept: Add an optional environment_state_callback to the task function that captures environment state after task execution. This requires minimal API changes and is backward compatible.

- Task functions return {"output": ..., "trajectory": ..., "environment_states": {...}}
- The environment_states field contains arbitrary data representing the post-execution environment
- Evaluators receive this state in EvaluationData and can verify it against expectations
- Cases can include expected_environment_state for comparison
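
The callback pattern above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the Strands Evals API: `EnvCase`, `run_task`, `capture_state`, and `state_matches` are hypothetical names, and the agent is stubbed with a plain file write so the example is self-contained.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, Optional

# Hypothetical callback type: takes no arguments, returns a state snapshot.
StateCallback = Callable[[], Dict[str, Any]]


@dataclass
class EnvCase:
    """Hypothetical test case carrying an expected post-execution state."""
    input: str
    expected_environment_state: Dict[str, Any] = field(default_factory=dict)


def run_task(
    case: EnvCase,
    workdir: Path,
    environment_state_callback: Optional[StateCallback] = None,
) -> Dict[str, Any]:
    """Stand-in task function; the 'agent' is stubbed with a direct file write."""
    (workdir / "report.txt").write_text("done")  # side effect to verify
    result: Dict[str, Any] = {
        "output": "I created report.txt",
        "trajectory": ["write_file"],
    }
    # The extra key is only added when a callback is supplied, so existing
    # task functions and evaluators keep working unchanged.
    if environment_state_callback is not None:
        result["environment_states"] = environment_state_callback()
    return result


def state_matches(actual: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Pass when every expected key exists in the actual state with an equal value."""
    return all(k in actual and actual[k] == v for k, v in expected.items())


case = EnvCase(
    input="Create report.txt containing 'done'",
    expected_environment_state={"files": ["report.txt"], "report_contents": "done"},
)

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)

    def capture_state() -> Dict[str, Any]:
        # Inspect the environment itself instead of trusting the agent's output.
        report = workdir / "report.txt"
        return {
            "files": sorted(p.name for p in workdir.iterdir()),
            "report_contents": report.read_text() if report.exists() else None,
        }

    result = run_task(case, workdir, environment_state_callback=capture_state)
    legacy = run_task(case, workdir)  # no callback: same shape as today

passed = state_matches(result["environment_states"], case.expected_environment_state)
print(passed)
```

The comparison here is a subset match, so a case only needs to list the parts of the state it cares about. Whether the real evaluator should use exact equality instead is an open design choice.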


### Phase 2: Environment State extractors to get the default states from different clients
- Provide built-in extractors for common scenarios
- Allow custom evaluators to access environment states by key
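
One possible shape for such extractors is sketched below, assuming each extractor is a zero-argument callable registered under a key. The names `filesystem_extractor`, `sqlite_extractor`, and `collect_states` are hypothetical and do not exist in Strands Evals. Only the standard library is used.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict

# Hypothetical extractor type: a zero-argument callable returning a snapshot.
Extractor = Callable[[], Any]


def filesystem_extractor(root: Path) -> Extractor:
    """Snapshot the relative paths of all files under root."""
    def extract() -> Any:
        return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())
    return extract


def sqlite_extractor(conn: sqlite3.Connection, table: str) -> Extractor:
    """Snapshot all rows of one table, ordered by the first column."""
    def extract() -> Any:
        # The table name is interpolated into SQL, so pass trusted identifiers only.
        return conn.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    return extract


def collect_states(extractors: Dict[str, Extractor]) -> Dict[str, Any]:
    """Run every extractor and store its result under the extractor's key."""
    return {key: extract() for key, extract in extractors.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")  # stands in for an agent's write

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "out.json").write_text("{}")  # stands in for an agent-created file
    states = collect_states({
        "files": filesystem_extractor(root),
        "users": sqlite_extractor(conn, "users"),
    })
conn.close()

# A custom evaluator can then look up just the state it needs by key.
print(states["users"])
```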

### Use Case

- As a robotics engineer, I want to evaluate whether my agent successfully moved objects to target locations, so I can verify task completion beyond just checking if the agent called the right movement commands.

- As a database application developer, I want to verify that my agent correctly created, updated, or deleted database records, so I can ensure data integrity beyond just checking if SQL commands were executed.

- As a workflow orchestration developer, I want to verify the final state of a multi-step process (e.g., all artifacts present, correct configuration applied), so I can ensure end-to-end success rather than just individual step completion.
...


### Exit Criteria

We should be able to run a SWE-bench-like test case with this repo: check out a repository at a specific commit, run the agent to fix a bug present at that commit, and then run the unit tests to confirm whether the agent actually fixed the bug.
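
As a rough sketch of that flow, assuming the unit-test result is reported as the environment state: `checkout_commit`, `run_unit_tests`, `evaluate_bug_fix`, and `stub_agent` are hypothetical helpers, not part of Strands Evals. The demo skips the `git checkout` step and uses a temporary directory with a one-line bug so that it runs without a real repository.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Optional


def checkout_commit(repo: Path, commit: str) -> None:
    """Put the working tree at the commit that contains the bug."""
    subprocess.run(["git", "-C", str(repo), "checkout", "--quiet", commit], check=True)


def run_unit_tests(repo: Path) -> Dict[str, Any]:
    """Run the repo's unittest suite and report the outcome as environment state."""
    proc = subprocess.run(
        # -B avoids writing bytecode caches that could go stale between runs.
        [sys.executable, "-B", "-m", "unittest", "discover", "-s", str(repo)],
        capture_output=True, text=True, cwd=repo,
    )
    return {"tests_passed": proc.returncode == 0, "test_log": proc.stderr[-2000:]}


def evaluate_bug_fix(
    repo: Path, agent: Callable[[Path], str], commit: Optional[str] = None
) -> Dict[str, Any]:
    """Check out the commit (if given), run the agent, then capture test results."""
    if commit is not None:
        checkout_commit(repo, commit)
    output = agent(repo)
    return {"output": output, "environment_states": run_unit_tests(repo)}


def stub_agent(repo: Path) -> str:
    """Stand-in for a real agent: rewrites the buggy function."""
    (repo / "calc.py").write_text("def add(a, b):\n    return a + b\n")
    return "Fixed add() in calc.py"


TEST_SOURCE = (
    "import unittest\n"
    "from calc import add\n\n"
    "class AddTest(unittest.TestCase):\n"
    "    def test_add(self):\n"
    "        self.assertEqual(add(2, 3), 5)\n"
)

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "calc.py").write_text("def add(a, b):\n    return a - b\n")  # the bug
    (repo / "test_calc.py").write_text(TEST_SOURCE)
    before = run_unit_tests(repo)                  # tests fail on the buggy code
    result = evaluate_bug_fix(repo, stub_agent)    # commit=None: no git in the demo

print(before["tests_passed"], result["environment_states"]["tests_passed"])
```

An evaluator would then pass the case only when `tests_passed` is true, regardless of what the agent claims in its output.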


### Alternative Solutions

_No response_

### Additional Context

_No response_


[EPIC] Add Environment_State Validation #110
