[EPIC] Add Definitive Evaluation / assertions #109

@poshinchen

Problem Statement

Strands Evals currently requires LLM invocations for ALL evaluations, even for simple assertions that should be deterministic. This feature request proposes adding definitive (deterministic, non-LLM) evaluation capabilities to complement existing LLM-based evaluators, enabling fast, reliable, and cost-free assertions for structural and factual checks.

Possible definitive evaluators / methods

Basic

  • Equals
  • Contains
  • MaxDuration / MinDuration
  • HasMatchingSpan
  • StartsWith
  • IsInstance

Span-level

  • HasSpan(name) - Check if span with name exists
  • HasAttributes
  • HasSpanWithAttributes(name, attributes) - Span with specific attributes
  • HasMatchingSpan(query) - Complex query-based check
  • SpanSequence
  • HasErrorFlag

Tool-level

  • ToolCalled
  • ToolSucceeded
  • ToolResultContains
  • ToolResultEquals

Generic

  • Custom: definitive callback

...
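To make the list above concrete, here is a minimal sketch of how a few of these checks could be expressed as plain predicates with no model call. Everything in it is hypothetical: `Span`, `RunRecord`, and the lowercase helper functions are stand-ins invented for illustration, not the Strands Evals API or the proposed class names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Span:
    # Hypothetical, simplified trace span; not the Strands Evals span type.
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass
class RunRecord:
    # Hypothetical container for what a single agent run produced.
    output: Any
    duration_seconds: float
    spans: list[Span] = field(default_factory=list)


# A definitive check is any callable mapping a run record to pass/fail.
Check = Callable[[RunRecord], bool]


def equals(value: Any) -> Check:
    return lambda run: run.output == value


def contains(value: str) -> Check:
    return lambda run: isinstance(run.output, str) and value in run.output


def max_duration(seconds: float) -> Check:
    return lambda run: run.duration_seconds <= seconds


def has_span_with_attributes(name: str, attributes: dict[str, Any]) -> Check:
    def check(run: RunRecord) -> bool:
        # Pass if at least one span has the name and every requested attribute.
        return any(
            span.name == name
            and all(k in span.attributes and span.attributes[k] == v
                    for k, v in attributes.items())
            for span in run.spans
        )
    return check


def custom(callback: Check) -> Check:
    # "Custom: definitive callback" is just a user-supplied predicate.
    return callback


run = RunRecord(
    output="The capital of France is Paris.",
    duration_seconds=1.2,
    spans=[Span("calculator", {"status": "ok"})],
)
checks = {
    "equals": equals("Paris"),
    "contains": contains("Paris"),
    "max_duration": max_duration(5.0),
    "span": has_span_with_attributes("calculator", {"status": "ok"}),
    "custom": custom(lambda r: len(r.spans) == 1),
}
results = {name: check(run) for name, check in checks.items()}
```

Because each check is a pure function of the recorded run, the same record always yields the same result, which is the property that distinguishes these checks from LLM-based evaluators.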

Proposed Solution

Solution 1: Add definitive matching method as part of evaluators

The implementation will look like:

evaluators = [
    # Definitive checks (fast, free, reliable)
    Equals(value=expected_output),
    HasSpanWithName(name="calculator"),
    MaxDuration(seconds=5.0),
    Contains(value="Paris"),
    
    # LLM-based checks (quality assessment)
    HelpfulnessEvaluator(),
    OutputEvaluator(rubric="Assess response quality"),
]

Solution 2: Separate assertion phase

Add a separate assertions field. The scoring and failure model still needs to be worked out (for example, evaluators run only when all definitive assertions pass).

assertions = [
    Equals(value=expected_output),
    HasSpanWithName(name="calculator"),
    MaxDuration(seconds=5.0),
]
evaluators = [
    HelpfulnessEvaluator(),
    OutputEvaluator(rubric="Assess response quality"),
]
experiment = Experiment(
    cases=test_cases,
    assertions=assertions,
    evaluators=evaluators
)
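One possible failure model for the separate assertion phase is sketched below, assuming the rule from the parenthetical above: evaluators run only when every assertion passes. The names `run_case` and `CaseResult` and the stand-in evaluator are hypothetical and exist only for this illustration; the real `Experiment` behavior is still an open question in this proposal.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Assertion = Callable[[str], bool]   # deterministic pass/fail check
Evaluator = Callable[[str], float]  # stand-in for a costly LLM-based scorer


@dataclass
class CaseResult:
    passed_assertions: bool
    failed: list[str]
    scores: Optional[dict[str, float]]  # None when evaluators were skipped


def run_case(
    output: str,
    assertions: dict[str, Assertion],
    evaluators: dict[str, Evaluator],
) -> CaseResult:
    # Phase 1: run every assertion so all failures are reported at once.
    failed = [name for name, check in assertions.items() if not check(output)]
    if failed:
        # Phase 2 is skipped entirely, so no evaluator cost is incurred.
        return CaseResult(False, failed, None)
    scores = {name: evaluate(output) for name, evaluate in evaluators.items()}
    return CaseResult(True, [], scores)


calls: list[str] = []


def fake_llm_evaluator(output: str) -> float:
    # Records each invocation so skipped runs are observable.
    calls.append(output)
    return 0.9


assertions = {
    "contains_paris": lambda out: "Paris" in out,
    "short": lambda out: len(out) < 100,
}
evaluators = {"helpfulness": fake_llm_evaluator}

ok = run_case("Paris is the capital of France.", assertions, evaluators)
bad = run_case("I do not know.", assertions, evaluators)
```

In this sketch a failed assertion produces no scores at all rather than a score of zero. Whether a gated case should count as a failure, a zero, or be excluded from aggregate scores is exactly the design decision the proposal leaves open.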

Use Case

  • As a developer testing my agent locally, I want instant feedback when my agent fails basic checks without LLM costs

  • As a developer ensuring trace quality, I want to verify execution structure and metadata to catch instrumentation and configuration issues

  • I want fast, cheap health checks so that I can continuously verify basic functionality

Alternative Solutions

Users can currently write their own custom evaluators that perform definitive assertions.

Labels: enhancement (New feature or request)
