Description
Problem Statement
Strands Evals currently requires LLM invocations for ALL evaluations, even for simple assertions that should be deterministic. This feature request proposes adding definitive (deterministic, non-LLM) evaluation capabilities to complement existing LLM-based evaluators, enabling fast, reliable, and cost-free assertions for structural and factual checks.
Possible definitive evaluators / methods
Basic
- Equals
- Contains
- MaxDuration / MinDuration
- HasMatchingSpan
- StartsWith
- IsInstance
Span-level
- HasSpan(name) - Check if span with name exists
- HasAttributes
- HasSpanWithAttributes(name, attributes) - Span with specific attributes
- HasMatchingSpan(query) - Complex query-based check
- SpanSequence
- HasErrorFlag
Tool-level
- ToolCalled
- ToolSucceeded
- ToolResultContains
- ToolResultEquals
Generic
- Custom: definitive callback
...
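To make a few of the checks listed above concrete, here is a minimal sketch in plain Python. It does not use the Strands Evals API: the `Span` class and every function name below are hypothetical stand-ins, and a real trace span would carry more fields.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """Hypothetical, simplified trace span (assumption, not a Strands type)."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)


def equals(output: Any, expected: Any) -> bool:
    """Equals: exact match against the expected value."""
    return output == expected


def contains(output: str, value: str) -> bool:
    """Contains: substring check on the output text."""
    return value in output


def has_span(spans: list[Span], name: str) -> bool:
    """HasSpan(name): at least one span carries the given name."""
    return any(s.name == name for s in spans)


def has_span_with_attributes(
    spans: list[Span], name: str, attributes: dict[str, Any]
) -> bool:
    """HasSpanWithAttributes: a span with this name has all listed attributes."""
    return any(
        s.name == name
        and all(k in s.attributes and s.attributes[k] == v for k, v in attributes.items())
        for s in spans
    )


def span_sequence(spans: list[Span], names: list[str]) -> bool:
    """SpanSequence: names occur in this order, not necessarily adjacent."""
    remaining = iter(s.name for s in spans)
    # Each any() consumes the iterator up to the match, which enforces ordering.
    return all(any(n == seen for seen in remaining) for n in names)
```

Every check is a pure function of the output or the recorded spans, so it gives the same answer on every run and needs no model call.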
Proposed Solution
Solution 1: Add definitive matching method as part of evaluators
The implementation will look like:
evaluators = [
    # Definitive checks (fast, free, reliable)
    Equals(value=expected_output),
    HasSpanWithName(name="calculator"),
    MaxDuration(seconds=5.0),
    Contains(value="Paris"),
    # LLM-based checks (quality assessment)
    HelpfulnessEvaluator(),
    OutputEvaluator(rubric="Assess response quality"),
]
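For Solution 1 to work, definitive checks and LLM-based evaluators would need to share one interface so the runner can treat them the same way. A rough sketch of that idea follows. The `Result` shape, the `evaluate` method, and `StubLLMEvaluator` are all assumptions made for illustration, not the existing Strands Evals interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Result:
    score: float
    passed: bool
    reason: str


class Evaluator(Protocol):
    """Hypothetical shared interface for both kinds of evaluator."""
    def evaluate(self, output: str) -> Result: ...


@dataclass
class Contains:
    """Definitive check: the score is always exactly 1.0 or 0.0."""
    value: str

    def evaluate(self, output: str) -> Result:
        ok = self.value in output
        status = "found" if ok else "not found"
        return Result(1.0 if ok else 0.0, ok, f"{self.value!r} {status} in output")


@dataclass
class StubLLMEvaluator:
    """Stand-in for an LLM-based evaluator; a real one would call a model."""
    fixed_score: float = 0.8

    def evaluate(self, output: str) -> Result:
        return Result(self.fixed_score, self.fixed_score >= 0.5, "stubbed judgment")


def run_all(evaluators: list[Evaluator], output: str) -> list[Result]:
    """Run every evaluator the same way, whatever its kind."""
    return [e.evaluate(output) for e in evaluators]
```

With this shape a definitive check is just an evaluator whose score is binary, which keeps reporting uniform but leaves open how binary and graded scores should be aggregated.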
Solution 2: Separate Assertion Phase
This adds a separate assertions field. The score and failure model still needs thought (for instance, evaluators run only when all definitive assertions pass).
assertions = [
    Equals(value=expected_output),
    HasSpanWithName(name="calculator"),
    MaxDuration(seconds=5.0),
]
evaluators = [
    HelpfulnessEvaluator(),
    OutputEvaluator(rubric="Assess response quality"),
]
experiment = Experiment(
    cases=test_cases,
    assertions=assertions,
    evaluators=evaluators,
)
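One possible failure model for the separate assertion phase is a simple gate: run the cheap assertions first and skip the LLM evaluators when any of them fails. The sketch below is self-contained and illustrative only. `run_case`, the result dictionary, and the stub evaluator are hypothetical names, not part of Strands Evals.

```python
from typing import Callable, Optional, TypedDict

Assertion = Callable[[str], bool]
Evaluator = Callable[[str], float]


class CaseResult(TypedDict):
    passed: bool
    failed_assertions: list[str]
    scores: Optional[list[float]]


def run_case(
    output: str, assertions: list[Assertion], evaluators: list[Evaluator]
) -> CaseResult:
    """Run definitive assertions first; call evaluators only if all pass."""
    failed = [a.__name__ for a in assertions if not a(output)]
    if failed:
        # Short-circuit: no evaluator (and so no LLM call) runs for this case.
        return {"passed": False, "failed_assertions": failed, "scores": None}
    return {
        "passed": True,
        "failed_assertions": [],
        "scores": [e(output) for e in evaluators],
    }


def mentions_paris(output: str) -> bool:
    return "Paris" in output


def is_short(output: str) -> bool:
    return len(output) <= 200


llm_calls: list[str] = []


def stub_helpfulness(output: str) -> float:
    """Stand-in for an LLM evaluator; records each call instead of calling a model."""
    llm_calls.append(output)
    return 0.9


bad = run_case("I don't know.", [mentions_paris, is_short], [stub_helpfulness])
good = run_case("Paris is the capital.", [mentions_paris, is_short], [stub_helpfulness])
```

Here `bad` reports `mentions_paris` as the failed assertion and the stub evaluator is never called for it, while `good` carries the evaluator score. Whether a gated case should count as a zero score or be reported separately is the open question noted above.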
Use Case
- As a developer testing my agent locally, I want instant feedback when my agent fails basic checks without LLM costs
- As a developer ensuring trace quality, I want to verify execution structure and metadata to catch instrumentation and configuration issues
- I want fast, cheap health checks so that I can continuously verify basic functionality
Alternative Solutions
Users can currently set up their own custom evaluator with definitive assertions.