feat: add `ContextualFaithfulnessEvaluator` by stefanoamorelli · Pull Request #64 · strands-agents/evals

stefanoamorelli · 2025-12-07T15:31:59Z

Description

Adds ContextualFaithfulnessEvaluator for RAG systems to detect hallucinations by validating whether response claims are grounded in retrieval context. This differs from the existing FaithfulnessEvaluator, which checks against conversation history rather than retrieved documents.

The evaluator uses a 4-tier scoring system mapped to numeric values:

Not Faithful (0.0)
Partially Faithful (0.33)
Mostly Faithful (0.67)
Fully Faithful (1.0)
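
The tier-to-score mapping above can be sketched as a small lookup. This is an illustration only: `FaithfulnessTier`, `TIER_SCORES`, and `score_for` are hypothetical names, not identifiers taken from this PR, and the evaluator's actual internals may differ.

```python
from enum import Enum


class FaithfulnessTier(str, Enum):
    """Hypothetical labels for the four tiers listed in the description."""

    NOT_FAITHFUL = "Not Faithful"
    PARTIALLY_FAITHFUL = "Partially Faithful"
    MOSTLY_FAITHFUL = "Mostly Faithful"
    FULLY_FAITHFUL = "Fully Faithful"


# Numeric values as stated in the PR description.
TIER_SCORES = {
    FaithfulnessTier.NOT_FAITHFUL: 0.0,
    FaithfulnessTier.PARTIALLY_FAITHFUL: 0.33,
    FaithfulnessTier.MOSTLY_FAITHFUL: 0.67,
    FaithfulnessTier.FULLY_FAITHFUL: 1.0,
}


def score_for(label: str) -> float:
    """Map a tier label to its score; an unknown label raises ValueError."""
    return TIER_SCORES[FaithfulnessTier(label)]
```

Restricting the judge's answer to a closed set of labels keeps the numeric score deterministic once the tier is chosen.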

Also adds an optional retrieval_context field to Case and EvaluationData for passing retrieved data through the evaluation pipeline.
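
A minimal sketch of what an optional field of this kind looks like on a test case. The `Case` class below is a simplified stand-in written for illustration, not the library's class, and the `list[str]` type (one string per retrieved document) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Case:
    """Simplified stand-in for a test case; not the library's definition."""

    input: Any
    # Assumed shape: one string per retrieved document. Defaults to None
    # so cases that do not involve retrieval stay valid.
    retrieval_context: Optional[list[str]] = None


# A RAG case carries the documents the response should be grounded in.
rag_case = Case(
    input="What is the refund window?",
    retrieval_context=["Refunds are accepted within 30 days of purchase."],
)

# A non-RAG case simply omits the field.
plain_case = Case(input="Hello")
```

Because the field is optional, existing test cases would not need to change.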

Related Issues

#65

Documentation PR

Type of Change

New feature

Testing

New unit tests introduced

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

The retrieval_context field on EvaluationData stores documents from vector stores or retrieval systems, enabling RAG evaluation workflows where responses need validation against source context. [1]

[1]: https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base-evaluate.html

Defines a 4-tier rating scale from Not Faithful to Fully Faithful, with guidance on evaluating factual claims against retrieval context. Based on faithfulness metrics from RAG evaluation literature. [1]

[1]: https://arxiv.org/abs/2309.01431
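
To illustrate how a judge prompt of this kind might combine the retrieved documents with the response, here is a rough sketch. The template wording, `PROMPT_TEMPLATE`, and `format_prompt` are all hypothetical and are not the prompt shipped in this PR. Rejecting an empty context is likewise an assumption about reasonable input validation.

```python
from typing import Optional, Sequence

# Hypothetical template; the real prompt in the PR is not reproduced here.
PROMPT_TEMPLATE = (
    "Rate how faithful the response is to the retrieval context.\n"
    "Choose one of: Not Faithful, Partially Faithful, "
    "Mostly Faithful, Fully Faithful.\n\n"
    "Retrieval context:\n{context}\n\n"
    "Response:\n{response}\n"
)


def format_prompt(response: str, retrieval_context: Optional[Sequence[str]]) -> str:
    """Build the judge prompt; fail early when there is nothing to check against."""
    if not retrieval_context:
        raise ValueError("retrieval_context is required for contextual faithfulness")
    # Number each document so the judge can refer to specific sources.
    numbered = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieval_context, start=1)
    )
    return PROMPT_TEMPLATE.format(context=numbered, response=response)
```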

Validates whether response claims are grounded in retrieval context, designed specifically for RAG systems. Uses structured output with faithfulness tiers mapped to scores [0.0, 0.33, 0.67, 1.0]. This differs from FaithfulnessEvaluator, which checks conversation history rather than retrieved documents. [1]

[1]: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/

stefanoamorelli added 9 commits December 7, 2025 17:12

feat(case): add retrieval_context field for RAG test cases

500da30

Mirrors the EvaluationData field so users can provide retrieved documents when defining test cases.

feat(experiment): propagate retrieval_context to EvaluationData

d92508f

Pass retrieval_context from Case through to EvaluationData in both sync and async paths so evaluators can access it.
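
The propagation described in this commit can be sketched as follows. `Case`, `EvaluationData`, `build_evaluation_data`, `run_sync`, and `run_async` are simplified, hypothetical stand-ins. The real experiment code is not shown on this page, so this only illustrates copying the field in one shared place used by both paths.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


@dataclass
class Case:
    """Stand-in test case (hypothetical, simplified)."""

    input: Any
    retrieval_context: Optional[list[str]] = None


@dataclass
class EvaluationData:
    """Stand-in for the data handed to evaluators (hypothetical, simplified)."""

    input: Any
    actual_output: Any
    retrieval_context: Optional[list[str]] = None


def build_evaluation_data(case: Case, actual_output: Any) -> EvaluationData:
    # Single place where the field is copied, so both paths stay in step.
    return EvaluationData(
        input=case.input,
        actual_output=actual_output,
        retrieval_context=case.retrieval_context,
    )


def run_sync(case: Case, task: Callable[[Any], Any]) -> EvaluationData:
    """Synchronous path: run the task, then attach the case's context."""
    return build_evaluation_data(case, task(case.input))


async def run_async(
    case: Case, task: Callable[[Any], Awaitable[Any]]
) -> EvaluationData:
    """Asynchronous path: await the task, then attach the case's context."""
    return build_evaluation_data(case, await task(case.input))
```

Routing both paths through one helper is a design choice made for this sketch; it avoids the field being forwarded in one path and forgotten in the other.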

feat(evaluators): export ContextualFaithfulnessEvaluator

419e20d

test(case,experiment): add retrieval_context field coverage

828e1c6

Tests the new field in Case and updates experiment serialization assertions to include retrieval_context.

test(evaluators): add ContextualFaithfulnessEvaluator tests

bf69f0c

Covers initialization, score mapping across all faithfulness tiers, input validation, prompt formatting behavior, async evaluation, and serialization.

docs: add ContextualFaithfulnessEvaluator example

e86b560

Shows RAG evaluation with retrieval_context and updates the available evaluators list to include ContextualFaithfulnessEvaluator and HarmfulnessEvaluator.

stefanoamorelli had a problem deploying to manual-approval December 7, 2025 15:32 — with GitHub Actions Failure

stefanoamorelli changed the title ~~feat(types): add retrieval_context field to EvaluationData~~ feat: add ContextualFaithfulnessEvaluator Dec 7, 2025

stefanoamorelli marked this pull request as ready for review December 7, 2025 15:46

stefanoamorelli wants to merge 9 commits into strands-agents:main from stefanoamorelli:feature/contextual-faithfulness-evaluator
