fix(experiments): move evaluations to root experiment span #1497
Disclaimer: Experimental PR review
Greptile Overview
Greptile Summary
This PR refactors the experiment evaluation workflow to ensure that evaluators run within the root experiment span context, enabling proper tracing and attribution of evaluation activities.
Key Changes:
- Evaluator Context Placement: Moved the evaluator execution code (lines 2894-2988) from outside the span context into the `with self.start_as_current_span(name=span_name) as span:` block. This ensures evaluators are associated with the experiment span.
- Propagated Attributes Refactoring: Extracted `PropagatedExperimentAttributes` into a variable (`propagated_experiment_attributes`) that can be reused across multiple context managers, including during task execution and evaluator runs.
- Wrapped Evaluator Calls: Added `_propagate_attributes` context managers around both regular evaluator calls (lines 2906-2916) and composite evaluator calls (lines 2943-2977), ensuring experiment attributes are properly propagated through the evaluation chain. A sketch of the resulting control flow follows the Impact section below.
- Behavior Preservation:
Impact:
This change improves observability by ensuring evaluation traces are properly nested under the experiment span, making it easier to analyze experiment runs in the Langfuse UI.
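The overall shape of the change can be illustrated with a minimal, self-contained sketch. The names (`start_as_current_span`, `propagate_attributes`, `propagated_experiment_attributes`) mirror those in the PR, but the context managers here are toy stand-ins, not the Langfuse SDK's actual implementations or signatures:

```python
from contextlib import contextmanager

# Toy stand-ins for the SDK's span and attribute-propagation context
# managers -- names mirror the PR, behavior is illustrative only.
@contextmanager
def start_as_current_span(name):
    print(f"enter span: {name}")
    try:
        yield {"name": name}
    finally:
        print(f"exit span: {name}")

@contextmanager
def propagate_attributes(attrs):
    print(f"propagating attributes: {attrs}")
    yield

def process_experiment_item(item, task, evaluators):
    with start_as_current_span(name="experiment-item-run") as span:
        # Built once and reused for the task run and every evaluator run
        # (mirrors propagated_experiment_attributes in the diff).
        propagated_experiment_attributes = {"experiment_id": item["experiment_id"]}

        with propagate_attributes(propagated_experiment_attributes):
            output = task(item["input"])

        # The change this PR makes: evaluators run INSIDE the span context,
        # so their traces nest under the root experiment span.
        for evaluator in evaluators:
            with propagate_attributes(propagated_experiment_attributes):
                score = evaluator(output, item["expected_output"])
            print(f"score recorded under span '{span['name']}': {score}")

process_experiment_item(
    {"experiment_id": "exp-1", "input": "2+2", "expected_output": "4"},
    task=lambda question: "4",
    evaluators=[lambda out, expected: float(out == expected)],
)
```

Running the sketch shows the "exit span" line printing only after all scores are recorded, which is exactly the nesting property the refactor establishes.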
Confidence Score: 5/5
Sequence Diagram
```mermaid
sequenceDiagram
    participant Caller
    participant _process_experiment_item
    participant Span as Span Context
    participant Task as _run_task
    participant Evaluator as _run_evaluator
    participant API as create_score

    Caller->>_process_experiment_item: Process experiment item
    _process_experiment_item->>Span: Enter span context
    activate Span
    Note over _process_experiment_item: Try block starts
    _process_experiment_item->>_process_experiment_item: Extract input_data, expected_output
    _process_experiment_item->>_process_experiment_item: Create propagated_experiment_attributes
    _process_experiment_item->>Task: Run task with _propagate_attributes
    activate Task
    Task-->>_process_experiment_item: Return output
    deactivate Task
    _process_experiment_item->>Span: Update span with input/output
    Note over _process_experiment_item: Try block succeeds
    Note over _process_experiment_item,Evaluator: Evaluators run INSIDE span context (NEW)
    loop For each evaluator
        _process_experiment_item->>Evaluator: Run evaluator with _propagate_attributes
        activate Evaluator
        Evaluator-->>_process_experiment_item: Return eval_results
        deactivate Evaluator
        loop For each evaluation
            _process_experiment_item->>API: create_score(trace_id, observation_id, ...)
        end
    end
    alt If composite_evaluator exists
        _process_experiment_item->>Evaluator: Run composite evaluator with _propagate_attributes
        activate Evaluator
        Evaluator-->>_process_experiment_item: Return composite results
        deactivate Evaluator
        loop For each composite evaluation
            _process_experiment_item->>API: create_score(trace_id, observation_id, ...)
        end
    end
    _process_experiment_item->>Span: Exit span context
    deactivate Span
    _process_experiment_item-->>Caller: Return ExperimentItemResult
```
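The `create_score(trace_id, observation_id, ...)` step in the diagram is what ties each evaluation back to the experiment span. A toy sketch of that attribution, with a stand-in `create_score` (the real API takes more parameters than shown):

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    name: str
    value: float

# Stand-in for the create_score call in the diagram; only the ids
# relevant to attribution are modeled here.
def create_score(*, trace_id, observation_id, name, value):
    print(f"score {name}={value} -> trace={trace_id}, observation={observation_id}")

def record_evaluations(span, eval_results):
    # Because evaluators ran inside the experiment span, each score can be
    # attributed to that span's trace and observation ids.
    for evaluation in eval_results:
        create_score(
            trace_id=span["trace_id"],
            observation_id=span["id"],
            name=evaluation.name,
            value=evaluation.value,
        )

record_evaluations(
    {"trace_id": "trace-123", "id": "obs-456"},
    [Evaluation("exact_match", 1.0), Evaluation("levenshtein", 0.92)],
)
```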