Enhancement Request: Cumulative Token Usage Tracking Across Evaluation Runs
Currently, the Evaluation class's run_dataset method accurately tracks and returns token usage for a single evaluation run. However, when run_pipeline in the 5cs evaluation instrument orchestrates multiple Evaluation instances (one per prompt_type), the token usage from the individual runs is never aggregated. While run_dataset aborts if its max_tokens capacity is exceeded, no overall capacity limit is enforced across the entire run_pipeline execution: the max_tokens parameter passed to Evaluation applies independently to each category's run rather than cumulatively across all categories. As a result, a complete pipeline execution can consume an unexpectedly higher total number of tokens than intended.
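To illustrate the gap, here is a minimal sketch of how per-run limits fail to bound the pipeline total. The TokenUsage dataclass and the per-category numbers below are hypothetical stand-ins, not the project's actual structures:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the real TokenUsage structure.
@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens

# Illustrative usage for three prompt_type categories.
per_category_usage = {
    "prompt_type_a": TokenUsage(600, 300),
    "prompt_type_b": TokenUsage(700, 250),
    "prompt_type_c": TokenUsage(650, 300),
}

MAX_TOKENS = 1000  # today, applied independently to each run_dataset call

# Every individual run stays under its own limit, so none of them aborts...
assert all(u.total <= MAX_TOKENS for u in per_category_usage.values())

# ...yet the pipeline as a whole consumes nearly 3x the intended budget.
pipeline_total = sum(u.total for u in per_category_usage.values())
print(pipeline_total)  # 2800
```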
Proposed Enhancement
Modify the run_pipeline function to accept and enforce a cumulative max_tokens limit across all evaluation categories. This would involve the following:
1. Initialization: Initialize a total_accumulated_usage (e.g., a TokenUsage object or similar structure) at the beginning of the run_pipeline function.
2. Propagation: Pass this total_accumulated_usage to each Evaluation instance, allowing the Evaluation instance to update it. Alternatively, each run_dataset call could return its accumulated_usage, which run_pipeline then adds to total_accumulated_usage.
3. Capacity Check: After each evaluator.run_dataset call within the loop, run_pipeline should check whether total_accumulated_usage has exceeded the pipeline's overall max_tokens limit.
4. Early Termination: If the cumulative limit is exceeded, run_pipeline should log a warning and break out of the loop, similar to how run_dataset aborts individual runs.
5. Return Value: The run_pipeline function should also return total_accumulated_usage alongside the aggregated_output.
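The steps above could be sketched roughly as follows. This is an assumption-laden outline, not the project's actual implementation: run_category stands in for constructing an Evaluation and calling run_dataset, and the TokenUsage shape is hypothetical:

```python
from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the real TokenUsage structure.
@dataclass
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def add(self, other: "TokenUsage") -> None:
        self.prompt_tokens += other.prompt_tokens
        self.completion_tokens += other.completion_tokens

def run_pipeline(prompt_types, run_category, max_tokens):
    """Sketch of run_pipeline with a cumulative token budget.

    run_category(prompt_type) stands in for building an Evaluation
    and calling run_dataset; it returns (output, TokenUsage).
    """
    total_accumulated_usage = TokenUsage()  # 1. Initialization
    aggregated_output = {}
    for prompt_type in prompt_types:
        # 2. Propagation, here via the return-value variant
        output, usage = run_category(prompt_type)
        aggregated_output[prompt_type] = output
        total_accumulated_usage.add(usage)
        # 3. Capacity check after each category's run
        if total_accumulated_usage.total > max_tokens:
            # 4. Early termination, mirroring run_dataset's abort
            logger.warning(
                "Cumulative token limit exceeded (%d > %d); stopping pipeline.",
                total_accumulated_usage.total, max_tokens,
            )
            break
    # 5. Return usage alongside the aggregated output
    return aggregated_output, total_accumulated_usage
```

The return-value variant is used here because it keeps Evaluation unchanged; the shared-object variant would instead pass total_accumulated_usage into each Evaluation and let it mutate the object in place.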