Propagating token usage across runs #13

@guzmanben16

Enhancement Request: Cumulative Token Usage Tracking Across Evaluation Runs
Currently, the Evaluation class's run_dataset method accurately tracks and returns token usage for a single evaluation run. However, when run_pipeline in the 5cs evaluation instrument orchestrates multiple Evaluation instances (one per prompt_type), the token usage from the individual runs is not aggregated. The max_tokens parameter passed to Evaluation is applied independently to each category's run rather than cumulatively across all categories. So although run_dataset may abort when its own max_tokens capacity is exceeded, no overall limit is enforced across the entire run_pipeline execution. A complete pipeline execution can therefore consume unexpectedly more tokens in total than intended.
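
To make the gap concrete, here is a rough back-of-the-envelope sketch. The helper name and the figures are hypothetical, and it assumes each run stops at or before its own max_tokens (a run that aborts only after crossing the limit could overshoot slightly):

```python
def worst_case_total(max_tokens: int, num_prompt_types: int) -> int:
    """Approximate upper bound on pipeline usage when the cap is per run.

    Hypothetical helper for illustration only; assumes every run
    consumes at most max_tokens before it stops.
    """
    return max_tokens * num_prompt_types


# Illustrative numbers only: a 10,000-token cap across 5 prompt types.
print(worst_case_total(10_000, 5))  # 50000
```

With a per-run cap, the effective pipeline budget grows with the number of prompt types, which is probably not what a caller passing max_tokens expects.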

Proposed Enhancement
Modify the run_pipeline function to accept and enforce a cumulative max_tokens limit across all evaluation categories. This would involve the following:

Initialization: Initialize a total_accumulated_usage (e.g., a TokenUsage object or similar structure) at the beginning of the run_pipeline function.

Propagation: Pass this total_accumulated_usage to each Evaluation instance, allowing the Evaluation instance to update it. Alternatively, each run_dataset call could return its accumulated_usage, which run_pipeline then adds to total_accumulated_usage.

Capacity Check: After each evaluator.run_dataset call within the loop, run_pipeline should check if the total_accumulated_usage has exceeded the pipeline's overall max_tokens limit.

Early Termination: If the cumulative limit is exceeded, run_pipeline should log a warning and break out of the loop, similar to how run_dataset aborts individual runs.

Return Value: The run_pipeline function should also return the total_accumulated_usage alongside the aggregated_output.
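
The steps above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the TokenUsage fields, the StubEvaluation class, the (output, usage) return shape of run_dataset, and the run_pipeline signature are all assumptions made so the example is self-contained.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class TokenUsage:
    """Hypothetical usage record; the real TokenUsage may differ."""
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens

    def __add__(self, other: "TokenUsage") -> "TokenUsage":
        return TokenUsage(
            self.input_tokens + other.input_tokens,
            self.output_tokens + other.output_tokens,
        )


class StubEvaluation:
    """Stand-in for Evaluation; assumed to return (output, usage)."""

    def __init__(self, prompt_type: str, max_tokens: int):
        self.prompt_type = prompt_type
        self.max_tokens = max_tokens

    def run_dataset(self):
        # Pretend each run consumes 400 tokens, capped by the per-run limit.
        used = min(400, self.max_tokens)
        usage = TokenUsage(used // 2, used - used // 2)
        return {"prompt_type": self.prompt_type}, usage


def run_pipeline(prompt_types, max_tokens, evaluation_cls=StubEvaluation):
    # Initialization: one running total for the whole pipeline.
    total_accumulated_usage = TokenUsage()
    aggregated_output = []

    for prompt_type in prompt_types:
        evaluator = evaluation_cls(prompt_type, max_tokens=max_tokens)

        # Propagation: each run returns its usage, which is added to the total.
        output, usage = evaluator.run_dataset()
        aggregated_output.append(output)
        total_accumulated_usage = total_accumulated_usage + usage

        # Capacity check and early termination.
        if total_accumulated_usage.total > max_tokens:
            logger.warning(
                "Cumulative token limit exceeded after %r: %d > %d",
                prompt_type, total_accumulated_usage.total, max_tokens,
            )
            break

    # Return value: the outputs together with the cumulative usage.
    return aggregated_output, total_accumulated_usage


outputs, usage = run_pipeline(["a", "b", "c", "d", "e"], max_tokens=1000)
print(len(outputs), usage.total)  # 3 1200
```

Because the check runs only after each run_dataset call, the pipeline can still overshoot the limit by up to one run's usage (1200 against a limit of 1000 in the stubbed example). Passing the remaining budget, max_tokens minus the running total, as each run's own limit would be one way to tighten that bound.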

Labels: enhancement
