
Integrating a single indicator for holistic multi-dimensional evaluation #131

@fouratifares


Hi AssetOpsBench team,

While reviewing the benchmark and leaderboard, I noticed that results are currently reported across multiple dimensions (six at present). I'd like to propose integrating an aggregation method we recently introduced, which provides a more holistic comparison of systems across these dimensions.

We propose AGI_AUC, an aggregation technique that combines multiple evaluation dimensions into a single indicator intended to measure the general intelligence of a system across the considered benchmarks. Unlike a simple arithmetic mean, AGI_AUC is designed to avoid overstating performance and to better expose weaknesses across dimensions.
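To illustrate why an aggregate of this kind can expose weak dimensions that a simple arithmetic mean hides, here is a minimal sketch. It is NOT the formulation from Equations 2 and 3 of the paper. It assumes the aggregate can be approximated by averaging generalized (power) means M_p over a range of exponents p; the range [-1, 1], the trapezoidal integration, and the function names `power_mean` and `auc_of_power_means` are all hypothetical choices for illustration. Because M_p never decreases as p grows and M_1 is the arithmetic mean, an average over p ≤ 1 can never exceed the arithmetic mean, and it falls further when the scores are unbalanced.

```python
import math
from typing import Sequence


def power_mean(scores: Sequence[float], p: float) -> float:
    """Generalized (power) mean M_p of strictly positive scores."""
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be strictly positive for p <= 0")
    n = len(scores)
    if p == 0:
        # The limit p -> 0 of M_p is the geometric mean.
        return math.exp(sum(math.log(s) for s in scores) / n)
    return (sum(s ** p for s in scores) / n) ** (1.0 / p)


def auc_of_power_means(scores: Sequence[float],
                       p_min: float = -1.0,
                       p_max: float = 1.0,
                       steps: int = 200) -> float:
    """Hypothetical aggregate: mean value of M_p over [p_min, p_max].

    Computed as the trapezoidal area under p -> M_p, divided by the
    interval length so the result stays on the same scale as the scores.
    """
    ps = [p_min + (p_max - p_min) * i / steps for i in range(steps + 1)]
    values = [power_mean(scores, p) for p in ps]
    area = sum((values[i] + values[i + 1]) / 2 * (ps[i + 1] - ps[i])
               for i in range(steps))
    return area / (p_max - p_min)


if __name__ == "__main__":
    balanced = [0.6, 0.6, 0.6]    # arithmetic mean 0.6
    unbalanced = [0.9, 0.8, 0.1]  # also arithmetic mean 0.6, one weak dimension
    print("balanced:  ", auc_of_power_means(balanced))
    print("unbalanced:", auc_of_power_means(unbalanced))
```

Both score vectors have the same arithmetic mean, but the sketch assigns the unbalanced one a clearly lower value, because the exponents p < 1 weight the weak dimension more heavily.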

Details:

Technical formulation: Equations 2 and 3 in our paper
https://arxiv.org/pdf/2510.20784

Reference implementation:
https://github.com/fouratifares/coherence-agi

The method supports any number of dimensions and has been applied across several benchmarks, including 17 benchmarks used by the Gemini team.

I believe AGI_AUC could be a useful optional aggregate metric for AssetOpsBench (e.g., alongside per-dimension scores).

Looking forward to feedback and discussion.

Best,
Fares

Labels: enhancement (New feature or request)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions