Conversation

@burtenshaw (Collaborator)

Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.

Preparing the Article

You're not quite done yet, though. Please make sure to follow this process (as documented here):

  • Add an entry to _blog.yml (a sketch of this entry and of the md front matter follows this list).
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check that you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This keeps the GitHub base repo lean when cloning and pulling. Keep images small to avoid a slow or expensive experience for readers.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content into https://huggingface.co/new-blog. Do not click publish; this is just an early check.
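
For reference, here is a minimal sketch of what the `_blog.yml` entry and the markdown front matter might look like. The path, title, date, and tags below are illustrative assumptions, not values from this PR; the keys mirror typical existing entries in the blog repo.

```yaml
# Hypothetical _blog.yml entry -- all values are illustrative.
- local: community-evals                # short blog path (assumption)
  title: "Community Evals on the Hub"   # short title (assumption)
  thumbnail: /blog/assets/community-evals/thumbnail.png
  date: Feb 5, 2026
  tags:
    - evaluation
    - community
```

```yaml
---
# Hypothetical front matter at the top of the article's md file.
title: "Community Evals on the Hub"
thumbnail: /blog/assets/community-evals/thumbnail.png
authors:
  - user: burtenshaw
  - user: NathanHB   # add `guest: true` or `org: <org>` where relevant
---
```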

Here is an example of a complete PR: #2382

Getting a Review

Please make sure to get a review from someone on your team or a co-author.
Once this is done and all the steps above are completed, you should be able to merge.
No additional reviews are needed if you and your co-authors are happy and all of the above is met.

Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews (e.g., checking for proper metadata) rather than content reviews unless explicitly asked.

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

Community evals do not replace benchmarks and leaderboards, and closed evals with published leaderboards are still crucial. However, we want to complement them with open eval results based on reproducible eval specs.

This won't solve benchmark saturation or close the benchmark-reality gap. Nor will it stop training on test sets. But it makes the game visible by exposing what is evaluated, how, when, and by whom.
Member

❤️

Comment on lines 30 to 32
We are going to take evaluations on the Hugging Face Hub in a new direction by decentralizing reporting and allowing the entire community to openly report benchmark scores. We will start with a shortlist of 4 benchmarks and expand over time to the most relevant ones.

**For Benchmarks:** Dataset repos can now register as benchmarks ([MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa), [HLE](https://huggingface.co/datasets/cais/hle) are already live). They automatically aggregate reported results from across the hub and display leaderboards in the dataset card. The benchmark defines the eval spec via `eval.yaml`, based on [Inspect AI](https://inspect.aisi.org.uk/), so anyone can reproduce it. The reported results need to align with the task definition.
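
The post doesn't show the `eval.yaml` schema itself, so the following is a purely hypothetical sketch of what a benchmark repo's spec could contain. Every key below is an assumption; only `multiple_choice` and `choice` are real Inspect AI built-ins (a solver and a scorer, respectively).

```yaml
# Hypothetical eval.yaml for a benchmark dataset repo.
# The actual schema is defined by the Hub; these keys are illustrative only.
task: mmlu_pro                  # name of the Inspect AI task (assumption)
dataset: TIGER-Lab/MMLU-Pro     # dataset repo the spec lives in
split: test
solver: multiple_choice         # built-in Inspect AI solver
scorer: choice                  # built-in Inspect AI scorer
metrics:
  - accuracy
```

Pinning the spec in the dataset repo is what makes reported results comparable: anyone can rerun exactly the same task definition and check their score against the aggregated leaderboard.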
Contributor

(nit) there's a mention of 4 benchmarks, but only 3 get linked

@davanstrien davanstrien left a comment (Member)

small suggestion

burtenshaw and others added 5 commits February 4, 2026 14:58
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
@julien-c julien-c left a comment (Member)

looks like it's in good shape! 🚢

@burtenshaw burtenshaw merged commit f890f8a into main Feb 5, 2026
1 check passed
@burtenshaw burtenshaw deleted the eval-results branch February 5, 2026 15:03