[NEW] eval results release post #3271
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
> Community evals do not replace benchmarks and leaderboards, and closed evals with published leaderboards are still crucial. However, we want to contribute to this with open eval results based on reproducible eval specs.
> This won't solve benchmark saturation or close the benchmark-reality gap. Nor will it stop training on test sets. But it makes the game visible by exposing what is evaluated, how, when, and by whom.
❤️
community-evals.md (Outdated)
> We are going to take evaluations on the Hugging Face Hub in a new direction by decentralizing reporting and allowing the entire community to openly report scores for benchmarks. We will start with a shortlist of 4 benchmarks and expand over time to the most relevant ones.
> **For Benchmarks:** Dataset repos can now register as benchmarks ([MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa), [HLE](https://huggingface.co/datasets/cais/hle) are already live). They automatically aggregate reported results from across the Hub and display leaderboards in the dataset card. The benchmark defines the eval spec via `eval.yaml`, based on [Inspect AI](https://inspect.aisi.org.uk/), so anyone can reproduce it. The reported results need to align with the task definition.
(nit) there's a mention of 4 benchmarks, but only 3 get linked
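For context on what such a reproducible spec maps to in practice, here is a minimal sketch of an [Inspect AI](https://inspect.aisi.org.uk/) task for a multiple-choice benchmark. It is an illustration only: the dataset field names, the `record_to_sample` mapping, and the solver/scorer choices are assumptions for this example, not the actual spec shipped in any benchmark repo's `eval.yaml`.

```python
# Hypothetical Inspect AI task sketch for a multiple-choice benchmark.
# Field names ("question", "options", "answer") are assumptions for illustration.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.solver import multiple_choice
from inspect_ai.scorer import choice


def record_to_sample(record) -> Sample:
    # Map one dataset row to an Inspect sample.
    return Sample(
        input=record["question"],
        choices=record["options"],
        target=record["answer"],
    )


@task
def mmlu_pro():
    return Task(
        dataset=hf_dataset(
            path="TIGER-Lab/MMLU-Pro",  # benchmark dataset repo on the Hub
            split="test",
            sample_fields=record_to_sample,
        ),
        solver=multiple_choice(),
        scorer=choice(),
    )
```

Something like `inspect eval mmlu_pro.py --model <provider/model>` would then run it; the value of a shared spec is that anyone re-running it should get comparable numbers to report back to the dataset card.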
davanstrien left a comment
small suggestion
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
…to eval-results
julien-c left a comment
looks like it's in good shape! 🚢
Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles require additional reviews. Alternatively, you can write a community article following the process here.
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
…md file. You can also specify `guest` or `org` for the authors. Here is an example of a complete PR: #2382
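As a rough illustration of that metadata step (check the repo docs and the example PR above for the exact keys), the post's md front matter would look something like this; the title, thumbnail path, and usernames below are placeholders, not values from this PR:

```yaml
---
title: "Your post title"                           # placeholder
thumbnail: /blog/assets/your-post/thumbnail.png    # placeholder path
authors:
- user: your-hf-username
- user: coauthor-username
  guest: true     # mark authors who are outside the org
  org: their-org  # optionally attribute a guest author to an org
---
```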
Getting a Review
Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.
Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews (e.g., check for proper metadata) rather than content reviews unless explicitly asked.