Stencila Evaluations and Benchmarking
👋 Intro • 🚴 Roadmap • 🛠️ Develop • 🙏 Acknowledgements • 💖 Supporters
Welcome to the repository for Stencila's LLM evaluations and benchmarking. This is in early development and consolidates related code that we previously had in other repos.
We plan to use the following three main methodologies for evaluating LLMs for science-focussed prompts and tasks. To avoid discontinuities, we are likely to use a weighting approach, in which we gradually increase the weight of the more advanced methodologies as they are developed (a sketch of this weighting follows the list below).
1. Collate external benchmarks and map prompts to each. For example, combine scores from LiveBench's coding benchmark and Aider's code editing benchmark into a single code-quality score and use it for `stencila/create/code-chunk`, `stencila/create/figure-code`, and other code-related prompts.
2. Establish a pipeline for evaluating the prompts themselves, and which LLMs are best suited to each prompt, using LLM-as-a-jury and other methods for machine-based evaluation.
3. Use data from users' acceptance and refinement of AI suggestions within documents as the basis for human-based evaluations.
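As a rough illustration of the weighting idea (not the actual implementation; the score names and weights below are hypothetical), the combined score for a prompt category could be a weighted average of the per-methodology scores:

```python
# Hypothetical sketch of the weighting approach; names and weights are illustrative.


def combined_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-methodology scores, each normalized to 0..1."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


# Early on, most weight sits on collated external benchmarks; as the machine-based
# and human-based evaluations mature, weight shifts towards them.
early_weights = {"external": 0.8, "machine": 0.15, "human": 0.05}
print(combined_score({"external": 0.82, "machine": 0.75, "human": 0.70}, early_weights))
```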
For development, you’ll need to install uv and just.
Then, the following will get you started with a development environment:

```sh
just init
```

Once uv is installed, you can use it to install some additional tools:
```sh
uv tool install ruff
uv tool install pyright
```

The `justfile` has some common development-related commands that you might want to run.
For example, the `check` command runs all linting and tests:

```sh
just check
```

To run anything within the virtual environment, you need to use `uv run <command>`.
Alternatively, you can install direnv and have the virtual environment activated automatically.
See here for more details about using direnv and uv together.
Overview of the current design of the code:
- Code is fetched from the sources defined under `src/evals/benchmarks` and the raw downloaded data is saved.
- We then use pydantic classes to validate the incoming data and save it to parquet data frames using polars.
- The `tables` folder contains two tables (as CSVs). The first is a set of models with an `id` and their mapping to the model names in the benchmarks we download; a `use` column lets us pick which models we use. The second is a list of prompts, each with an associated `category`.
- We combine the downloaded parquet data frames with the models and prompts tables to generate lists of scores (validated by pydantic) and then save the results to another scoring data frame (a sketch of these stages follows this list).
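A minimal sketch of that flow, assuming pydantic for validation and polars for the parquet output (the class, field, and file names here are illustrative, not the actual ones in `src/evals`):

```python
# Illustrative sketch only; the real models and columns live under src/evals.
import polars as pl
from pydantic import BaseModel, Field


class BenchmarkScore(BaseModel):
    """One validated row of incoming benchmark data."""

    model: str
    benchmark: str
    score: float = Field(ge=0.0, le=1.0)  # normalized to 0..1


def save_scores(raw_rows: list[dict], path: str) -> pl.DataFrame:
    """Validate raw rows with pydantic, then save them as a parquet data frame."""
    validated = [BenchmarkScore(**row).model_dump() for row in raw_rows]
    frame = pl.DataFrame(validated)
    frame.write_parquet(path)
    return frame


save_scores(
    [{"model": "gpt-4o", "benchmark": "livebench-coding", "score": 0.72}],
    "data/example-scores.parquet",
)
```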
Each of these stages can be run from the command line.
To see the commands, look in `pyproject.toml` under the `[project.scripts]` section; for example, there is a command to download the benchmarks.
These commands are also invoked from the `justfile` (`just all`).
- By default, all the data just gets saved under a `data` folder in the root of the project.
- The scores are currently normalized to 0..1, rather than 0-100 (see the sketch after this list).
- There is no output to any SQLite database yet, though there is a schema sketched in `src/evals/orm.py`.
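For example, a raw 0-100 benchmark score maps onto that range with a simple rescale (the helper name is hypothetical):

```python
def normalize_score(raw: float, maximum: float = 100.0) -> float:
    """Rescale a raw benchmark score (e.g. 0-100) to the 0..1 range used here."""
    return max(0.0, min(raw / maximum, 1.0))


normalize_score(82.5)  # -> 0.825
```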
Thank you to the following projects whose code and/or data we rely on:
We are grateful for the support of the Astera Institute for this work.

