This is a Python codebase for preparing and analyzing the Hugging Face datasets for FormulaCode (67 repositories; 964+ performance-improving commits) and FormulaCode-V (?? repositories; 200 performance-improving commits with manually verified pytest benchmarks).
FormulaCode benchmarks the ability of large language models (LLMs) to optimize the performance of real-world codebases. It complements existing benchmarks (e.g. SWE-Bench) by reusing the same API and methodology as SWE-Bench.
- Human-relative metric – FormulaCode scores an optimizer relative to the speed-up achieved by the human author of the original commit, preventing "memorize-and-saturate" tactics (see the illustrative sketch below).
- Finer-grained feedback – Performance measurements provide a dense reward signal that helps RL or evolutionary algorithms iterate more effectively than binary pass/fail unit tests.
- Performance benchmarks vs. unit tests – Unit tests protect against functional regressions but can be over-fit; realistic workload benchmarks capture the critical performance hot-paths developers actually care about.
- Real-world impact – FormulaCode uses each library's own pre-defined performance benchmark workloads. As a result, if an LLM statistically outperforms the human baseline on a FormulaCode task, the resulting patch is often state-of-the-art and can be upstreamed to the library, provided it is first manually verified and thoroughly validated.
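The human-relative metric is not formally specified here; as a rough illustration (assuming runtimes are measured on the same benchmark workload before and after each patch), it could be computed along these lines:

def human_relative_score(baseline_s: float, human_s: float, model_s: float) -> float:
    """Illustrative only: score a model patch relative to the human commit.

    baseline_s: benchmark runtime (seconds) before the optimizing commit
    human_s:    runtime after the human author's patch
    model_s:    runtime after the model's patch

    Returns the ratio of the model's speed-up to the human's speed-up,
    so 1.0 means "matched the human author" and >1.0 means "beat them".
    """
    human_speedup = baseline_s / human_s
    model_speedup = baseline_s / model_s
    return model_speedup / human_speedup


# Example: the human made the benchmark 2x faster, the model 3x faster.
print(human_relative_score(baseline_s=10.0, human_s=5.0, model_s=10.0 / 3))  # 1.5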
Make a tokens.env file with your GitHub and Codecov credentials.
# Cache and backup locations
CACHE_LOCATION=/home/???/formulacode/datasmith/scratch/artifacts/cache.db
BACKUP_DIR=/home/???/formulacode/backup/
# Scraping tokens
GH_TOKENS=github_pat_???,github_pat_???
CODECOV_TOKEN=54c6???
# LLM configuration for context synthesis
DSPY_MODEL_NAME=openai/meta-llama/Llama-3.3-70B-Instruct
DSPY_URL=http://localhost:30000/v1
DSPY_API_KEY=local
DSPY_TEMPERATURE=0.7
# For DockerHub publishing (dataset verification)
DOCKERHUB_NAMESPACE=formulacode # Required for dataset verification
DOCKERHUB_USERNAME=myuser # Required for dataset verification
DOCKERHUB_TOKEN=dckr_pat_xxxxx # Required for dataset verification
# For ECR access (legacy/optional)
AWS_REGION=us-east-1
# Depends on the system.
#DOCKER_USE_BUILDX=0
DOCKER_NETWORK_MODE=host
Then, install uv and set up the development environment:
$ curl -LsSf https://astral.sh/uv/install.sh | sh
# Installs pre-commit hooks and dev dependencies.
$ make install
# Resolve initial formatting issues.
$ uv run pre-commit run -a
$ make check
# Run tests to verify installation.
$ make test
Ensure your machine can run cron tasks; this is needed to keep FormulaCode updated every month.
$ crontab -l
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
# Clean up Docker containers every day at midnight
0 0 * * * /usr/bin/docker container prune -f
# Clean up dangling Docker images every week
0 0 * * 0 /usr/bin/docker image prune -f
# Run FormulaCode update script on the 25th day of every month at 2am
0 2 25 * * cd /home/???/formulacode/datasmith && ./.venv/bin/python scratch/scripts/update_formulacode.py >> scratch/logs/update_formulacode_$(date +\%Y\%m\%d).log 2>&1
$ crontab -e
# <Make the necessary edits>
The general layout of the artifacts is as follows:
scratch/artifacts
├── cache.db # See `CACHE_LOCATION` env var
├── raw/ # Raw downloads & lists produced by scripts
│  ├── downloads/ # Per‑repo dashboard archives
│  ├── online_dashboards.jsonl # Updated config for dashboard scraper
│  ├── repos_discovered.csv # Candidates from GitHub search
│  ├── repos_valid.csv
└── processed/ # Outputs of various processing scripts
You will need to download and install uv to set up Datasmith. The rest of the process is automated using make commands.
$ curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dev environment and pre-commit hooks
$ make install
# Resolve initial formatting issues.
$ uv run pre-commit run -a
$ make check
For querying GitHub and Codecov, we need to set up a few environment variables. You can do this by creating a tokens.env file in the root of the repository with the following content:
$ cat tokens.env
GH_TOKEN=github_pat_???
COVERALLS_TOKEN=XdK???
CODECOV_TOKEN=54c6???
CACHE_LOCATION=/home/???/formulacode/datasmith/scratch/artifacts/cache.db
BACKUP_DIR=/home/???/formulacode/backup/
FormulaCode is a dataset of 101 repositories and 4M+ PRs, built with an automated pipeline that scrapes, filters, benchmarks, and analyzes performance-improving commits in open-source repositories that use Airspeed Velocity (asv) for benchmarking. To update FormulaCode, simply run:
$ python scratch/scripts/update_formulacode.py --start-date 2025-10-01 --end-date 2025-11-01
The next sections describe each step of the pipeline in detail.
The diagram below summarizes how scratch/scripts/update_formulacode.py orchestrates the monthly update pipeline and how each downstream script delegates to internal modules and services.
sequenceDiagram
participant U as update_formulacode.py
participant ENV as env setup
participant C1 as collect_commits.py
participant C2 as collect_and_filter_commits.py
participant C3 as prepare_commits_for_building_reports.py
participant C4 as collect_perf_commits.py
participant C5 as synthesize_contexts.py
participant C6 as build_and_publish_to_ecr.py
participant CSV as repos_valid.csv
participant GH as github or offline store
participant TMP as temp repo dir
participant MI as merge info
participant FS as file system
participant LLM as llm backends
participant SQL as sqlite cache
participant CR as context registry
participant DOCK as docker build
participant ECR as aws ecr
%% Orchestrator setup (grey)
rect rgb(230,230,230)
U->>ENV: step 0 setup environment and logging
ENV->>U: environment ready
end
%% Step 1 collect_commits.py (blue)
rect rgb(210,225,255)
U->>C1: step 1 collect commits
C1->>CSV: read repos_valid.csv
C1->>GH: find perf commits for each repo
C1->>GH: find tagged releases
C1->>FS: write commits jsonl
end
%% Step 2 collect_and_filter_commits.py (green)
rect rgb(210,245,220)
U->>C2: step 2 collect and filter commits
C2->>CSV: read repos_valid.csv
C2->>TMP: clone repo into temp dir
C2->>MI: collect merge shas and commit info
C2->>FS: write merge_commits_filtered parquet
end
%% Step 3 prepare_commits_for_building_reports.py (yellow)
rect rgb(255,250,210)
U->>C3: step 3 prepare commits for reports
C3->>FS: read merge_commits_filtered parquet
C3->>C3: tokenize patches and crude perf filter
C3->>C3: analyze commits in threads
C3->>DOCK: make tasks with container names
C3->>FS: optional get patch from diff url
C3->>FS: write parquet with patch
end
%% Step 4 collect_perf_commits.py (red-ish)
rect rgb(255,225,220)
U->>C4: step 4 classify performance commits
C4->>FS: read prepared parquet
C4->>C4: report builder per row
C4->>SQL: cache completion in sqlite
C4->>LLM: call llm backends
LLM->>C4: performance classification
C4->>FS: write raw parquet
C4->>FS: write perf only parquet
end
%% Step 5 synthesize_contexts.py (purple)
rect rgb(235,220,255)
U->>C5: step 5 synthesize contexts
C5->>C5: configure agent backends
C5->>FS: load perf only parquet
C5->>CR: load context registry and update
CR->>C5: context registry ready
C5->>DOCK: build base image
DOCK->>C5: base image built
C5->>C5: prepare task list per repo and commit
C5->>C5: agent build and validate in threads
C5->>FS: write results jsonl and all files by image json
C5->>CR: update context_registry json
end
%% Step 6 build_and_publish_to_ecr.py (teal)
rect rgb(210,245,245)
U->>C6: step 6 build and publish
C6->>FS: read perf only parquet
C6->>CR: load context registry and update
C6->>DOCK: build base image
DOCK->>C6: base image built
C6->>C6: prepare task list for recent package images
C6->>ECR: optional filter tasks not on ecr
ECR->>C6: list of existing images
C6->>DOCK: build using docker validator
DOCK->>C6: built images
C6->>ECR: publish images to aws ecr
end
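As a rough sketch of what the orchestrator does (illustrative only; the real update_formulacode.py adds environment setup, logging, and richer error handling), each step is essentially a child process run in order, stopping at the first failure. The flags shown are the ones documented in the sections below:

import subprocess
import sys

# Illustrative ordering of the first pipeline steps driven by update_formulacode.py.
STEPS = [
    ["python", "scratch/scripts/collect_commits.py",
     "--dashboards", "scratch/artifacts/pipeflush/repos_valid.csv",
     "--outfile", "scratch/artifacts/pipeflush/commits_all.jsonl"],
    ["python", "scratch/scripts/collect_and_filter_commits.py",
     "--filtered-benchmarks-pth", "scratch/artifacts/pipeflush/repos_valid.csv",
     "--output-pth", "scratch/artifacts/pipeflush/merge_commits_filtered.parquet"],
    # ... steps 3-6 follow the same pattern ...
]

for cmd in STEPS:
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(f"step failed: {' '.join(cmd)}")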
We start by collecting all repositories that use Airspeed Velocity (asv) for benchmarking. This can be done in one of two ways:
- Google BigQuery: Google maintains a public dataset of GitHub repositories that can be queried using SQL.
- GitHub Search API: We use the GitHub Search API to find all repositories that have an asv.conf.json file in their root directory. This is a more comprehensive search that can find repositories that are not indexed by Google BigQuery. This version is implemented here (a minimal sketch of the query is shown below).
To run the script, you need to have a GitHub token with repo and read:org permissions. You can create a token by following the instructions here.
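A minimal sketch of the Search API approach, assuming the token is available in GH_TOKEN (the actual discovery script additionally handles pagination, rate limiting, and repository validation):

import os
import requests

# Search GitHub for repositories with an asv.conf.json at the repository root.
# Code search requires at least one search term alongside the qualifiers.
headers = {
    "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
    "Accept": "application/vnd.github+json",
}
resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": "project filename:asv.conf.json path:/", "per_page": 100},
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
repos = {item["repository"]["full_name"] for item in resp.json()["items"]}
print(f"found {len(repos)} candidate repositories on this page")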
Run:
$ python scratch/scripts/collect_commits.py \
--dashboards scratch/artifacts/pipeflush/repos_valid.csv \
--outfile scratch/artifacts/pipeflush/commits_all.jsonl \
--max-pages 50
# Writes scratch/artifacts/processed/repos_discovered.csv and scratch/artifacts/processed/repos_valid.csv
The scratch/artifacts/processed/repos_valid.csv file contains the subset of repositories that are not forks or reuploads, have at least {min-stars} stars, and pass other sanity checks. We found ~700 such repositories for this dataset.
Given the list of repositories, we find the subset of commits that have already been closed and merged into the main branch, and then filter out those commits that primarily modified the benchmarking files (e.g. asv.conf.json), were not relevant to the benchmarks (e.g. documentation changes), or could not be installed (e.g. running uv pip install -e . causes issues).
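For intuition, the path-based part of that filter might look roughly like this (an illustrative sketch; the real logic lives in collect_and_filter_commits.py and also checks installability):

# Illustrative heuristic: drop commits that only touch benchmark config or docs.
BENCH_FILES = {"asv.conf.json"}
DOC_SUFFIXES = (".md", ".rst", ".txt")

def looks_relevant(changed_paths: list[str]) -> bool:
    """Return True if a commit touches something other than benchmarks/docs."""
    for path in changed_paths:
        name = path.rsplit("/", 1)[-1]
        if name in BENCH_FILES or path.startswith(("benchmarks/", "docs/")):
            continue
        if path.endswith(DOC_SUFFIXES):
            continue
        return True  # at least one "real" source change
    return False

print(looks_relevant(["docs/index.rst", "asv.conf.json"]))        # False
print(looks_relevant(["src/core.py", "benchmarks/bench_io.py"]))  # True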
$ python scratch/scripts/collect_and_filter_commits.py \
--filtered-benchmarks-pth scratch/artifacts/pipeflush/repos_valid.csv \
--output-pth scratch/artifacts/pipeflush/merge_commits_filtered.parquet \
--threads 8 \
--procs 32
$ python scratch/scripts/prepare_commits_for_building_reports.py \
--input scratch/artifacts/pipeflush/merge_commits_filtered.parquet \
--output scratch/artifacts/pipeflush/merge_commits_filtered_with_patch.parquet \
--max-workers 200 \
--filter-repos \
--fetch-patches
$ python scratch/scripts/collect_perf_commits.py \
--commits scratch/artifacts/processed/merge_commits_filtered_with_patch.parquet \
--outfile scratch/artifacts/processed/perfonly_commits_with_patch.parquet \
--max-workers -1
Each context is a (repo, commit) pair with an associated build_env.sh script to install dependencies. Some reasons a context might fail to build (and get filtered out) are listed below; a rough sketch of these checks follows the list:
- Commit couldn't be checked out
- Commit didn't have an asv.conf.json file
- We could not build the asv environment for the commit.
- We could not run a quick asv run to ensure that the benchmarks run.
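Conceptually, the per-context validation mirrors that list. A rough sketch (a hypothetical helper, not the pipeline's actual code) could be:

import subprocess
from pathlib import Path

def validate_context(repo_dir: Path, commit_sha: str) -> bool:
    """Hypothetical smoke test for a (repo, commit) context; mirrors the
    failure reasons listed above. Returns False if the context should be dropped."""
    # 1. The commit must be checkout-able.
    checkout = subprocess.run(["git", "-C", str(repo_dir), "checkout", commit_sha])
    if checkout.returncode != 0:
        return False
    # 2. The commit must ship an asv configuration.
    if not (repo_dir / "asv.conf.json").exists():
        return False
    # 3. The asv environment must build and a quick benchmark run must succeed.
    quick = subprocess.run(["asv", "run", "--quick", "HEAD^!"], cwd=repo_dir)
    return quick.returncode == 0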
$ python scratch/scripts/synthesize_contexts.py \
--commits scratch/artifacts/pipeflush/commits_perfonly.parquet \
--output-dir scratch/artifacts/pipeflush/results_synthesis/ \
--context-registry scratch/artifacts/pipeflush/context_registry.json \
--max-workers 32 \
--max-attempts 3 \
--max-steps 10 \
--max-similar-candidates 5 \
--ignore-exhausted \
--push-to-dockerhub
We rebuild the Docker images from scratch and then upload them to AWS ECR for later use.
$ python scratch/scripts/build_and_publish_to_ecr.py \
--commits scratch/artifacts/processed/perfonly_commits_with_patch_final.parquet \
--context-registry scratch/artifacts/pipeflush/context_registry.json \
--max-workers 5 \
--skip-existing
As an alternative to AWS ECR, you can publish Docker images to DockerHub for easier public sharing and distribution.
First, generate a DockerHub access token:
- Visit https://hub.docker.com/settings/security
- Click "New Access Token"
- Give it a descriptive name (e.g., "datasmith-publisher")
- Copy the token
Then configure your environment:
# Add to your tokens.env file
export DOCKERHUB_NAMESPACE=formulacode # Your DockerHub username or organization
export DOCKERHUB_USERNAME=myusername # Your DockerHub username
export DOCKERHUB_TOKEN=dckr_pat_xxxxx # Access token from above
Or use Docker login:
$ docker login docker.io
# Enter your username and token when prompted
$ python scratch/scripts/build_and_publish_to_dockerhub.py \
--commits scratch/artifacts/processed/perfonly_commits_with_patch_final.parquet \
--context-registry scratch/artifacts/pipeflush/context_registry.json \
--namespace formulacode \
--max-workers 5 \
--skip-existing
Options:
- --namespace: Required. DockerHub namespace (your username or organization)
- --username: DockerHub username (or use DOCKERHUB_USERNAME env var)
- --password: DockerHub token (or use DOCKERHUB_TOKEN env var)
- --repository-mode: single (default) or mirror (see the naming sketch below)
  - single: all images in one repository with encoded tags (e.g., formulacode/all:owner-repo-sha--final)
  - mirror: each project gets its own repository (e.g., formulacode/owner-repo:sha-final)
- --single-repo: Repository name for single mode (default: all)
- --skip-existing: Skip images that already exist on DockerHub
- --max-workers: Number of parallel build/push operations (default: 8)
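To make the two tag layouts concrete, a hypothetical helper that builds the image reference for each mode could look like:

def image_ref(namespace: str, owner: str, repo: str, sha: str,
              mode: str = "single", single_repo: str = "all") -> str:
    """Illustrative: encode (owner, repo, sha) into a DockerHub image reference
    following the naming convention shown above."""
    if mode == "single":
        # Everything lives in one repository; identity is encoded in the tag.
        return f"{namespace}/{single_repo}:{owner}-{repo}-{sha}--final"
    # Mirror mode: one repository per project.
    return f"{namespace}/{owner}-{repo}:{sha}-final"

print(image_ref("formulacode", "numpy", "numpy", "abc123"))
# formulacode/all:numpy-numpy-abc123--final
print(image_ref("formulacode", "numpy", "numpy", "abc123", mode="mirror"))
# formulacode/numpy-numpy:abc123-final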
- Public Visibility: New DockerHub repositories default to PUBLIC. Change to private manually on DockerHub if needed.
- Rate Limits: DockerHub free tier has rate limits. The script handles this with exponential backoff (a rough sketch of the retry pattern follows these notes), but consider a paid plan for high-volume publishing.
- Organization Repositories: For organization namespaces, repositories may need to be manually created on DockerHub before first push.
- Push Concurrency: Default push concurrency is lower for DockerHub (8) vs ECR (12) to avoid rate limiting.
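The rate-limit handling boils down to a standard retry-with-backoff loop. A minimal sketch, assuming a hypothetical push_image callable and the DOCKERHUB_RATE_LIMIT_WAIT knob shown below:

import os
import time

def push_with_backoff(push_image, ref: str, max_attempts: int = 5) -> None:
    """Retry a DockerHub push with exponential backoff on rate-limit errors.
    `push_image` is a hypothetical callable that raises RuntimeError on HTTP 429."""
    base_wait = int(os.environ.get("DOCKERHUB_RATE_LIMIT_WAIT", "60"))
    for attempt in range(max_attempts):
        try:
            push_image(ref)
            return
        except RuntimeError:
            wait = base_wait * (2 ** attempt)
            print(f"rate limited pushing {ref}; retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"giving up on {ref} after {max_attempts} attempts")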
# DockerHub-specific (add to tokens.env)
DOCKERHUB_NAMESPACE=formulacode # Required
DOCKERHUB_USERNAME=myuser # Required
DOCKERHUB_TOKEN=dckr_pat_xxxxx # Required
DOCKERHUB_RATE_LIMIT_WAIT=60 # Optional: seconds to wait on rate limit (default: 60)
DOCKERHUB_SINGLE_REPO=all # Optional: repo name for single mode (default: all)
# Build settings (shared with ECR)
BUILD_CONCURRENCY=24 # Max parallel builds
PUSH_CONCURRENCY=8 # Max parallel pushes (lower for DockerHub)
DOCKER_USE_BUILDX=0 # Use Docker BuildKit/buildx (1 to enable, 0 to disable)
DOCKER_NETWORK_MODE=host # Build network mode
This is done in FormulaCode's fork of the terminal-bench evaluation framework.
This project is licensed under the MIT License. See the LICENSE file for details.
