We present AutoBaxBuilder, an automated framework that generates code security benchmark tasks from scratch, reducing manual effort by ~12× while matching or outperforming expert tests and exploits.
- Paper: AutoBaxBuilder: Bootstrapping Code Security Benchmarking
- Website: baxbench.com/autobaxbuilder
- Dataset: HuggingFace
We recommend using the Docker-based workflow for reproducibility. A native Conda-based setup is also provided for convenience.
Build the Docker image, ensuring user and Docker group IDs are aligned with the host for reproducibility and correct permissions:
```bash
docker build \
  --build-arg UID=$(id -u) \
  --build-arg GID=$(id -g) \
  --build-arg DOCKER_GID=$(getent group docker | cut -d: -f3) \
  -t autobaxbuilder .
```

Should `getent` not exist on your system, fall back to `--build-arg DOCKER_GID=$(stat -c '%g' /var/run/docker.sock)` instead.
Store your API keys in a `.env` file. Note that `docker run --env-file` expects plain `KEY=value` lines; Docker does not strip an `export` prefix or surrounding quotes:

```
OPENAI_API_KEY=<your_API_key>
TOGETHER_API_KEY=<your_API_key>
ANTHROPIC_API_KEY=<your_API_key>
OPENROUTER_API_KEY=<your_API_key>
```

Run an interactive shell inside the container, mounting the current directory and loading the environment variables:
```bash
docker run \
  --network host \
  --env-file /path/to/env \
  -it \
  --memory="4g" \
  --rm \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v $(pwd):/app \
  autobaxbuilder /bin/bash
```

If `--network host` is not available in your system configuration, use `--add-host=localhost:host-gateway` instead.
The project uses the conda package manager. Please install it from here. If you would like to have a lightweight installation, consider Miniconda.
Then, the environment can be installed by running the following command:

```bash
conda env create -n autobaxbench -f env.yaml
```

Activate the environment:

```bash
conda activate autobaxbench
```

Then, install the module:

```bash
pip install -e .
```
Optional system dependencies (generally not needed, only required by some BaxBench scenarios):

- `imagemagick` - for image conversion scenarios
- `ffmpeg` - for video processing scenarios
- `poppler-utils` - for PDF text extraction (provides `pdftotext`)
- `nodejs` - for TypeScript compilation scenarios
- `g++` and `make` - for C++ compilation scenarios
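Before installing, you can check which of these tools are already on your `PATH`. The binary names below are assumptions mapped from the packages above (ImageMagick v6 exposes `convert`, v7 exposes `magick`; Node.js exposes `node`):

```bash
# Check which optional tools are already installed (binary names assumed from the packages above)
for tool in convert ffmpeg pdftotext node g++ make; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```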
```bash
sudo apt install imagemagick ffmpeg poppler-utils nodejs g++ make
```

Optionally, set up the pre-commit hooks:

```bash
pre-commit install
```
Run AutoBaxBuilder from the repository root with `python src/main.py`. It runs in three modes, selected by one of three flags:

- `--generate_scenarios`: generate scenarios
- `--generate_tests`: generate tests (requires the `--scenario` parameter)
- `--generate_exploits`: generate exploits (requires the `--scenario` parameter)

Further options:

- `--scenario`: scenario name (required when using `--generate_tests` or `--generate_exploits`)
- `--difficulty N`: number of endpoints of the scenario
- `--N_RETRIES N`: number of recovery steps in agentic retry loops
- `--N_SOL_STEPS N`: maximum steps for solution iteration
- `--N_TEST_STEPS N`: maximum steps for test iteration
- `--N_SEC_STEPS N`: maximum steps for security iteration
- `--debug`: debug mode (print additional information)
- `--path PATH`: artifact path
```bash
# Generate scenarios
python src/main.py --generate_scenarios

# Generate tests for a specific scenario
python src/main.py --generate_tests --scenario FooBarScenario

# Generate exploits for a specific scenario
python src/main.py --generate_exploits --scenario FooBarScenario

# Run in debug mode
python src/main.py --generate_tests --scenario FooBarScenario --debug

# Save artifacts in a different directory
python src/main.py --generate_scenarios --path /path/to/artifacts/
```

`--generate_scenarios` produces a new folder in the artifacts directory, corresponding to a novel scenario. `--generate_tests` takes the name of a scenario produced with `--generate_scenarios` and generates functional tests on that basis. `--generate_exploits` also takes a scenario name and builds on top of the previous steps to develop security tests. Each step creates artifacts. For an exemplary scenario named `FooBarScenario`, these are structured as follows:
- `FooBarScenario.json`: initial scenario specification from `--generate_scenarios`
- `FooBarScenario_iu{t}`: scenario specification (JSON and Py) after t steps of test iteration
- `FooBarScenario_iw{t}`: scenario specification (JSON and Py) after t steps of security iteration
- `FooBarScenario_implementations_it{t}`: solutions after t steps of solution iteration
- `FooBarScenario_implementations_iu{t}`: solutions after t steps of test iteration
- `FooBarScenario_implementations_iw{t}`: solutions after t steps of security iteration
- `FooBarScenario_results_{it/iu/iw}{t}`: results (JSON and iteration matrix as PNG) of running the tests against the solutions in each intermediate step
- `FooBarScenario_tasklist.json`: stored solution code paths (implementation detail)
- `token_usage.txt` and `verdicts.txt`: diagnostic logs
The .py artifacts the pipeline produces can directly be used as novel scenarios in the BaxBench framework. Refer to the instructions in the evaluation section below on how to generate, test and evaluate solutions for a scenario.
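When consuming these artifacts programmatically, note that the iteration index t in `*_iw{t}` is numeric, so a plain lexicographic sort would rank `_iw9` after `_iw10`. A minimal sketch of picking the latest security-iteration artifact (the `latest_iw_artifact` helper and the sample filenames are illustrative, not part of the codebase):

```python
import re

def latest_iw_artifact(paths):
    """Return the path with the highest security-iteration index t in *_iw{t}.py."""
    def step(path):
        match = re.search(r"_iw(\d+)\.py$", path)
        return int(match.group(1)) if match else -1
    candidates = [p for p in paths if step(p) >= 0]
    return max(candidates, key=step) if candidates else None

print(latest_iw_artifact([
    "FooBarScenario_iw2.py",
    "FooBarScenario_iw10.py",
    "FooBarScenario_iw9.py",
]))  # numeric comparison picks FooBarScenario_iw10.py
```

This is also why the copy command in the evaluation section uses `ls -v` (natural version sort) rather than a plain `ls`.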
The AutoBaxBench scenarios are included both with and without CWE-400, in `src/scenarios/with_cwe_400` and `src/scenarios/without_cwe_400` respectively. When generating your own scenarios, they are stored in the `artifacts/` directory by default; the latest artifact is the highest-numbered `artifacts/scenario_name/*_iw*.py` file. To run BaxBench with these scenarios and reproduce our evaluation results, follow these steps.
```bash
git clone git@github.com:logic-star-ai/baxbench.git
cd baxbench
```

Follow the setup instructions in the BaxBench repository to install dependencies and configure the environment. Alternatively, you can simply use the AutoBaxBuilder conda or Docker setup from above, which already includes all required dependencies.
Copy the scenario files from AutoBaxBuilder to the BaxBench scenarios directory, adjusting paths as needed:

```bash
# Copy AutoBaxBuilder generated scenario artifact
cp "$(ls -v /path/to/autobaxbuilder/artifacts/scenario_name/*_iw*.py | tail -n 1)" /path/to/baxbench/src/scenarios/

# Copy AutoBaxBench scenario artifact
cp /path/to/autobaxbuilder/src/scenarios/without_cwe_400/scenario_name.py /path/to/baxbench/src/scenarios/
```

Update the `src/scenarios/__init__.py` file in BaxBench to include the new scenario(s): add an import statement for each scenario you copied and include it in the `all_scenarios` variable.
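The `__init__.py` edit might look roughly as follows; the module name `scenario_name` and the exported symbol `SCENARIO` are placeholders — check the copied scenario file for the actual object it defines:

```python
# src/scenarios/__init__.py (BaxBench) — illustrative sketch, names are placeholders
from scenarios.scenario_name import SCENARIO as scenario_name_scenario

all_scenarios = [
    # ... existing scenarios ...
    scenario_name_scenario,
]
```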
Now you can run the full BaxBench evaluation pipeline on the generated scenarios:
```bash
# Generate solutions for a scenario
python src/main.py \
  --models gpt-4o \
  --mode generate \
  --scenarios scenario_name

# Test the generated solutions
python src/main.py \
  --models gpt-4o \
  --mode test \
  --scenarios scenario_name

# Evaluate the results
python src/main.py \
  --models gpt-4o \
  --mode evaluate \
  --scenarios scenario_name
```

Refer to the BaxBench repository for more details on how to generate, test, and evaluate solutions for scenarios.
If you find AutoBaxBuilder helpful in your research, please use the following citation:

```bibtex
@article{vonarx2025autobaxbuilderbootstrappingcodesecurity,
  title={AutoBaxBuilder: Bootstrapping Code Security Benchmarking},
  author={Tobias von Arx and Niels Mündler and Mark Vero and Maximilian Baader and Martin Vechev},
  year={2025},
  eprint={2512.21132},
  archivePrefix={arXiv},
}
```

This project is licensed under the MIT license.