Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, Daniel Fried (Carnegie Mellon University)
First, clone this repository and install the dependencies listed in requirements.txt.
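For example (the repository URL below is a placeholder; substitute the actual location of this repository):
git clone https://github.com/<org>/<repo>.git
cd <repo>
pip install -r requirements.txt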
Access the dataset via Hugging Face, or run
cd dataset
python process_dataset.py
This should create mlrc_n_{1,2,3,4,5}.jsonl files in dataset/MLRC.
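To sanity-check the output, you can pretty-print the first record of one of the generated files (this assumes the files were written to dataset/MLRC as described above):
head -n 1 dataset/MLRC/mlrc_n_1.jsonl | python -m json.tool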
For each repository, we recommend creating a conda environment from the provided .yml file. Refer to the conda documentation for managing conda environments.
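A minimal sketch, assuming a repository ships an environment file named environment.yml (the exact file name and environment name vary per repository):
conda env create -f environment.yml
conda activate <env-name>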
Note: Each sample in the dataset is identified by its "combined_id" ({paper_id}_{function_id}). For example, 2205.00048_0 is a sample where the code repository from paper 2205.00048 is used and function 0 is masked out; 2205.00048_0,1 is a sample with functions 0 and 1 masked out.
Basic example of running the agent. The command below runs the agent with ReAct prompting, Full history management, and GPT-4o as the model backbone on the datapoint where function 0 of the repository from paper 2205.00048 is removed.
bash run_exp_from_env.sh \
--combined-id 2205.00048_0 \
--agent ReAct \
--memory Full \
--model-engine gpt-4o
Optional arguments
--max-agent-steps 50: maximum number of action-taking steps allowed
--compute-budget 1.0: compute budget allowed in dollars
--max-compute-time 1800: maximum time (in seconds) allowed for the agent
--retrieval full: options are "no", "full", "embedding", and "oracle"
--code-retrieval full: options are "no", "full", "ast", and "embedding"
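For instance, the basic ReAct command above can be extended with some of these optional arguments (the values here are illustrative, not recommendations):
bash run_exp_from_env.sh \
--combined-id 2205.00048_0 \
--agent ReAct \
--memory Full \
--model-engine gpt-4o \
--max-agent-steps 50 \
--compute-budget 1.0 \
--retrieval full \
--code-retrieval ast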
Running the agent produces log files in the agents/logs directory; the agent's working directory is agents/workspace.
These are instructions on how to reproduce the experiments in the paper.
Preparing the dataset will generate the files mlrc_n_{1,2,3,4,5}.jsonl. Each sample can be run individually with its combined_id. A way to run a batch of samples is described in the section "Batched run jobs using wandb sweeps".
After running
cd agents
python log_parser.py --log-dir path/to/log/directory
This will generate a CSV file in the log directory containing the generated Python code for each run. Then, run
python verifier.py --log-dir path/to/log/directory --model-engine model_name
This will run the model verifier with the specified LLM engine and output a file containing verifier selections.
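Putting both steps together, a full evaluation pass might look like the following (the log directory name and verifier model are placeholders):
cd agents
python log_parser.py --log-dir logs/<run-directory>
python verifier.py --log-dir logs/<run-directory> --model-engine gpt-4o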
We also reproduce the "agentless" harness. It can be run by setting --agent Agentless and disregarding any parameters related to agent architecture. The maximum number of reasoning tokens can be controlled with the --max-reasoning-tokens parameter.
Example
bash run_exp_from_env.sh \
--combined-id 2205.00048_0 \
--agent Agentless \
--model-engine o3-mini \
--max-reasoning-tokens 8192
The No vs. Full settings are controlled with the --retrieval [full, no] parameter; "full" is the default. In the "Full" setting, agents are given the contents of the research paper (paper.txt) as part of the repository. In the "No" setting, the paper is not provided.
Example
bash run_exp_from_env.sh \
--combined-id 2205.00048_0 \
--agent ReAct \
--memory SlidingWindow \
--model-engine gpt-4o \
--retrieval no
Any combination of agent architecture and model backbone can be tested.
--agent: one of [ReAct, Planning, MLAgentBench]. These are the prompting techniques described in the paper.
--memory: one of [Full, SlidingWindow, Summary]. Note that the SlidingWindow and Summary history-management strategies take an additional parameter --lookback k, as described in the paper.
--model-engine: any LLM can be used via the litellm library. The ones used in the paper are [gpt-4o, gpt-4o-mini, anthropic/claude-3-5-sonnet-20240620, anthropic/claude-3-7-sonnet-20250219, o1, o3-mini].
Examples
bash run_exp_from_env.sh \
--combined-id 2205.00048_0 \
--agent ReAct \
--memory Full \
--model-engine gpt-4o
bash run_exp_from_env.sh \
--combined-id 2205.00048_0 \
--agent Planning \
--memory SlidingWindow \
--lookback 1 \
--model-engine gpt-4o
bash run_exp_from_env.sh \
--combined-id 2205.00048_0 \
--agent MLAgentBench \
--memory Summary \
--lookback 5 \
--model-engine anthropic/claude-3-5-sonnet-20240620
We make use of wandb sweeps to organize batched runs over multiple samples. Use the template YAML files in the deploy directory and run
python deploy_sweeps.py example.yml
This creates a wandb sweep. See the wandb documentation for more information about sweeps.
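Runs within the sweep can then be launched with the standard wandb CLI; the sweep ID below is a placeholder for the ID printed when the sweep is created:
wandb agent <entity>/<project>/<sweep_id>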
To ensure safety, we recommend running agents in a Docker container. The image we used is provided in utils/Dockerfile. See the Docker documentation to learn more about using Docker.
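A minimal sketch of building and running the image from the repository root (the image tag is arbitrary; depending on your setup you will likely also need to pass API keys and mount the repository into the container):
docker build -f utils/Dockerfile -t mlrc-agent .
docker run -it --rm mlrc-agent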