This repository implements three pipelines for automated code generation with Large Language Models (LLMs), developed for Project A2 (Architectures for Code Development with LLMs).
The goal is to compare three architectural approaches:
- Naive Baseline: One-shot generation.
- Single-Agent: Multi-step reasoning (LangGraph).
- Multi-Agent: Collaborative multi-role system.
The naive baseline is a simple, direct approach: the LLM is given the task description and asked to generate the solution in a single pass. It serves as the lower bound for the performance comparison.
```mermaid
flowchart LR
    Start([Task Description]) --> LLM[LLM Generation]
    LLM --> End([Final Code])
```
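As a minimal sketch, a one-shot generation step might look like the following, using the official `ollama` Python client. The function name, prompt wording, and default model tag are illustrative, not the exact contents of `naive_agent.py`:

```python
import ollama  # the official Ollama Python client


def naive_generate(task: str, model: str = "qwen2.5-coder:7b-instruct") -> str:
    """One-shot generation: task in, code out, no intermediate steps."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": "You are a Python programmer. Answer with code only."},
            {"role": "user", "content": task},
        ],
    )
    return response["message"]["content"]
```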
The single-agent pipeline is a structured reasoning flow in which a single agent identity performs multiple discrete steps. We use LangGraph to orchestrate the flow, forcing the model to think before coding and to review its own work.
Stages:
- Analysis: Understand the requirements.
- Planning: Decompose the problem.
- Generation: Write the code based on the plan.
- Review: Execute the code, check quality metrics, and review the result.
- Refinement: Iteratively fix the code based on the review feedback.
```mermaid
flowchart LR
    Start([Task Description]) --> Analysis
    Analysis --> Planning
    Planning --> Generation
    Generation --> Review
    Review --> Refinement
    Refinement -->|Needs Work| Refinement
    Refinement -->|Complete| End([Final Code])
```
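A condensed sketch of how such a LangGraph workflow can be wired. The state schema and the stub node factory are placeholders, not the actual code in `pipeline.py`:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    task: str
    code: str
    approved: bool


def make_stage(name: str):
    """Placeholder node; in the real pipeline each stage prompts the LLM."""
    def stage(state: AgentState) -> dict:
        return {}  # nodes return partial updates to the shared state
    return stage


graph = StateGraph(AgentState)
for name in ["analysis", "planning", "generation", "review", "refinement"]:
    graph.add_node(name, make_stage(name))

graph.set_entry_point("analysis")
graph.add_edge("analysis", "planning")
graph.add_edge("planning", "generation")
graph.add_edge("generation", "review")
graph.add_edge("review", "refinement")

# Mirror the flowchart: refinement loops until the review marks the work complete.
graph.add_conditional_edges(
    "refinement",
    lambda state: "complete" if state.get("approved") else "needs_work",
    {"needs_work": "refinement", "complete": END},
)

pipeline = graph.compile()
# pipeline.invoke({"task": "...", "code": "", "approved": True})
```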
The multi-agent pipeline is a collaborative architecture with specialized agents, each with a distinct role and identity.
Roles:
- Planner Agent: Analyzes the request and creates a detailed implementation plan.
- Coder Agent: Writes code following the plan and feedback.
- Critic Agent: Reviews the code, runs tests, and provides feedback to the Coder.
```mermaid
flowchart LR
    Start([Task Description]) --> Planner[Planner Agent]
    Planner -->|Implementation Plan| Coder[Coder Agent]
    Coder -->|Code| Critic[Critic Agent]
    Critic -->|Feedback| Coder
    Critic -->|Approved| End([Final Code])
```
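In outline, the coordination loop between the three roles might look like this. The agent functions are stubs and the iteration cap is an assumption, not the actual multi-agent code:

```python
MAX_ROUNDS = 3  # assumed cap on coder/critic iterations


def planner_agent(task: str) -> str:
    return f"Plan for: {task}"  # stub; the real agent prompts the LLM


def coder_agent(plan: str, feedback: str | None) -> str:
    return "def solution(): ..."  # stub


def critic_agent(task: str, code: str) -> tuple[bool, str]:
    return True, ""  # stub; the real agent runs tests and reviews the code


def run_multi_agent(task: str) -> str:
    plan = planner_agent(task)               # Planner: task -> implementation plan
    code = coder_agent(plan, feedback=None)  # Coder: first draft
    for _ in range(MAX_ROUNDS):
        approved, feedback = critic_agent(task, code)
        if approved:
            break                            # Critic signs off
        code = coder_agent(plan, feedback)   # Coder revises from the feedback
    return code
```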
The project is structured as follows:

```
scripts/
├── run_naive_agent.py        # Script to run naive baseline
├── run_single_agent.py       # Script to run single agent pipeline
├── run_multi_agent.py        # Script to run multi-agent pipeline
└── batch_test.py             # Script for batch testing

data/
├── test-tasks.json           # Sample tasks for evaluation
└── ...                       # Task categories (math, strings, etc.)

src/
├── core/
│   ├── llm.py                # LLM interfaces and configuration
│   ├── state.py              # Global state definitions
│   ├── naive_baseline/
│   │   └── naive_agent.py    # Naive one-shot implementation
│   │
│   ├── single_agent/
│   │   ├── agent.py          # Step implementations and prompts
│   │   └── pipeline.py       # LangGraph workflow definition
│   │
│   └── multi_agent/
│       └── agents/
│           ├── planner/      # Planner agent logic
│           ├── coder/        # Coder agent logic
│           └── critic/       # Critic agent logic
│
├── tools/
│   └── executor.py           # Code execution sandbox
│
├── utils/
│   ├── code_parser.py        # Utilities for parsing code
│   ├── config.py             # Configuration management
│   ├── task_loader.py        # Loading tasks from data files
│   └── test_runner.py        # Running tests
│
└── evaluation/
    └── quality_metrics.py    # Static analysis metrics
```
We use local, quantized models via Ollama to keep the project cheap, reproducible, and easy to run.
- Qwen2.5-Coder-7B-Instruct: primary single-agent baseline (best quality/cost tradeoff)
- DeepSeek-Coder-V2-Lite (6.7B): small, fast comparison model
- CodeLlama-7B-Instruct: classic baseline for reference
All models run locally through Ollama and can be swapped via configuration without changing the pipeline.
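For example, switching models is just a matter of pulling the new weights and pointing the run script at them via `--model`. The DeepSeek tag below is an assumed Ollama tag; adjust it to whatever `ollama list` reports:

```bash
ollama pull deepseek-coder:6.7b                          # assumed tag
python -m scripts.run_single_agent --model deepseek-coder:6.7b
```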
To get up and running:

1. Start Ollama:

   ```bash
   ollama serve
   ```

2. Pull a model:

   ```bash
   ollama pull qwen2.5-coder:7b-instruct
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Run the pipelines:

   ```bash
   # Naive baseline
   python -m scripts.run_naive_agent

   # Single-agent pipeline
   python -m scripts.run_single_agent

   # Multi-agent pipeline
   python -m scripts.run_multi_agent
   ```
| Flag | Description | Default |
|---|---|---|
| `--task-file` | Path to the input tasks JSON file | `data/test-tasks.json` |
| `--task-id` | Run a specific task by its ID | (runs all tasks) |
| `--model` | Specify the LLM model to use | `qwen2.5-coder:7b-instruct` |
| `--test-file` | Path to an external tests file | `None` |
| `--verbose` | Show detailed execution logs (multi-agent only) | `False` |
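For example, to run just the sample task below through the multi-agent pipeline with detailed logs:

```bash
python -m scripts.run_multi_agent --task-id smoke_test --verbose
```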
Sample task (`data/test-tasks.json`):

```json
[
  {
    "id": "smoke_test",
    "signature": "def add(a, b):",
    "docstring": "Return the sum of a and b.",
    "examples": [
      { "input": "2, 3", "output": "5" },
      { "input": "-1, 1", "output": "0" }
    ],
    "difficulty": "Easy"
  }
]
```
Expected output (example):

```python
def add(a, b):
    return a + b
```
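Conceptually, the task's examples become executable checks. Here is a sketch of how a test runner could turn them into assertions; it is an illustration, not the actual logic of `test_runner.py`, and it hard-codes the sample's `add` for brevity:

```python
def passes_examples(candidate: str, examples: list[dict]) -> bool:
    """Execute the generated code, then check each input/output example."""
    namespace: dict = {}
    exec(candidate, namespace)                      # defines add(a, b)
    for ex in examples:
        got = eval(f"add({ex['input']})", namespace)
        if got != eval(ex["output"]):               # outputs are stored as strings
            return False
    return True


examples = [{"input": "2, 3", "output": "5"}, {"input": "-1, 1", "output": "0"}]
assert passes_examples("def add(a, b):\n    return a + b", examples)
```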
Authors:

- Riccardo Marconi (riccardo.marconi@studenti.polito.it)
- Dorotea Monaco (dorotea.monaco@studenti.polito.it)
- Luigi Gonnella (luigi.gonnella@studenti.polito.it)
- Kevser Gunaydin (kevser.gunaydin@studenti.polito.it)
- Francesco Mina (francesco.mina@studenti.polito.it)