Core Paper: ReCode: Unify Plan and Action for Universal Granularity Control (arXiv:2510.23564v2)
- 📄 Core Paper: 2510.23564v2.md - MUST READ!!!
- 📋 Specification: dev-spec/
- 📝 Dev Worklogs: .worklogs/
- 🔬 Paper Demo Code: .dev-docs/ - Python academic prototype
- 📚 Codex CLI Docs: .knowledge/codex-cli/docs/
- 💻 Codex TypeScript SDK: .knowledge/codex-cli/sdk/typescript/
- 🐳 Harbor Environment Guide: .claude/skills/ReCodeAgent-TB2-Evaluate/
Project Status: Production-ready Rust implementation with Harbor integration for Terminal-Bench 2.0 evaluation
Updated Date: 2025-11-22
Project Objective: Productionize the ReCode research paradigm into a high-performance Rust Core + Codex CLI integrated recursive code generation system. Now featuring Harbor Container-Unified Architecture for Terminal-Bench 2.0 benchmark evaluation.
ReCodeAgent is a production implementation of the academic paper ReCode: Unify Plan and Action for Universal Granularity Control. It uses a hybrid architecture (Rust Orchestrator + Codex CLI Executor) to enable dynamic granularity control, moving from fixed-granularity decision-making toward universal programming agents.
- 🔄 Recursive Code Generation: Placeholder functions auto-expand into executable code
- ⚡ High-Performance Rust Core: DFS tree traversal, AST parsing, checkpoint mechanism
- 🔌 Codex CLI Integration: LLM calls via `codex exec --json`, authenticated with `~/.codex/auth.json`
- 🐳 Harbor Container-Unified Architecture: Seamless Terminal-Bench 2.0 evaluation
- 🛠️ Tool Ecosystem Integration: File editing, command execution, environment interaction
- 🔒 Production-Grade Reliability: Type safety, memory efficiency, JSONL event streaming
# Navigate to Harbor workspace
cd ~/harbor-workspace
# Run a single task
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent
# Specify prompt template
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent \
--agent-kwarg template=recode_tb2_prompt.jinja2
# Limit max steps + debug mode
harbor run -d terminal-bench@2.0 -t password-recovery -a recode-agent \
--agent-kwarg max_steps=50 --debug
# Batch run all tasks (4 concurrent)
harbor run -d terminal-bench@2.0 -a recode-agent -n 4
# Quick smoke test (10 steps)
cargo run --example terminal_bench_smoke --release
# CLI subcommand test
cargo run --release --manifest-path recode-core/Cargo.toml -- \
execute --task-name test --instruction "test task" --working-dir /tmp --max-steps 5
# Latest task results
ls -lt ~/harbor-workspace/jobs/ | head -3
# Execution logs
cat ~/harbor-workspace/jobs/<job-id>/<task-id>/agent/command-2/stdout.txt | tail -50
# Verification result (0.0 = fail, 1.0 = success)
cat ~/harbor-workspace/jobs/<job-id>/<task-id>/verifier/reward.txt
┌─────────────────────────────────────────────────────────────────┐
│ macOS Development Environment (Host) │
├─────────────────────────────────────────────────────────────────┤
│ ReCodeAgent Repository │
│ ~/dev-space/ReCodeAgent/ │
│ ├── recode-core/src/ # Rust source code │
│ └── recode-core/templates/ # Jinja2 Prompt templates │
├─────────────────────────────────────────────────────────────────┤
│ Harbor Installation │
│ ~/.local/share/uv/tools/harbor/.../agents/installed/ │
│ ├── recode_agent.py # Harbor Agent definition │
│ └── recode-assets/ # Deployment assets │
│ ├── recode-agent # Linux x86_64 binary │
│ ├── templates/ # Jinja2 templates │
│ └── scripts/ # Python bridge scripts │
└─────────────────────────────────────────────────────────────────┘
│ Docker Volume
│ ${HOME}/.codex:/tmp/host-codex:ro
▼
┌─────────────────────────────────────────────────────────────────┐
│ Docker Container (Terminal-Bench 2.0) │
├─────────────────────────────────────────────────────────────────┤
│ /app/ # Working directory │
│ ├── recode-agent # ReCodeAgent binary │
│ ├── AGENTS.md # Rendered system prompt │
│ │ # (auto-loaded by Codex) │
│ ├── instruction.md # Task instruction │
│ └── .codex/ # Codex CLI config │
│ ├── auth.json # Auth (from host) │
│ └── config.toml # Model configuration │
└─────────────────────────────────────────────────────────────────┘
harbor run -d terminal-bench@2.0 -t <task> -a recode-agent
│
├──▶ Step 1: Setup Codex auth (copy auth.json, config.toml)
├──▶ Step 2: Render AGENTS.md (recode-agent render-template)
├──▶ Step 3: Execute task (recode-agent execute + codex exec)
│ └── DFS tree traversal + checkpoint self-verification
└──▶ Step 4: Cleanup
enum Command {
/// Legacy: Run with explicit bridge configuration
Run { env_kind, python, bridge, bridge_args, instruction },
/// Render AGENTS.md template (Harbor Step 2)
RenderTemplate { template, output, task_name, instruction_path },
/// Execute task (Harbor Step 3) - Core execution engine
Execute { task_name, instruction, working_dir, max_steps, codex_home },
}
recode-core/
├── Cargo.toml
├── Dockerfile # Container image config
│
├── examples/
│ └── terminal_bench_smoke.rs # Quick local test (10 steps)
│
├── harbor-assets/ # Harbor deployment assets
│ ├── recode-agent # Linux x86_64 binary
│ └── templates/ # Deployment templates
│
├── templates/ # Jinja2 Prompt templates (source)
│ ├── recode_tb2_agents_md.jinja2 # Default AGENTS.md template
│ ├── recode_tb2_prompt.jinja2 # TB2 task prompt
│ ├── recode_microtexecute_tb2_prompt.jinja2 # Codex expansion
│ └── recode_tb2_checkpoint_minimal.jinja2 # Checkpoint
│
├── src/
│ ├── main.rs # CLI entry (run, render-template, execute)
│ ├── codex/
│ │ └── thread_manager.rs # Codex CLI integration
│ ├── execution/
│ │ └── python_executor.rs # Python code execution
│ ├── orchestrator/
│ │ ├── engine.rs # Codex prompt assembly
│ │ └── runtime.rs # DFS tree + checkpoint mechanism
│ └── tree/
│ ├── context.rs
│ └── node.rs
│
└── tests/
└── fixtures/
We model LLM-based agent interaction with the environment as a simplified decision process:
$$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R \rangle$$
Where:
- $\mathcal{S}$: State space
- $\mathcal{A}$: Primitive action space (executable operations like `run('crack egg')`)
- $\mathcal{O}$: Observation space
- $T: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$: Transition function
- $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$: Reward function
Beyond primitive actions, we introduce a plan space $\mathcal{P}$ of coarser-grained intentions (such as `prepare_breakfast()`).
Decision space:
$$\mathcal{D} = \mathcal{A} \cup \mathcal{P}$$
Key Insight: Plans and actions, though seemingly different, can be unified into a single executable code representation.
- Actions (primitive): Executable operations like `run('click the submit button')`
- Plans (abstract): Unimplemented placeholder functions like `prepare_breakfast()`, `get_ingredients()`
This unified representation enables seamless transitions between planning and execution.
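The unified representation can be pictured as a single type with two variants. The sketch below is illustrative only: the `Decision` enum, `classify`, and `classify_block` are hypothetical names, and it assumes that primitive actions are exactly the calls of the form `run(...)`, as in the examples above.

```rust
/// One line of generated code: directly executable, or still abstract.
#[derive(Debug, Clone, PartialEq)]
enum Decision {
    /// Primitive action, e.g. run('click the submit button').
    Action(String),
    /// Placeholder function awaiting expansion, e.g. prepare_breakfast().
    Plan(String),
}

/// Classify a single line of code.
/// Assumption: anything that is not a `run(...)` call is a placeholder.
fn classify(line: &str) -> Decision {
    let line = line.trim();
    if line.starts_with("run(") && line.ends_with(')') {
        Decision::Action(line.to_string())
    } else {
        Decision::Plan(line.to_string())
    }
}

/// Classify every non-empty line of a generated code block, in order.
fn classify_block(code: &str) -> Vec<Decision> {
    code.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(classify)
        .collect()
}
```

Because both variants live in one type, a planner and an executor can walk the same list of decisions without converting between formats.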
Algorithm 1: The ReCode Algorithm
Procedure ReCode(T, π, E, c):
if c is None: // Initialize
o_0 ← Reset(E) // Reset environment
c ← Text2Code(T, o_0) // Convert task to root placeholder
end if
code_block ← π(c) // LLM generates child code
for each child u in code_block:
if IsPrimitive(u): // Primitive action
Execute(u, E)
else: // Placeholder function
ReCode(T, π, E, u) // Recursive expansion
end if
end for
end procedure
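The recursive loop of Algorithm 1 can be sketched in Rust as follows. This is a toy model under stated assumptions, not the code in `recode-core`: `TablePolicy` replaces the LLM policy π with a fixed lookup table, `RecordingEnv` replaces the environment E with a log of executed actions, primitives are assumed to be `run(...)` calls, and the initialization step (Reset and Text2Code) is skipped by starting directly from a root placeholder.

```rust
use std::collections::HashMap;

/// Stand-in for the policy π: maps a placeholder to its child code lines.
struct TablePolicy {
    expansions: HashMap<String, Vec<String>>,
}

impl TablePolicy {
    /// Unknown placeholders expand to nothing in this toy model.
    fn expand(&self, placeholder: &str) -> Vec<String> {
        self.expansions.get(placeholder).cloned().unwrap_or_default()
    }
}

/// Stand-in for the environment E: records executed primitive actions.
#[derive(Default)]
struct RecordingEnv {
    executed: Vec<String>,
}

impl RecordingEnv {
    fn execute(&mut self, action: &str) {
        self.executed.push(action.to_string());
    }
}

/// Assumption: primitive actions are calls to `run(...)`.
fn is_primitive(code: &str) -> bool {
    code.starts_with("run(")
}

/// Depth-first expansion of `current`, mirroring the for-loop of Algorithm 1.
/// No depth bound is enforced here, so a cyclic table would recurse forever.
fn recode(policy: &TablePolicy, env: &mut RecordingEnv, current: &str) {
    for child in policy.expand(current) {
        if is_primitive(&child) {
            env.execute(&child);
        } else {
            recode(policy, env, &child);
        }
    }
}

/// Runs a toy breakfast task and returns the order of executed actions.
fn demo_breakfast() -> Vec<String> {
    let mut expansions = HashMap::new();
    expansions.insert(
        "solve()".to_string(),
        vec!["get_ingredients()".to_string(), "run('crack egg')".to_string()],
    );
    expansions.insert(
        "get_ingredients()".to_string(),
        vec!["run('open fridge')".to_string(), "run('take egg')".to_string()],
    );
    let policy = TablePolicy { expansions };
    let mut env = RecordingEnv::default();
    recode(&policy, &mut env, "solve()");
    env.executed
}
```

Calling `demo_breakfast()` executes the two actions under `get_ingredients()` before `run('crack egg')`, which is the depth-first order the algorithm prescribes.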
- Task Initialization: Task instruction → root placeholder function `solve(instruction, observation)`
- Context Management: Unified variable namespace, persisted across recursion levels
- Error Handling: Self-correction loop (max_rewrite=5)
- Recursion Control: Maximum recursion depth 10
- Checkpoint Mechanism: Completing the DFS tree does not prove the task is solved, so a checkpoint is injected for agent self-verification
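The two numeric limits above (max_rewrite=5 and maximum recursion depth 10) could be enforced roughly as in the sketch below. It is a hypothetical illustration: `expand_with_retries` and `ExpandError` are invented names, the root is assumed to sit at depth 0, and max_rewrite is interpreted as five generation attempts in total, which the source does not specify.

```rust
const MAX_REWRITE: u32 = 5;
const MAX_DEPTH: u32 = 10;

#[derive(Debug, PartialEq)]
enum ExpandError {
    /// The node sits at or beyond the assumed depth limit.
    DepthExceeded,
    /// Every attempt failed; carries the last error message.
    RewritesExhausted(String),
}

/// Calls `generate` until it succeeds or MAX_REWRITE attempts are used.
/// Each retry receives the previous error so the generator can self-correct.
fn expand_with_retries<F>(depth: u32, mut generate: F) -> Result<String, ExpandError>
where
    F: FnMut(Option<&str>) -> Result<String, String>,
{
    // Assumption: root is depth 0, so depths 0..=9 are allowed.
    if depth >= MAX_DEPTH {
        return Err(ExpandError::DepthExceeded);
    }
    let mut last_error: Option<String> = None;
    for _ in 0..MAX_REWRITE {
        let attempt = generate(last_error.as_deref());
        match attempt {
            Ok(code) => return Ok(code),
            Err(message) => last_error = Some(message),
        }
    }
    Err(ExpandError::RewritesExhausted(last_error.unwrap_or_default()))
}
```

In a real run the closure would wrap an LLM call plus a syntax or execution check; here it is left abstract so the control flow stays visible.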
- Rust 1.83+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- Docker (for cross-compilation and Harbor)
- Codex CLI (`~/.codex/auth.json` configured)
- Harbor Framework (`uv tool install harbor`)
# Local macOS build (development)
cargo build --release --manifest-path recode-core/Cargo.toml
# Linux x86_64 build (Harbor container)
docker build --platform linux/amd64 -f Dockerfile.build-x86 -t recode-builder .
docker create --name tmp recode-builder
docker cp tmp:/build/recode-core/target/release/recode-core ./recode-agent-linux-x86_64
docker rm tmp
# Verify binary
file ./recode-agent-linux-x86_64
# Should output: ELF 64-bit LSB pie executable, x86-64...
HARBOR_ASSETS=~/.local/share/uv/tools/harbor/lib/python3.13/site-packages/harbor/agents/installed/recode-assets
# Sync binary
cp ./recode-agent-linux-x86_64 $HARBOR_ASSETS/recode-agent
chmod +x $HARBOR_ASSETS/recode-agent
# Sync templates
cp -r recode-core/templates/*.jinja2 $HARBOR_ASSETS/templates/
# Sync scripts
cp scripts/terminal_bench_bridge.py $HARBOR_ASSETS/scripts/
cargo test # All tests
cargo test --test codex_turn_tests # Codex integration tests
cargo clippy # Code linting
| Template | Purpose | Default |
|---|---|---|
| `recode_tb2_agents_md.jinja2` | AGENTS.md system prompt | ✓ |
| `recode_tb2_prompt.jinja2` | TB2 task prompt | |
| `recode_microtexecute_tb2_prompt.jinja2` | Codex expansion | |
| `recode_tb2_checkpoint_minimal.jinja2` | Checkpoint verification | |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `template` | string | `recode_tb2_agents_md.jinja2` | Jinja2 template filename |
| `max_steps` | int | 99999 | DFS tree maximum steps |
Usage: --agent-kwarg template=xxx --agent-kwarg max_steps=100
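To make the key=value convention and the defaults concrete, here is a small sketch of how such arguments might be parsed. It is not Harbor's own parser: `AgentKwargs` and `parse_agent_kwargs` are hypothetical names, and rejecting unknown keys is an assumption made for the example.

```rust
#[derive(Debug, PartialEq)]
struct AgentKwargs {
    template: String,
    max_steps: u32,
}

impl Default for AgentKwargs {
    /// Defaults copied from the parameter table.
    fn default() -> Self {
        Self {
            template: "recode_tb2_agents_md.jinja2".to_string(),
            max_steps: 99_999,
        }
    }
}

/// Parses `key=value` strings, starting from the defaults.
fn parse_agent_kwargs(pairs: &[&str]) -> Result<AgentKwargs, String> {
    let mut kwargs = AgentKwargs::default();
    for pair in pairs {
        let (key, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("expected key=value, got `{}`", pair))?;
        match key {
            "template" => kwargs.template = value.to_string(),
            "max_steps" => {
                kwargs.max_steps = value
                    .parse()
                    .map_err(|e| format!("invalid max_steps `{}`: {}", value, e))?;
            }
            // Assumption: unknown keys are an error rather than ignored.
            other => return Err(format!("unknown agent kwarg `{}`", other)),
        }
    }
    Ok(kwargs)
}
```

For instance, `parse_agent_kwargs(&["max_steps=100"])` keeps the default template and only overrides the step limit.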
- Authentication: `~/.codex/auth.json` (copied to container `/app/.codex/`)
- Model config: `~/.codex/config.toml` (default: gpt-5.1-codex-max)
- AGENTS.md: Auto-discovered and loaded by Codex (95%+ token savings)
- Command: `codex exec --json` for JSONL event streaming
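JSONL means one JSON document per line, so a consumer can process events as they arrive instead of waiting for the process to exit. The sketch below shows only that line-splitting step using the standard library. The function name `read_jsonl_events` is hypothetical, the event schema is not described in this document, and the sketch therefore keeps each event as a raw string rather than guessing at fields.

```rust
use std::io::{self, BufRead};

/// Collects one raw JSON document per non-empty line of `reader`.
/// In practice `reader` would wrap the child process's stdout.
fn read_jsonl_events<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut events = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        // Blank lines carry no event and are skipped.
        if !trimmed.is_empty() {
            events.push(trimmed.to_string());
        }
    }
    Ok(events)
}
```

A real integration would typically hand each string to a JSON parser such as serde_json, which is already listed among the project's dependencies.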
| Document | Description |
|---|---|
| WARP.md | Quick reference guide for WARP/Claude Code |
| CLAUDE.md | Claude Code instructions |
| Harbor ENV Guide | Detailed Harbor operation manual |
| Architecture | Technical architecture specification |
| Roadmap | Implementation roadmap |
ReCodeAgent is evaluated on Terminal-Bench 2.0, a benchmark for terminal-based task automation.
# Run evaluation
harbor run -d terminal-bench@2.0 -a recode-agent -n 4
# View results
cat ~/harbor-workspace/jobs/<job-id>/result.json | jq '.stats'
- tokio - Async runtime
- clap - CLI argument parsing
- serde / serde_json - Serialization / JSONL parsing
- minijinja - Jinja2 template rendering
- tree-sitter - High-performance AST parsing
- tracing - Structured logging
- Codex CLI - LLM calls and tool execution
- Harbor Framework - Benchmark evaluation platform
- Docker - Container runtime
Apache License 2.0 - See LICENSE file for details.
- ReCode Paper - Yu et al., 2025
- Terminal-Bench 2.0 - Benchmark dataset
- Harbor Framework - Evaluation platform
- OpenAI Codex CLI
- Architecture Design: RECODE_ARCHITECTURE_V0.1.0.md
- Specification: dev-spec/
- Harbor Docs: https://harborframework.com/docs/running-tbench
Last Updated: 2025-11-22
Version: v0.2.0
Status: ✅ Production-ready with Harbor Terminal-Bench 2.0 integration