ReCodeAgent - Universal Recursive Code Generation Agent

Core Paper: ReCode: Unify Plan and Action for Universal Granularity Control (arXiv:2510.23564v2)

📄 Core Paper (Markdown copy): 2510.23564v2.md (required reading)
📋 Specification: dev-spec/
📝 Dev Worklogs: .worklogs/
🔬 Paper Demo Code: .dev-docs/ - Python academic prototype
📚 Codex CLI Docs: .knowledge/codex-cli/docs/
💻 Codex TypeScript SDK: .knowledge/codex-cli/sdk/typescript/
🐳 Harbor Environment Guide: .claude/skills/ReCodeAgent-TB2-Evaluate/

Project Status: Production-ready Rust implementation with Harbor integration for Terminal-Bench 2.0 evaluation
Updated Date: 2025-11-22

Project Objective: Turn the ReCode research paradigm into a production system: a high-performance Rust core integrated with Codex CLI for recursive code generation. The project now includes the Harbor Container-Unified Architecture for Terminal-Bench 2.0 benchmark evaluation.

📖 Project Overview

ReCodeAgent is a production implementation of the academic paper ReCode: Unify Plan and Action for Universal Granularity Control. It uses a hybrid architecture (Rust orchestrator + Codex CLI executor) to give agents dynamic control over decision granularity, moving from fixed-granularity decision-making toward universal programming agents.

Core Value Propositions

🔄 Recursive Code Generation: Placeholder functions auto-expand into executable code
⚡ High-Performance Rust Core: DFS tree traversal, AST parsing, checkpoint mechanism
🔌 Codex CLI Integration: LLM calls via codex exec --json, authenticated with ~/.codex/auth.json
🐳 Harbor Container-Unified Architecture: Seamless Terminal-Bench 2.0 evaluation
🛠️ Tool Ecosystem Integration: File editing, command execution, environment interaction
🔒 Production-Grade Reliability: Type safety, memory efficiency, JSONL event streaming

🚀 Quick Start

Running with Harbor (Recommended)

# Navigate to Harbor workspace
cd ~/harbor-workspace

# Run a single task
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent

# Specify prompt template
harbor run -d terminal-bench@2.0 -t regex-log -a recode-agent \
  --agent-kwarg template=recode_tb2_prompt.jinja2

# Limit max steps + debug mode
harbor run -d terminal-bench@2.0 -t password-recovery -a recode-agent \
  --agent-kwarg max_steps=50 --debug

# Batch run all tasks (4 concurrent)
harbor run -d terminal-bench@2.0 -a recode-agent -n 4

Local Development Testing

# Quick smoke test (10 steps)
cargo run --example terminal_bench_smoke --release

# CLI subcommand test
cargo run --release --manifest-path recode-core/Cargo.toml -- \
  execute --task-name test --instruction "test task" --working-dir /tmp --max-steps 5

View Results

# Latest task results
ls -lt ~/harbor-workspace/jobs/ | head -3

# Execution logs
tail -n 50 ~/harbor-workspace/jobs/<job-id>/<task-id>/agent/command-2/stdout.txt

# Verification result (0.0 = fail, 1.0 = success)
cat ~/harbor-workspace/jobs/<job-id>/<task-id>/verifier/reward.txt
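
When scripting over many jobs, it can help to turn the contents of reward.txt into a pass/fail value. The following is a minimal sketch, assuming the file holds a single number as described above; the function name `parse_reward` is hypothetical and not part of ReCodeAgent or Harbor.

```rust
/// Interpret the text of a verifier `reward.txt` file.
/// Returns `Some(true)` for 1.0, `Some(false)` for 0.0, and `None`
/// for anything else (unparseable text or an unexpected value).
fn parse_reward(contents: &str) -> Option<bool> {
    let value: f64 = contents.trim().parse().ok()?;
    if value == 1.0 {
        Some(true)
    } else if value == 0.0 {
        Some(false)
    } else {
        None
    }
}

fn main() {
    // Sample inputs only; real contents come from the verifier output file.
    for sample in ["1.0\n", "0.0", "not-a-number"] {
        println!("{:?} -> {:?}", sample, parse_reward(sample));
    }
}
```

Returning `None` for unexpected values keeps a malformed file from being silently counted as a failure.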

🏗️ Architecture

Harbor Container-Unified Architecture

┌─────────────────────────────────────────────────────────────────┐
│                   macOS Development Environment (Host)          │
├─────────────────────────────────────────────────────────────────┤
│  ReCodeAgent Repository                                         │
│  ~/dev-space/ReCodeAgent/                                       │
│  ├── recode-core/src/           # Rust source code              │
│  └── recode-core/templates/     # Jinja2 Prompt templates       │
├─────────────────────────────────────────────────────────────────┤
│  Harbor Installation                                            │
│  ~/.local/share/uv/tools/harbor/.../agents/installed/           │
│  ├── recode_agent.py            # Harbor Agent definition       │
│  └── recode-assets/             # Deployment assets             │
│      ├── recode-agent           # Linux x86_64 binary           │
│      ├── templates/             # Jinja2 templates              │
│      └── scripts/               # Python bridge scripts         │
└─────────────────────────────────────────────────────────────────┘
                              │ Docker Volume
                              │ ${HOME}/.codex:/tmp/host-codex:ro
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│             Docker Container (Terminal-Bench 2.0)               │
├─────────────────────────────────────────────────────────────────┤
│  /app/                        # Working directory               │
│  ├── recode-agent             # ReCodeAgent binary              │
│  ├── AGENTS.md                # Rendered system prompt          │
│  │                            # (auto-loaded by Codex)          │
│  ├── instruction.md           # Task instruction                │
│  └── .codex/                  # Codex CLI config                │
│      ├── auth.json            # Auth (from host)                │
│      └── config.toml          # Model configuration             │
└─────────────────────────────────────────────────────────────────┘

Execution Flow

harbor run -d terminal-bench@2.0 -t <task> -a recode-agent
    │
    ├──▶ Step 1: Setup Codex auth (copy auth.json, config.toml)
    ├──▶ Step 2: Render AGENTS.md (recode-agent render-template)
    ├──▶ Step 3: Execute task (recode-agent execute + codex exec)
    │             └── DFS tree traversal + checkpoint self-verification
    └──▶ Step 4: Cleanup

CLI Subcommand Structure

// Abbreviated sketch: field types are omitted for brevity.
enum Command {
    /// Legacy: Run with explicit bridge configuration
    Run { env_kind, python, bridge, bridge_args, instruction },

    /// Render AGENTS.md template (Harbor Step 2)
    RenderTemplate { template, output, task_name, instruction_path },

    /// Execute task (Harbor Step 3) - Core execution engine
    Execute { task_name, instruction, working_dir, max_steps, codex_home },
}

📁 Project Structure

recode-core/
├── Cargo.toml
├── Dockerfile                 # Container image config
│
├── examples/
│   └── terminal_bench_smoke.rs  # Quick local test (10 steps)
│
├── harbor-assets/             # Harbor deployment assets
│   ├── recode-agent           # Linux x86_64 binary
│   └── templates/             # Deployment templates
│
├── templates/                 # Jinja2 Prompt templates (source)
│   ├── recode_tb2_agents_md.jinja2      # Default AGENTS.md template
│   ├── recode_tb2_prompt.jinja2         # TB2 task prompt
│   ├── recode_microtexecute_tb2_prompt.jinja2  # Codex expansion
│   └── recode_tb2_checkpoint_minimal.jinja2    # Checkpoint
│
├── src/
│   ├── main.rs                # CLI entry (run, render-template, execute)
│   ├── codex/
│   │   └── thread_manager.rs  # Codex CLI integration
│   ├── execution/
│   │   └── python_executor.rs # Python code execution
│   ├── orchestrator/
│   │   ├── engine.rs          # Codex prompt assembly
│   │   └── runtime.rs         # DFS tree + checkpoint mechanism
│   └── tree/
│       ├── context.rs
│       └── node.rs
│
└── tests/
    └── fixtures/

🏗️ ReCode Core Methodology

Decision Process Formulation

We model LLM-based agent interaction with the environment as a simplified decision process:

$$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, T, R \rangle$$

Where:

$\mathcal{S}$: State space
$\mathcal{A}$: Primitive action space (executable operations like run('crack egg'))
$\mathcal{O}$: Observation space
$T: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$: Transition function
$R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$: Reward function

Beyond primitive actions, we introduce a plan space $\mathcal{P}$ containing high-level intentions that require decomposition (e.g., prepare_breakfast()).

Decision space: $\mathcal{D} = \mathcal{A} \cup \mathcal{P}$

Unified Plan & Action Representation

Key Insight: Plans and actions, though seemingly different, can be unified into a single executable code representation.

Actions (primitive): Executable operations like run('click the submit button')
Plans (abstract): Unimplemented placeholder functions like prepare_breakfast(), get_ingredients()

This unified representation enables seamless transitions between planning and execution.
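
One way to picture the decision space $\mathcal{D} = \mathcal{A} \cup \mathcal{P}$ is as a two-variant type. The sketch below is illustrative only: the `Decision` enum and the `classify` rule (any `run(...)` call is primitive, everything else is a placeholder) are assumptions made for this example, not the types used in recode-core.

```rust
/// A member of the decision space D = A ∪ P.
#[derive(Debug, Clone, PartialEq)]
enum Decision {
    /// Primitive action: directly executable, e.g. run('click the submit button').
    Action(String),
    /// Plan: an unimplemented placeholder call, e.g. prepare_breakfast().
    Plan(String),
}

/// Hypothetical classifier: a call to `run(...)` is treated as primitive,
/// any other line is treated as a placeholder that still needs expansion.
fn classify(line: &str) -> Decision {
    let line = line.trim();
    if line.starts_with("run(") && line.ends_with(')') {
        Decision::Action(line.to_string())
    } else {
        Decision::Plan(line.to_string())
    }
}

fn main() {
    for line in ["run('crack egg')", "prepare_breakfast()", "get_ingredients()"] {
        println!("{:?}", classify(line));
    }
}
```

Because both variants carry the same payload (a line of code), a plan can be replaced by its expansion without changing how the surrounding code is stored.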

Recursive Code Generation Algorithm

Algorithm 1: The ReCode Algorithm

Procedure ReCode(T, π, E, c):
    if c is None:                      // Initialize
        o_0 ← Reset(E)                 // Reset environment
        c ← Text2Code(T, o_0)          // Convert task to root placeholder
    end if

    code_block ← π(c)                  // LLM generates child code

    for each child u in code_block:
        if IsPrimitive(u):             // Primitive action
            Execute(u, E)
        else:                          // Placeholder function
            ReCode(T, π, E, u)         // Recursive expansion
        end if
    end for
end procedure
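
The recursion in Algorithm 1 can be sketched in Rust with a lookup table standing in for the LLM policy π. Everything here is an assumption made for illustration: the `Policy` type, the `is_primitive` rule, and the use of a `Vec` as the environment are hypothetical, and the sketch omits the depth limit, error handling, and checkpoints described elsewhere in this README.

```rust
use std::collections::HashMap;

/// Stand-in for the LLM policy π: maps a placeholder to its child code lines.
type Policy = HashMap<&'static str, Vec<&'static str>>;

/// Assumption: a line is primitive when it is a call to `run(...)`.
fn is_primitive(line: &str) -> bool {
    line.starts_with("run(")
}

/// Depth-first expansion: execute primitives, recurse into placeholders.
/// `executed` plays the role of the environment E by recording each action.
fn recode(node: &str, policy: &Policy, executed: &mut Vec<String>) {
    // Unknown placeholders expand to nothing in this sketch.
    let children = policy.get(node).cloned().unwrap_or_default();
    for child in children {
        if is_primitive(child) {
            executed.push(child.to_string());
        } else {
            recode(child, policy, executed);
        }
    }
}

/// A fixed toy policy based on the breakfast example.
fn demo_policy() -> Policy {
    let mut p = Policy::new();
    p.insert("solve()", vec!["get_ingredients()", "prepare_breakfast()"]);
    p.insert("get_ingredients()", vec!["run('open fridge')", "run('take egg')"]);
    p.insert("prepare_breakfast()", vec!["run('crack egg')"]);
    p
}

fn main() {
    let mut executed = Vec::new();
    recode("solve()", &demo_policy(), &mut executed);
    println!("{:?}", executed);
}
```

The actions are recorded in depth-first order, so every primitive under get_ingredients() runs before prepare_breakfast() is expanded.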

Implementation Details

Task Initialization: Task instruction → root placeholder function solve(instruction, observation)
Context Management: Unified variable namespace, persisted across recursion levels
Error Handling: Self-correction loop (max_rewrite=5)
Recursion Control: Maximum recursion depth 10
Checkpoint Mechanism: Completing the DFS tree does not guarantee that the task is solved, so a checkpoint is injected for agent self-verification
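
The error-handling and recursion limits above can be combined into one guard around each expansion. This is a sketch under stated assumptions, not the recode-core implementation: it treats max_rewrite=5 as five attempts in total, treats a depth greater than 10 as exceeding the limit, and uses a hypothetical `attempt` closure in place of "ask the LLM, then run the generated code".

```rust
const MAX_REWRITE: u32 = 5; // assumed to mean five attempts in total
const MAX_DEPTH: u32 = 10; // assumed: depth values above 10 are rejected

#[derive(Debug, PartialEq)]
enum ExpandError {
    DepthExceeded,
    /// Carries the last error message seen before giving up.
    RewritesExhausted(String),
}

/// Try to expand a node, feeding the previous error back into each retry.
fn expand_with_retries<F>(depth: u32, mut attempt: F) -> Result<String, ExpandError>
where
    F: FnMut(Option<&str>) -> Result<String, String>,
{
    if depth > MAX_DEPTH {
        return Err(ExpandError::DepthExceeded);
    }
    let mut last_error: Option<String> = None;
    for _ in 0..MAX_REWRITE {
        let outcome = attempt(last_error.as_deref());
        match outcome {
            Ok(code) => return Ok(code),
            Err(message) => last_error = Some(message),
        }
    }
    Err(ExpandError::RewritesExhausted(last_error.unwrap_or_default()))
}

fn main() {
    let mut calls = 0;
    // Toy attempt: fails twice, then succeeds.
    let result = expand_with_retries(1, |_feedback| {
        calls += 1;
        if calls < 3 {
            Err(format!("attempt {calls} failed"))
        } else {
            Ok("run('ls')".to_string())
        }
    });
    println!("{:?} after {} attempts", result, calls);
}
```

Passing the previous error to the next attempt is what makes the loop self-correcting rather than a plain retry.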

🛠️ Development

Prerequisites

Rust 1.83+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
Docker (for cross-compilation and Harbor)
Codex CLI (~/.codex/auth.json configured)
Harbor Framework (uv tool install harbor)

Build Commands

# Local macOS build (development)
cargo build --release --manifest-path recode-core/Cargo.toml

# Linux x86_64 build (Harbor container)
docker build --platform linux/amd64 -f Dockerfile.build-x86 -t recode-builder .
docker create --name tmp recode-builder
docker cp tmp:/build/recode-core/target/release/recode-core ./recode-agent-linux-x86_64
docker rm tmp

# Verify binary
file ./recode-agent-linux-x86_64
# Should output: ELF 64-bit LSB pie executable, x86-64...

Sync to Harbor

HARBOR_ASSETS=~/.local/share/uv/tools/harbor/lib/python3.13/site-packages/harbor/agents/installed/recode-assets

# Sync binary
cp ./recode-agent-linux-x86_64 "$HARBOR_ASSETS/recode-agent"
chmod +x "$HARBOR_ASSETS/recode-agent"

# Sync templates
cp -r recode-core/templates/*.jinja2 "$HARBOR_ASSETS/templates/"

# Sync scripts
cp scripts/terminal_bench_bridge.py "$HARBOR_ASSETS/scripts/"

Testing

cargo test                              # All tests
cargo test --test codex_turn_tests     # Codex integration tests
cargo clippy                            # Code linting

⚙️ Configuration

Prompt Templates

Template	Purpose	Default
`recode_tb2_agents_md.jinja2`	AGENTS.md system prompt	✓
`recode_tb2_prompt.jinja2`	TB2 task prompt
`recode_microtexecute_tb2_prompt.jinja2`	Codex expansion
`recode_tb2_checkpoint_minimal.jinja2`	Checkpoint verification

Harbor Agent Parameters

Parameter	Type	Default	Description
`template`	string	`recode_tb2_agents_md.jinja2`	Jinja2 template filename
`max_steps`	int	99999	DFS tree maximum steps

Usage: --agent-kwarg template=xxx --agent-kwarg max_steps=100
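
To make the default-and-override behaviour of these parameters concrete, the sketch below parses key=value pairs into a struct whose defaults match the table. It is illustrative only: Harbor handles `--agent-kwarg` itself, and the `AgentParams` struct and `parse_agent_kwargs` function are hypothetical names introduced for this example.

```rust
#[derive(Debug, PartialEq)]
struct AgentParams {
    template: String,
    max_steps: u32,
}

impl Default for AgentParams {
    /// Defaults taken from the parameter table.
    fn default() -> Self {
        Self {
            template: "recode_tb2_agents_md.jinja2".to_string(),
            max_steps: 99999,
        }
    }
}

/// Apply `key=value` overrides on top of the defaults.
fn parse_agent_kwargs(kwargs: &[&str]) -> Result<AgentParams, String> {
    let mut params = AgentParams::default();
    for kwarg in kwargs {
        let (key, value) = kwarg
            .split_once('=')
            .ok_or_else(|| format!("expected key=value, got `{kwarg}`"))?;
        match key {
            "template" => params.template = value.to_string(),
            "max_steps" => {
                params.max_steps = value
                    .parse::<u32>()
                    .map_err(|e| format!("invalid max_steps `{value}`: {e}"))?;
            }
            other => return Err(format!("unknown parameter `{other}`")),
        }
    }
    Ok(params)
}

fn main() {
    let params = parse_agent_kwargs(&["template=recode_tb2_prompt.jinja2", "max_steps=100"]);
    println!("{:?}", params);
}
```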

Codex CLI Integration

Authentication: ~/.codex/auth.json (copied to container /app/.codex/)
Model config: ~/.codex/config.toml (default: gpt-5.1-codex-max)
AGENTS.md: Auto-discovered and loaded by Codex (95%+ token savings)
Command: codex exec --json for JSONL event streaming
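
A minimal sketch of how an orchestrator might assemble the `codex exec --json` invocation is shown below. It only builds the command and does not run it. The function name `build_codex_command` is hypothetical, and two details are assumptions rather than facts from this README: that the prompt is passed as a positional argument, and that the `CODEX_HOME` environment variable points Codex at the configuration directory.

```rust
use std::ffi::OsStr;
use std::path::Path;
use std::process::{Command, Stdio};

/// Build (but do not spawn) a `codex exec --json` invocation.
/// Assumptions: positional prompt argument; CODEX_HOME selects the config dir.
fn build_codex_command(prompt: &str, working_dir: &Path, codex_home: &Path) -> Command {
    let mut cmd = Command::new("codex");
    cmd.arg("exec")
        .arg("--json")
        .arg(prompt)
        .current_dir(working_dir)
        .env("CODEX_HOME", codex_home)
        // Piped stdout lets the caller read the JSONL events line by line.
        .stdout(Stdio::piped());
    cmd
}

fn main() {
    let cmd = build_codex_command("list files", Path::new("/app"), Path::new("/app/.codex"));
    let args: Vec<&OsStr> = cmd.get_args().collect();
    println!("{:?} {:?}", cmd.get_program(), args);
}
```

Separating command construction from spawning keeps the argument list easy to inspect in tests without needing Codex CLI installed.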

📚 Documentation

Document	Description
WARP.md	Quick reference guide for WARP/Claude Code
CLAUDE.md	Claude Code instructions
Harbor ENV Guide	Detailed Harbor operation manual
Architecture	Technical architecture specification
Roadmap	Implementation roadmap

📈 Benchmark Results

ReCodeAgent is evaluated on Terminal-Bench 2.0, a benchmark for terminal-based task automation.

# Run evaluation
harbor run -d terminal-bench@2.0 -a recode-agent -n 4

# View results
jq '.stats' ~/harbor-workspace/jobs/<job-id>/result.json

🔬 Technology Stack

Core Dependencies (Rust)

tokio - Async runtime
clap - CLI argument parsing
serde / serde_json - Serialization / JSONL parsing
minijinja - Jinja2 template rendering
tree-sitter - High-performance AST parsing
tracing - Structured logging

External Dependencies

Codex CLI - LLM calls and tool execution
Harbor Framework - Benchmark evaluation platform
Docker - Container runtime

📝 License

Apache License 2.0 - See LICENSE file for details.

🙏 Acknowledgments

ReCode Paper - Yu et al., 2025
Terminal-Bench 2.0 - Benchmark dataset
Harbor Framework - Evaluation platform
OpenAI Codex CLI

Links

Architecture Design: RECODE_ARCHITECTURE_V0.1.0.md
Specification: dev-spec/
Harbor Docs: https://harborframework.com/docs/running-tbench

Last Updated: 2025-11-22
Version: v0.2.0
Status: ✅ Production-ready with Harbor Terminal-Bench 2.0 integration


lwyBZss8924d/ReCodeAgent
