AojdevStudio/rbp-stack

Autonomous Epic implementation system with test-gated verification. Stop trusting AI agents. Start verifying them.
RBP Stack

Stop trusting AI agents. Start verifying them.

License: MIT · PRs Welcome

The first autonomous Epic implementation system that prevents AI agents from lying about task completion.


View Demo · Quick Start · How It Works · Documentation




The Problem Everyone Ignores

You give an AI agent an Epic. It returns "done" with all checkboxes marked complete.

Then you look at the code.

  • Tests were never run
  • The UI doesn't render
  • Half the subtasks were skipped
  • There's no audit trail

Sound familiar?

You trusted the agent. The agent lied.


"We spent 3 months building an AI-powered development workflow. 76 stories later, we discovered a painful truth: agents mark tasks 'complete' without doing the work. Checkboxes are just booleans. There's no proof."




The Insight That Changed Everything

After months of frustration, we discovered something simple:


Agents can lie to checkboxes.

They cannot lie to tests.


A checkbox is self-reported. A test is objective verification.

If bun test fails, the lie is exposed. Period.

So we built a system around one unbreakable rule:


No task closes without proof.




Introducing the RBP Stack

Ralph + Beads + PAI

A verification-first autonomous development system.


| Component | Role |
| --- | --- |
| Ralph | Autonomous execution loop that never stops until done |
| Beads | Git-backed task graph — the single source of truth |
| Tests | The gatekeeper that agents cannot bypass |

Workflow A (BMAD):
Epic  →  BMAD Story  →  Beads  →  Ralph Loop  →  Verified Code

Workflow B (Quick-Plan):
Feature Idea  →  /quick-plan  →  Spec  →  Codex Review  →  Beads  →  Ralph Loop  →  Verified Code

Both workflows use the same gatekeeper:
                              close-with-proof.sh
                                       ↓
                              Tests pass? → Close task
                              Tests fail? → Keep trying

RBP Workflow

From requirements to verified code. No human intervention required.




See It In Action

📺 Demo: Watch Ralph implement a feature autonomously
# 1. Convert your story to beads
./scripts/rbp/parse-story-to-beads.sh docs/stories/story-001.md

# 2. Launch Ralph
./scripts/rbp/ralph.sh

# 3. Watch the magic happen
# Ralph queries Beads → Implements task → Runs tests → Only closes if tests pass
# Repeats until all tasks complete

GIF coming soon — star the repo to get notified!




Defense in Depth

We don't trust agents. We verify them at every layer.


Verification System

| Layer | Mechanism | What It Prevents |
| --- | --- | --- |
| 1 | Objective Acceptance Criteria | Vague "it works" claims |
| 2 | Protocol Mandate | Skipping verification steps |
| 3 | Failure State Injection | "I don't remember what went wrong" |
| 4 | Test Gating (bun test) | Claims without passing tests |
| 5 | Playwright Verification | UI lies ("looks correct") |
| 6 | Human Code Review | Subtle implementation issues |
| 7 | Beads Audit Trail | Retroactive tampering |

An agent cannot game this system. Either the tests pass or they don't.




Quick Start

Prerequisites

# Beads - Git-backed task tracker (one-time global install, pick one)
brew install steveyegge/beads/bd                # Homebrew (recommended)
# or: curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash
# or: npm install -g @beads/bd
# or: go install github.com/steveyegge/beads/cmd/bd@latest

# Bun - JavaScript runtime (one-time global install)
curl -fsSL https://bun.sh/install | bash

# Claude Code CLI (one-time global install)
# https://claude.ai/download

# PAI Observability (optional, for real-time monitoring dashboard)
# https://github.com/danielmiessler/Personal_AI_Infrastructure.git

Install

# Clone the repository
git clone https://github.com/AojdevStudio/rbp-stack.git

# Install into your project
./rbp/install.sh /path/to/your/project

# Validate installation
/path/to/your/project/scripts/rbp/validate.sh

Run (Two Workflows)

Workflow A: BMAD Stories (structured story-driven)

# Create a story with BMAD
/bmad:bmm:workflows:create-story

# Convert to beads
./scripts/rbp/parse-story-to-beads.sh docs/stories/story-001.md

# Launch autonomous execution
./scripts/rbp/ralph.sh

Workflow B: Quick-Plan Specs (interview-driven)

# Create a spec through codebase analysis + interview
/quick-plan "add user authentication with JWT"

# Execute with optional Codex pre-flight review
./scripts/rbp/ralph-execute.sh specs/add-user-authentication.md

# Or skip the Codex review
./scripts/rbp/ralph-execute.sh specs/add-user-authentication.md --skip-review

Monitor Progress

bd status        # Task status
bd list --open   # Open tasks
bd tree          # Task hierarchy



Ralph CLI Reference

Ralph is the autonomous execution engine for RBP. It's written in TypeScript and runs on Bun.

Global Options

Available on all commands:

ralph --config <path>        # Custom config file path
ralph --verbose              # Increase output verbosity (debug level)
ralph --quiet                # Decrease output verbosity (warn level)
ralph --json-errors          # Output errors as JSON (default: true)
ralph --no-json-errors       # Output errors as human-readable text

Error Format: By default, errors are output as JSON for programmatic processing. Use --no-json-errors to get human-readable text output. The --json-errors and --no-json-errors flags are mutually exclusive.
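As an illustration only, a JSON-formatted error might be shaped like the hypothetical helper below; the actual fields the ralph CLI emits are not documented here.

```typescript
// Hypothetical sketch of a JSON error payload; the real CLI's schema may differ.
function toJsonError(err: Error): string {
  // One JSON object, suitable for programmatic consumption on stderr
  return JSON.stringify({
    error: true,
    name: err.name,
    message: err.message,
  });
}
```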

Commands

run (default command)

ralph run                              # Run the execution loop
ralph run --bmad                       # Use BMAD workflow explicitly
ralph run --beads                      # Use Beads workflow explicitly
ralph run --max-iterations <n>         # Max iterations (positive integer >= 1)
ralph run --dry-run                    # Dry run mode (no changes)

Validation Rules:

  • --max-iterations must be a positive integer >= 1 (prevents NaN)
  • --bmad and --beads flags cannot be used together
  • The CLI auto-detects workflow if not specified
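The rules above can be sketched as a plain validation function. This is a hypothetical helper; the real logic lives in the TypeScript CLI (cli.ts) and may differ in detail, and the default of 10 iterations is borrowed from the sample config.

```typescript
interface RunOptions {
  bmad?: boolean;
  beads?: boolean;
  maxIterations?: string; // raw CLI string, e.g. "5"
}

function validateRunOptions(opts: RunOptions): {
  workflow: "bmad" | "beads" | "auto";
  maxIterations: number;
} {
  // --bmad and --beads are mutually exclusive
  if (opts.bmad && opts.beads) {
    throw new Error("--bmad and --beads cannot be used together");
  }
  // --max-iterations must parse to a positive integer >= 1 (guards against NaN)
  // Default mirrors max_iterations: 10 in the sample rbp-config.yaml
  const n = opts.maxIterations === undefined ? 10 : Number(opts.maxIterations);
  if (!Number.isInteger(n) || n < 1) {
    throw new Error("--max-iterations must be a positive integer >= 1");
  }
  // Auto-detect the workflow when neither flag is given
  const workflow = opts.bmad ? "bmad" : opts.beads ? "beads" : "auto";
  return { workflow, maxIterations: n };
}
```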

status

ralph status                           # Show current execution state

close

ralph close <id>                       # Close a task with test verification
ralph close <id> --force               # Force close without tests (-f)
ralph close <id> --dry-run             # Dry run mode

exec-spec

ralph exec-spec <file>                 # Execute a spec file
ralph exec-spec <file> --skip-review   # Skip Codex review
ralph exec-spec <file> --max-iterations <n>  # Max iterations
ralph exec-spec <file> --dry-run       # Dry run mode



How It Works


Architecture

The Core Loop

while tasks_remain:
    task = bd ready           # Query Beads for next unblocked task
    implement(task)           # Agent implements the task
    close-with-proof.sh       # THE GATEKEEPER
        ├── bun test          # Unit tests must pass
        ├── playwright test   # UI tests must pass (if UI task)
        └── bd close          # Only now can the task close

The Gatekeeper Script

#!/usr/bin/env bash
# close-with-proof.sh - The agent cannot bypass this
# $BEAD_ID and $TASK_TYPE are set by the calling Ralph loop

# Run the unit test suite; abort on any failure
bun run test || exit 1

# Run Playwright for UI tasks (auto-detected)
if [[ "$TASK_TYPE" == "ui" ]]; then
    bunx playwright test || exit 1
fi

# Only close if all tests pass
bd close "$BEAD_ID"
echo "✅ Task verified and closed"

This is script-level enforcement. The agent has no way around it.




Failure State Injection

When a task fails its test verification, Ralph automatically injects the failure context into the next attempt:

Task Iteration 1:
  ├── Run tests
  ├── Tests fail → Append failure notes to bead
  └── Ralph continues to next task

Task Iteration 2 (when task becomes ready again):
  ├── Read previous failure notes from bead
  ├── Inject "Previous Attempt Failed" section into prompt
  ├── Agent sees exactly what went wrong
  ├── Agent fixes the issues
  ├── Run tests again
  └── If pass → Close with proof

This prevents the agent from making the same mistake twice.
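As a rough sketch, the injection step might look like this. The Bead shape, the "FAIL:" note prefix, and the helper name are all illustrative, not the actual ralph.sh internals.

```typescript
interface Bead {
  id: string;
  title: string;
  notes: string[]; // failure notes appended on earlier attempts
}

function buildTaskPrompt(bead: Bead): string {
  let prompt = `Implement task ${bead.id}: ${bead.title}\n`;
  // Inject previous failure context so the agent sees exactly what went wrong
  const failures = bead.notes.filter((n) => n.startsWith("FAIL:"));
  if (failures.length > 0) {
    prompt += "\n## Previous Attempt Failed\n";
    for (const f of failures) {
      prompt += `- ${f.slice("FAIL:".length).trim()}\n`;
    }
    prompt += "Fix these issues before re-running the tests.\n";
  }
  return prompt;
}
```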




Atomic Subtasks

When a task contains subtasks, the parser creates them as separate child beads with explicit dependencies:

Task: "Create admin dashboard"
├── Subtask 1.1: Build layout structure (no dependencies)
│   └── Bead ID: bd-123.1.1
├── Subtask 1.2: Add sidebar (depends on 1.1)
│   └── Bead ID: bd-123.1.2
├── Subtask 1.3: Implement navigation (depends on 1.2)
│   └── Bead ID: bd-123.1.3
└── Task depends on final subtask (1.3)

Benefits:

  • Clear sequencing: Each subtask has explicit dependencies
  • Granular tracking: Each subtask is independently verifiable
  • Failure recovery: If subtask 2 fails, only that subtask retries (not 1.1)
  • Optimal context: Ralph executes one subtask per iteration
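The dependency chaining above can be sketched as follows. The ChildBead shape and ID scheme are modeled on the example, but the helper itself is hypothetical; the real work happens in parse-spec-to-beads.sh.

```typescript
interface ChildBead {
  id: string;
  title: string;
  dependsOn: string[]; // IDs of beads that must close first
}

function chainSubtasks(parentId: string, subtasks: string[]): ChildBead[] {
  return subtasks.map((title, i) => ({
    id: `${parentId}.1.${i + 1}`,
    title,
    // Each subtask depends on the previous one, forming a strict sequence
    dependsOn: i === 0 ? [] : [`${parentId}.1.${i}`],
  }));
}
```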



Quick-Plan Workflow

Don't have BMAD? Use the Quick-Plan workflow instead.

How It Works

/quick-plan "feature description"
         ↓
    Codebase Analysis (scans your project)
         ↓
    Interview (asks clarifying questions until ZERO gaps remain)
         ↓
    specs/feature-name.md (with mandatory Testing Strategy + Implementation Tasks)
         ↓
./ralph-execute.sh specs/feature-name.md
         ↓
    [Optional] Codex Pre-Flight Review (GPT-5-Codex analyzes spec)
         ↓
    Parse Spec → Beads (creates task graph with dependencies)
         ↓
    Ralph Loop (bd ready → implement → test → close, repeat)
         ↓
    Verified Code

The Spec Format

Quick-plan generates specs with two mandatory RBP sections:

## Testing Strategy

### Test Framework
bun test (detected from package.json)

### Test Command
`bun test`

### Unit Tests
- [ ] Test: User model validation → File: `tests/user.test.ts`
- [ ] Test: JWT token generation → File: `tests/auth.test.ts`

## Implementation Tasks

<!-- RBP-TASKS-START -->
### Task 1: Create user model
- **ID:** task-001
- **Dependencies:** none
- **Files:** `src/models/user.ts`
- **Acceptance:** User model with email, password hash, timestamps
- **Tests:** `tests/user.test.ts`
- **Subtasks:**
  - [ ] Define TypeScript interfaces
  - [ ] Implement validation logic
  - [ ] Add timestamp fields

### Task 2: Add JWT authentication [UI]
- **ID:** task-002
- **Dependencies:** task-001
- **Files:** `src/auth/jwt.ts`, `src/components/LoginForm.tsx`
- **Acceptance:** Login returns valid JWT, stored in httpOnly cookie
- **Tests:** `tests/auth.test.ts`
<!-- RBP-TASKS-END -->
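A minimal sketch of pulling tasks out of that block, assuming the RBP-TASKS-START/END markers shown above. The real parse-spec-to-beads.sh is a bash script and does considerably more (IDs, dependencies, subtask beads).

```typescript
function extractTaskBlock(spec: string): string[] {
  // Grab everything between the two RBP task markers
  const match = spec.match(/<!-- RBP-TASKS-START -->([\s\S]*?)<!-- RBP-TASKS-END -->/);
  if (!match) throw new Error("spec is missing the RBP task markers");
  // Split the block into one chunk per "### Task N:" heading
  return match[1]
    .split(/\n(?=### Task )/)
    .map((s) => s.trim())
    .filter((s) => s.startsWith("### Task"));
}
```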

Codex Pre-Flight Review

Before executing, ralph-execute.sh optionally runs GPT-5-Codex to review the spec:

# With Codex review (default)
./scripts/rbp/ralph-execute.sh specs/feature.md

# Skip review
./scripts/rbp/ralph-execute.sh specs/feature.md --skip-review

Codex checks for:

  • Missing edge cases
  • Wrong technical approaches
  • Missing task dependencies
  • Incomplete testing strategy
  • Security concerns

UI Auto-Detection

Tasks tagged with [UI] or containing UI keywords automatically get the requires-playwright flag. The gatekeeper runs Playwright tests for these tasks.
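A rough sketch of that detection. The keyword list here is an assumption; the parser's actual heuristic may differ.

```typescript
// Assumed keyword list for illustration only
const UI_KEYWORDS = ["component", "render", "page", "form", "button", "modal"];

function requiresPlaywright(taskTitle: string): boolean {
  // An explicit [UI] tag always wins
  if (taskTitle.includes("[UI]")) return true;
  // Otherwise fall back to a keyword scan
  const lower = taskTitle.toLowerCase();
  return UI_KEYWORDS.some((kw) => lower.includes(kw));
}
```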




Key Decisions

Why Beads as Source of Truth?

The agent queries bd ready instead of reading JSON files.

  • No stale state — Beads is always current
  • No sync issues — Single source of truth
  • Git-backed — Full audit trail

Why No Story Atomization?

We analyzed 76 real BMAD stories:

| Metric | Value |
| --- | --- |
| Average story size | 3,914 tokens |
| Largest story | 12,962 tokens |
| Context budget used | 12.9% of 100k |

All stories fit in a single context window. For larger stories, our Execution Sequencer groups subtasks into phases of 3-5.
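The phase grouping amounts to simple chunking. The helper below is illustrative (the real sequencer.sh is bash), with the phase size assumed to come from phase_size in rbp-config.yaml.

```typescript
function groupIntoPhases<T>(subtasks: T[], phaseSize = 5): T[][] {
  // Slice the subtask list into consecutive phases of at most phaseSize items
  const phases: T[][] = [];
  for (let i = 0; i < subtasks.length; i += phaseSize) {
    phases.push(subtasks.slice(i, i + phaseSize));
  }
  return phases;
}
```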

Why Test-Gating at Script Level?

Agents can be told "run tests before closing." They can ignore the instruction.

Scripts cannot be ignored. close-with-proof.sh runs the tests. Either they pass or the task stays open.




What's Included

rbp/
├── scripts/
│   ├── ralph.sh              # Main execution loop
│   ├── ralph-execute.sh      # Quick-plan execution (with Codex review)
│   ├── close-with-proof.sh   # Test-gated closure (THE GATEKEEPER)
│   ├── emit-event.sh         # PAI Observability event emitter
│   ├── parse-story-to-beads.sh  # BMAD Story → Beads conversion
│   ├── parse-spec-to-beads.sh   # Quick-plan Spec → Beads conversion (with atomic subtasks)
│   ├── sequencer.sh          # Phase grouping for large stories
│   ├── show-active-task.sh   # Display current task
│   └── save-progress-to-beads.sh  # Sync progress to bead notes
├── commands/rbp/
│   ├── start.md              # /rbp:start command (with dashboard auto-launch)
│   ├── status.md             # /rbp:status command
│   └── validate.md           # /rbp:validate command
├── lib/src/
│   ├── cli.ts                # TypeScript CLI entry point (Commander.js)
│   ├── commands/             # CLI command implementations
│   ├── workflows/            # BMAD and Beads workflow handlers
│   ├── config/               # Configuration loading and validation
│   └── utils/                # Shared utilities and error handling
├── templates/
│   ├── rbp-config.yaml         # Base configuration
│   ├── rbp-config.example.yaml # Documented config with comments
│   └── spec-template.md        # Quick-plan spec format template
├── install.sh                # One-line installation
├── validate.sh               # Installation checker
└── README.md                 # Package documentation

Key features of included scripts:

  • ralph.sh: Failure state injection, completion signal detection
  • close-with-proof.sh: Failure note appending, multi-layer verification
  • parse-spec-to-beads.sh: Atomic subtask creation with dependency chaining
  • cli.ts: TypeScript CLI with validation rules for arguments and options



Tech Stack

  • Execution: Claude Code CLI
  • CLI Engine: TypeScript + Commander.js (bun runtime)
  • State: Beads (git-backed) — query bd ready, never mirror to JSON
  • Testing: bun test + Playwright
  • Scripts: Bash
  • Runtime: bun



Configuration

# rbp-config.yaml
project:
  name: "your-project"

paths:
  stories: "docs/stories"      # BMAD stories
  specs: "specs"               # Quick-plan specs

execution:
  max_iterations: 10
  phase_size: 5

verification:
  require_tests: true
  require_playwright_for_ui: true
  test_command: "bun run test"

quick_plan:
  command: "/quick-plan"
  spec_template: "templates/spec-template.md"

codex:
  enabled: true                # Set false if Codex not installed
  model: "gpt-5-codex"
  reasoning_effort: "high"
  skip_by_default: false       # Set true to skip review by default

observability:
  enabled: true                # Emit events to PAI dashboard
  auto_launch: true            # Auto-start dashboard with /rbp:start



Observability

RBP integrates with PAI (Personal AI Infrastructure) for real-time observability of task execution.

What You Get

| Feature | Description |
| --- | --- |
| Real-time Dashboard | Watch task progress in your browser |
| Event Stream | See RBP:TaskStart, RBP:TestRun, RBP:TestResult events live |
| Debug Visibility | Trace through test failures and errors |
| Multi-Session Support | Run multiple RBP sessions with distinct session IDs |

Setup

# 1. Install PAI (if not already installed)
git clone https://github.com/danielmiessler/Personal_AI_Infrastructure.git ~/PAI
cd ~/PAI && ./install.sh

# 2. RBP auto-detects PAI and emits events automatically
# Events are written to: ~/.claude/history/raw-outputs/YYYY-MM/YYYY-MM-DD_all-events.jsonl

# 3. Launch dashboard with /rbp:start or manually:
~/.claude/observability/manage.sh start
# Dashboard: http://localhost:5172

Event Types

| Event | Emitted When |
| --- | --- |
| RBP:LoopStart | Ralph begins execution |
| RBP:TaskStart | A task is picked from bd ready |
| RBP:TaskProgress | Task status changes (executing, iteration_complete) |
| RBP:TaskComplete | Task closed with proof |
| RBP:TestRun | Tests are about to run |
| RBP:TestResult | Tests complete (includes exit code, output) |
| RBP:Error | An error occurred |
| RBP:CodexReview | Codex pre-flight review starts/completes |
| RBP:SpecParsed | Spec parsed to Beads |
| RBP:LoopEnd | Ralph loop completes |
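Each event lands as one JSON object per line in the daily .jsonl file. As an illustration of that shape (the exact schema emit-event.sh writes is an assumption):

```typescript
function formatEvent(
  type: string,
  sessionId: string,
  payload: Record<string, unknown>,
): string {
  // One JSON object per line, ready to append to the daily .jsonl file
  return JSON.stringify({
    event: type,
    session: sessionId,
    timestamp: new Date().toISOString(),
    ...payload,
  });
}
```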

Without PAI

RBP works without PAI — observability events are simply not emitted. You can still monitor progress via:

# File-based logs
tail -f scripts/rbp/progress.txt

# Beads activity
bd activity --follow

# Task status
bd status



The Story Behind RBP

I've been using the BMAD Method for a while now. It's probably the best tool I've found for building software projects with AI — structured stories, clear acceptance criteria, the whole workflow. I'm also an avid Claude Code user. These tools changed how I build.

But something was missing.

Every time I kicked off a BMAD story, I'd watch the AI work... then it would stop. Ask a question. Wait for me. I'd answer, it would continue... then stop again. The constant back-and-forth was killing my productivity. I wanted to give it an Epic and walk away. Come back to working code.

I wanted long-running autonomous processes.

Then I discovered Ralph — Geoffrey Huntley's pattern for relentless AI execution loops. And Beads — Steve Yegge's git-backed task graph. Something clicked.

What if I could combine BMAD's structured stories with Ralph's autonomous loops and Beads' persistent memory?

I started building. 76 stories later, I had a working system. But I also discovered something uncomfortable: AI agents lie. They mark tasks "complete" without running tests. They check boxes without doing the work.

The realization hit me: Checkboxes are self-reported. Tests are objective.

An agent can flip a boolean. It cannot fake a passing test.

So I added test-gated closure. No task closes without proof. The script runs the tests — either they pass or the task stays open. The agent has no say in the matter.

Then I realized: when a task fails, the agent needs to see what went wrong. So I added failure state injection. The previous attempt's notes are automatically injected into the retry prompt. Now agents can learn from their mistakes without human guidance.

Finally, I made subtasks atomic. Each subtask is a separate bead with explicit dependencies, not just checklist items. This lets Ralph execute them sequentially with test verification after each one.

The RBP Stack is the result.

What started as a productivity hack became a verification-first autonomous development system. BMAD creates the stories. Beads tracks the state. Ralph drives the execution. Tests guard the gates. Failure notes teach the next attempt.

Now I give it an Epic and walk away. Come back to verified, working code.


I wanted to stop babysitting AI. This is how I did it.




Roadmap

  • Core execution loop (Ralph)
  • Test-gated closure
  • Story → Beads conversion (BMAD workflow)
  • Spec → Beads conversion (Quick-Plan workflow)
  • Codex pre-flight review integration
  • UI auto-detection (Playwright)
  • Execution sequencer for large stories
  • Real-time progress dashboard (PAI Observability integration)
  • Failure state injection (previous attempt context)
  • Atomic subtask creation with dependencies
  • Parallel task execution
  • Integration with more test frameworks



Contributing

Contributions welcome! Please ensure:

  1. All scripts have tests
  2. Documentation is updated
  3. The verification system is never bypassed

See CONTRIBUTING.md for guidelines.




Acknowledgments

  • Beads — Git-backed issue tracking by Steve Yegge
  • BMAD — Structured story creation framework
  • Claude Code — Execution environment
  • Ralph Pattern — The original autonomous loop concept by Geoffrey Huntley



License

MIT License — see LICENSE for details.




Built with frustration. Verified with tests.


If this helped you, ⭐ star the repo — it helps others find it.


