The full framework lives inside the p2engine/ directory, with setup instructions and detailed documentation in p2engine/README.md.
P2Engine is the artifact of my master’s thesis, in which I explore and, optimistically, define what makes a system autonomous. You can read the thesis here.
I’ve also written an article on P2Engine: a broader dive into the research, ideas, visions, and tangents around P2Engine and the thesis. Read it here →
Showcase • How It Works • Rollouts • Ledger • How It Works (Diagrams) • Future • Contact
Agent Delegation — Agents delegate tasks to sub-agents. Conversation context is preserved across all agent interactions.
Branch Rewind — Rewind conversations to any point and explore alternative paths. Switch between different conversation branches.
Rollouts — Run multiple agent configurations simultaneously, A/B testing. Compare performance metrics (artifacts, reward, token usage, costs, ledger transactions, speed, and more) and visualize results in real-time.
Ledger — Agents have wallets and can transfer funds. All transactions are recorded with complete audit trails.
P2Engine's architecture and internals are inspired by a beautiful and well-thought-out blueprint laid down by Erik. His architecture proposal and his taste for systems (what an agent is, tools, prompt templates, reasoning, and so on) have guided P2Engine to what it is today. I highly recommend his series: start with Part 1, then continue to Part 2, Part 3, and Part 4.
P2Engine’s orchestration is fundamentally built on finite state machine (FSM) principles, where each agent conversation progresses through well-defined states with explicit transition rules.
It runs LLM agents in discrete steps. Each step produces artifacts, and a separate, asynchronous evaluator scores them. You can think of it like this: the agent thinks, acts, and creates something; the evaluator observes and scores.
Finite State Machine
Each agent conversation moves through well-defined states: UserMessage → AssistantMessage or ToolCall → ToolResult → AssistantMessage, with WaitingState managing async operations. Agent delegation adds AgentCall/AgentResult states, and conversations end with FinishedState. This explicit state management makes it easier to build reliable coordination and to debug the system.
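The state flow above can be sketched as a small transition table. The state names and the allowed transitions below are illustrative, not P2Engine's actual classes:

```python
from enum import Enum, auto

class State(Enum):
    USER_MESSAGE = auto()
    ASSISTANT_MESSAGE = auto()
    TOOL_CALL = auto()
    TOOL_RESULT = auto()
    WAITING = auto()
    AGENT_CALL = auto()
    AGENT_RESULT = auto()
    FINISHED = auto()

# Hypothetical transition rules mirroring the flow described above.
TRANSITIONS = {
    State.USER_MESSAGE: {State.ASSISTANT_MESSAGE, State.TOOL_CALL},
    State.TOOL_CALL: {State.WAITING, State.TOOL_RESULT},
    State.WAITING: {State.TOOL_RESULT, State.AGENT_RESULT},
    State.TOOL_RESULT: {State.ASSISTANT_MESSAGE, State.TOOL_CALL},
    State.ASSISTANT_MESSAGE: {State.TOOL_CALL, State.AGENT_CALL, State.FINISHED},
    State.AGENT_CALL: {State.WAITING, State.AGENT_RESULT},
    State.AGENT_RESULT: {State.ASSISTANT_MESSAGE},
    State.FINISHED: set(),
}

def step(current: State, nxt: State) -> State:
    """Advance the conversation FSM, rejecting illegal transitions."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

Explicit transition rules like these are what make illegal states unrepresentable at runtime: a bad routing decision fails loudly instead of silently corrupting the conversation.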
Interaction Stack
Every agent maintains a stack storing complete interaction history. This enables conversation rewinding, branch creation for alternative paths, and preserved context across agent handoffs.
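A toy version of such a stack shows how rewinding and branching fall out of keeping the full history; the frame shape and method names here are hypothetical:

```python
import copy

class InteractionStack:
    """Minimal interaction stack: append-only history with rewind and branching.

    Illustrative only; P2Engine's real stack stores richer frames.
    """

    def __init__(self, frames=None):
        self.frames = list(frames or [])

    def push(self, frame):
        """Record one interaction (message, tool call, result, ...)."""
        self.frames.append(frame)

    def rewind(self, index):
        """Drop everything after `index`, returning to an earlier point."""
        self.frames = self.frames[: index + 1]

    def branch(self, index):
        """Fork a new stack from the prefix ending at `index`,
        leaving the original untouched."""
        return InteractionStack(copy.deepcopy(self.frames[: index + 1]))
```

Because a branch copies a prefix rather than mutating the original, two branches can be explored side by side and compared later.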
Effect-Based Execution
Agent decisions generate "effects" (tool calls, payments, delegations) that execute asynchronously via Celery workers.
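A minimal sketch of that pattern, with a thread and an in-process queue standing in for Celery and its broker (the `Effect` shape is hypothetical):

```python
from dataclasses import dataclass, field
import queue
import threading

@dataclass
class Effect:
    """An agent decision to execute asynchronously (hypothetical shape)."""
    kind: str               # e.g. "tool_call", "payment", "delegation"
    payload: dict = field(default_factory=dict)

effects = queue.Queue()  # stands in for the Celery broker
results = []

def worker():
    # In P2Engine this role is played by Celery workers; a thread stands in here.
    while True:
        effect = effects.get()
        if effect is None:  # sentinel: shut down
            break
        results.append((effect.kind, "executed"))

t = threading.Thread(target=worker)
t.start()
effects.put(Effect("tool_call", {"tool": "check_balance"}))
effects.put(Effect("payment", {"amount": 5}))
effects.put(None)
t.join()
```

The key property is the decoupling: the agent only enqueues effects and moves on; execution, retries, and results arrive asynchronously.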
Artifact Evaluation
Every agent action produces artifacts that evaluators score automatically.
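A toy version of that scoring step, with a hypothetical length-based rule standing in for the real evaluators:

```python
artifacts = [
    {"id": "a1", "content": "short answer"},
    {"id": "a2", "content": "a much longer, more detailed answer " * 3},
]

def evaluate(artifact):
    """Hypothetical scoring rule: longer outputs score higher, capped at 1.0.
    P2Engine's actual evaluators are far richer; this only shows the shape."""
    return min(len(artifact["content"]) / 100.0, 1.0)

# Score every artifact automatically, as the evaluator pipeline would.
scores = {a["id"]: evaluate(a) for a in artifacts}
```

In the real system this scoring runs asynchronously off the artifact stream, so the agent never blocks on its own evaluation.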
Ledger
Canton/DAML ledger integration provides agent wallets, automated payments, and immutable audit trails. Agents can earn rewards for the quality of their work, with every transaction recorded.
Click for a full video demonstration
To test P2Engine I used a combination of chat and rollouts. Rollouts became the primary way to simulate scenarios, watch the interactions, debug and inspect the logs, and inspect the transactions happening on the ledger. In its current form the system therefore leans toward experimental use, and this is reflected in the current configuration files (e.g., rollout_joke.yml): they are examples of simulations rather than true A/B tests.
But the infrastructure does support proper rollouts.
The rollout system provides A/B testing capabilities for systematically comparing agent configurations, tool combinations, and parameters. Built in the runtime/rollout/ module with Celery integration, it enables simple distributed execution and parallel evaluation of multiple variants.
How It Works
Create YAML configuration files that specify teams, base settings, and variants. The system expands these into individual experiments, runs them in parallel using Celery workers, and collects the metrics: conversation artifacts, rewards, token usage, costs, ledger transactions, and speed.
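The expansion step can be sketched as a cartesian product over variant axes. The config shape below is a hypothetical stand-in for the real YAML files (e.g., rollout_joke.yml), not their actual schema:

```python
from itertools import product

# Hypothetical rollout config, mirroring the YAML structure described above.
config = {
    "base": {"task": "tell a joke", "max_steps": 8},
    "variants": {
        "model": ["gpt-4o-mini", "gpt-4o"],
        "temperature": [0.2, 0.8],
    },
}

def expand(config):
    """Expand base settings x variant axes into individual experiments."""
    keys = list(config["variants"])
    experiments = []
    for values in product(*(config["variants"][k] for k in keys)):
        exp = dict(config["base"])   # start from the shared base settings
        exp.update(zip(keys, values))  # overlay this variant's choices
        experiments.append(exp)
    return experiments

experiments = expand(config)  # 2 models x 2 temperatures = 4 experiments
```

Each expanded experiment is then handed to a Celery worker, which is what makes the variants run in parallel rather than sequentially.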
Real-time Monitoring
We can watch rollouts unfold through the P2Engine CLI with live progress tables, or stream them to the Rerun viewer for more visual monitoring. Rerun can also replay, rewind, and inspect our rollouts. Each variant gets automatic evaluation scores, and the system aggregates results for easy comparison of what works best.
The Foundation for Learning
Rollouts are the key infrastructure for P2Engine's future learning loop. Evaluators already score agent outputs automatically, so the next step is to close the loop and use these scores to automatically propagate successful configurations to underperforming variants, creating a system that improves itself through experimentation. There are many possible approaches here, and I discuss them in the Future section.
The idea behind the ledger is to introduce monetary incentives, payments, and audits into multi-agent systems. P2Engine does this by integrating Canton/DAML technology, building toward the vision of the Canton Network while currently running on a local development instance. Agents have wallets, can transfer funds, and earn rewards based on performance. All transactions are recorded in immutable audit trails. The ledger is optional when running P2Engine.
Agent Wallets & Transfers
Each agent gets a wallet managed by DAML smart contracts that enforce validation rules. Agents can transfer funds to each other, with automatic overdraft protection and complete transaction history. Currently implemented as a local Canton instance with test tokens for experimentation and development.
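A toy wallet illustrating the overdraft protection and transaction history described above. In the real system this validation lives in DAML smart contracts on the Canton instance; the class below is only a sketch:

```python
class Wallet:
    """Minimal agent wallet with overdraft protection (illustrative only;
    the real enforcement happens in DAML smart contracts)."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance
        self.history = []  # complete transaction history

    def transfer(self, other, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self.balance:
            # Overdraft protection: reject rather than go negative.
            raise ValueError("insufficient funds")
        self.balance -= amount
        other.balance += amount
        self.history.append(("transfer", self.owner, other.owner, amount))
```

Keeping the history append-only is the in-memory analogue of the ledger's immutable audit trail.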
Tools
Agents use standard tools like transfer_funds, check_balance, and reward_agent, see ledger_tools.py. The system automatically tracks all financial activity during conversations and rollouts.
Incentive Alignment
Through the evaluators in P2Engine, agents earn rewards/payments for quality work during rollouts. The current implementation includes basic reward mechanisms based on evaluation scores, providing a foundation for more sophisticated economic coordination research.
Audit Trails
Every transaction flows through the artifact bus, creating permanent records. Rollouts capture before/after ledger snapshots, enabling experiments with agent economic behavior and resource allocation strategies.
Why Canton
Canton provides smart contract validation, privacy-preserving operations, and the robust infrastructure needed for financial accountability. P2Engine's current local Canton setup serves as a development environment and research platform, with the architecture designed to potentially connect to the broader Canton Network ecosystem as it evolves.
Let's go through How It Works again, this time with four diagrams that give a closer look at P2Engine.
The execution process follows the interactions shown in the Sequence Diagram, illustrating how P2Engine operates end-to-end.
- Agent Processing — Consumes conversation states and produces responses or tool invocations.
- Tool Execution — Runs asynchronously, publishing results back into the conversation streams.
- Event Publication — Broadcasts state-change events so they can be logged, traced, or visualized in real time.
- Evaluation Triggering — Runs automated evaluation when a conversation completes or a configured step is reached.
- Reward Settlement — Converts evaluation results into payments recorded on the ledger.
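The five stages above can be traced with a stub pipeline. Every function name here is hypothetical; the point is only the ordering of the stages:

```python
# Stub pipeline tracing the five execution stages in order.
trace = []

def agent_step(state):
    trace.append("agent_processing")        # 1. consume state, produce an action
    return {"type": "tool_call", "tool": "lookup"}

def run_tool(call):
    trace.append("tool_execution")          # 2. run the tool asynchronously
    return {"type": "tool_result", "ok": True}

def publish(event):
    trace.append("event_publication")       # 3. broadcast the state change

def evaluate(result):
    trace.append("evaluation")              # 4. automated evaluation
    return 0.9

def settle(score):
    trace.append("reward_settlement")       # 5. record payment on the ledger

result = run_tool(agent_step({}))
publish(result)
settle(evaluate(result))
```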
All system events flow through a single stream
The unified event stream powers monitoring, debugging, and analysis by capturing every significant system event. It is built around four key principles:
- Universal Event Streaming — Routes all agent, tool, and system events into a single, consistent stream.
- Complete Traceability — Captures both metadata and payload for every event, preserving causal links across the system.
- Real-time Transparency — Powers live monitoring and debugging tools with immediate event updates, viewable in Rerun and through the P2Engine CLI output.
- Event Logs — Stores a full history of events for replay, inspection, and post-mortem analysis.
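A minimal sketch of a unified stream with causal links; the event schema here is hypothetical, not P2Engine's actual one:

```python
import time

EVENT_LOG = []  # the single stream: every event lands here

def publish(kind, payload, cause=None):
    """Append an event with metadata, payload, and a causal link (toy shape)."""
    event = {
        "id": len(EVENT_LOG),
        "ts": time.time(),
        "kind": kind,
        "payload": payload,
        "cause": cause,  # id of the event that triggered this one
    }
    EVENT_LOG.append(event)
    return event["id"]

root = publish("tool_call", {"tool": "check_balance"})
publish("tool_result", {"balance": 42}, cause=root)
```

Because each event carries the id of its cause, the full chain behind any outcome can be reconstructed after the fact, which is what enables replay and post-mortem analysis.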
P2Engine’s execution is built on finite state machine (FSM) principles, where each agent conversation moves through well-defined states with explicit transition rules. This makes the system deterministic, debuggable, and flexible enough to handle dynamic routing decisions and asynchronous operations.
The FSM design is guided by four key ideas:
- Emergent Coordination — Decides at runtime how many agents to deploy and how to assign tasks, moving beyond rigid, pre-planned workflows.
- Explicit State Transitions — Governs every step with clear transition rules and handlers, ensuring reliable progression while allowing dynamic routing based on agent outputs.
- Conversation Branching — Allows interaction histories to fork at any point, enabling experimental workflows and comparative analyses of different strategies.
- Dynamic Agent Delegation — Lets agents spawn sub-agents and sub-conversations, building a hierarchical task decomposition that adapts to real system needs.
From request initiation to ledger confirmation
The transaction flow ensures that every financial operation follows a consistent pattern with proper validation, execution, and recording. The diagram shows the full lifecycle: a request is validated (agent exists, balance is sufficient, amount is valid), submitted to Canton for smart contract execution and consensus, and finally committed to update balances and create an immutable audit record. Each transaction result, success or failure, is returned with its ID, allowing for traceability and post-hoc inspection. This design supports compliance reporting and reconciliation while keeping the operational path simple and verifiable.
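The validate-then-commit lifecycle can be sketched in a few lines. On the real ledger, validation and consensus happen inside Canton/DAML; the in-memory version below only shows the control flow:

```python
import uuid

balances = {"alice": 10, "bob": 0}
audit_log = []  # append-only, standing in for the immutable audit trail

def transact(sender, receiver, amount):
    """Validate -> commit -> record, mirroring the lifecycle described above.
    (In the real system, validation and consensus run on Canton/DAML.)"""
    tx_id = str(uuid.uuid4())
    # 1. Validation: agents exist, amount is valid, balance is sufficient.
    if sender not in balances or receiver not in balances:
        return {"id": tx_id, "ok": False, "reason": "unknown agent"}
    if amount <= 0 or balances[sender] < amount:
        return {"id": tx_id, "ok": False, "reason": "invalid amount or balance"}
    # 2. Commit: update balances.
    balances[sender] -= amount
    balances[receiver] += amount
    # 3. Record: append an audit entry keyed by the transaction id.
    audit_log.append({"id": tx_id, "from": sender, "to": receiver, "amount": amount})
    return {"id": tx_id, "ok": True}
```

Note that failures also return a transaction id, so rejected operations remain traceable even though they never touch the balances.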
So, what’s next for P2Engine?
A first big win is to close the learning loop: use the stored trajectories and evaluator rewards to optimize system prompts, routing, tools, and model choices. The initial focus will be "System Prompt Learning", where the goal is to improve the "program around the model" (prompts, tool availability, temperatures, routing) instead of retraining weights. Strictly speaking this is configuration learning, but the plan is to focus solely on system prompts first and grow into full "program around the model" learning over time. Either way, this type of learning fits P2Engine nicely and is not far from being achievable. We first need to firm up the rollout module, evaluators, and rewards, and fix a few internals; then we can start experimenting with this learning loop.
In the system prompt learning loop, the router/root agent gives several sub-agents the same task with different configs, scores their steps and outputs, and propagates the best strategy (prompt text, tool allowlist, model, temperature) to weaker variants. Updates can be applied per step, every x steps, or at the end of the trajectory. We treat each system prompt as a configuration to optimize: the loop learns which prompts to adjust and which approaches work. As Karpathy puts it, it's about figuring out which approach is best for an agent, and that is often steered by the system prompt itself, not the weights.
So again: treat each system prompt as a living program that the root/router edits based on evaluator feedback. During a rollout, the router compares sibling agents (same task, different configs), propagates the best-performing strategies (prompt text, tool choices, temperature), and normalizes around them.
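The propagation step could look like the sketch below: pick the best-scoring sibling and copy its configuration onto the rest. The variant fields and scores are made up for illustration:

```python
# Sibling variants of the same task, each with an evaluator score (hypothetical).
variants = [
    {"prompt": "Be concise.", "temperature": 0.2, "score": 0.41},
    {"prompt": "Think step by step.", "temperature": 0.7, "score": 0.83},
    {"prompt": "Answer directly.", "temperature": 0.5, "score": 0.55},
]

def propagate_best(variants):
    """Copy the best variant's config onto its weaker siblings."""
    best = max(variants, key=lambda v: v["score"])
    for v in variants:
        if v is not best:
            v["prompt"] = best["prompt"]
            v["temperature"] = best["temperature"]
    return best
```

A real loop would re-run the normalized variants with fresh perturbations rather than collapsing them permanently, so exploration continues after each propagation step.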
This idea adopts, reflects and builds upon the thinking of:
- A tweet by Karpathy on System Prompt Learning
- Against RL: The Case for System 2 Learning
- Part 4: Improving agent reasoning, and More Thoughts on Agents
and the idea will evolve as I get to work on this.
Once that’s in place, who knows where P2Engine goes next!
adam.sioud@protonmail.com
surya.b.kathayat@ntnu.no
Adam Dybwad Sioud
Supervisor: Surya Kathayat
NTNU – Norwegian University of Science and Technology, Department of Computer Science

