Commit 4d73567

formatting and organization in agent/plans and agent/wiki directories
1 parent 2c409ba commit 4d73567

7 files changed: +455 -58 lines changed

apps/hash-ai-agent/agent/docs/E2E-test-results-2024-12-17.md renamed to apps/hash-ai-agent/agent/plans/E2E-test-results-2024-12-17.md

File renamed without changes.

apps/hash-ai-agent/agent/docs/PLAN-task-decomposition.md renamed to apps/hash-ai-agent/agent/plans/PLAN-task-decomposition.md

Lines changed: 56 additions & 56 deletions
@@ -115,19 +115,19 @@ This enables bounded dynamism — the loop structure is static, but the step fun

 ### 3.1 MVP Step Types (v1)

-| Type | Category | Description | Parallelizable |
-| ----------- | --------- | ------------------------------------------------ | -------------- |
-| `research` | Strategic | Gather existing knowledge | Yes |
-| `synthesize`| Strategic | Combine results OR evaluate against criteria | No |
-| `experiment`| Tactical | Test hypothesis empirically | Yes |
-| `develop` | Tactical | Build/implement something | Sometimes |
+| Type | Category | Description | Parallelizable |
+| ------------ | --------- | -------------------------------------------- | -------------- |
+| `research` | Strategic | Gather existing knowledge | Yes |
+| `synthesize` | Strategic | Combine results OR evaluate against criteria | No |
+| `experiment` | Tactical | Test hypothesis empirically | Yes |
+| `develop` | Tactical | Build/implement something | Sometimes |

 ### 3.2 Future Step Types

-| Type | Category | Description | Notes |
-| ------------- | --------- | ------------------------------------------------ | ------------------------------- |
-| `hypothesize` | Strategic | Generate testable hypotheses from findings | Currently implicit in workflow |
-| `transform` | Tactical | Pure data manipulation | Can be done via `develop` |
+| Type | Category | Description | Notes |
+| ------------- | --------- | ------------------------------------------ | ------------------------------ |
+| `hypothesize` | Strategic | Generate testable hypotheses from findings | Currently implicit in workflow |
+| `transform` | Tactical | Pure data manipulation | Can be done via `develop` |

 ### 3.3 Subsumed Types

@@ -663,17 +663,17 @@ apps/hash-ai-agent/src/mastra/

 ## 8. Key Design Decisions

-| Decision | Choice | Rationale |
-| -------- | ------ | --------- |
-| Schema complexity | MVP-first (4 step types) | LLMs struggle with deeply nested discriminated unions |
-| Execution mode | Interpreted (deferred) | Supports dynamism, but plan quality is primary focus |
-| Stub executors | Low priority | Plan quality scoring is the real signal |
-| Plan evaluation | LLM judge + structural | No single "correct" decomposition |
-| Hypotheses | Explicit top-level + referenced | Makes scientific structure visible |
-| Unknowns | Epistemically rigorous partition | Surfaces uncertainty rather than hiding it |
-| Experiment modes | Exploratory vs confirmatory | Reduces researcher degrees of freedom |
-| Synthesis modes | Integrative vs evaluative | Subsumes "assess" step type |
-| Decision points | Deferred to v2 | Simplifies MVP |
+| Decision | Choice | Rationale |
+| ----------------- | -------------------------------- | ----------------------------------------------------- |
+| Schema complexity | MVP-first (4 step types) | LLMs struggle with deeply nested discriminated unions |
+| Execution mode | Interpreted (deferred) | Supports dynamism, but plan quality is primary focus |
+| Stub executors | Low priority | Plan quality scoring is the real signal |
+| Plan evaluation | LLM judge + structural | No single "correct" decomposition |
+| Hypotheses | Explicit top-level + referenced | Makes scientific structure visible |
+| Unknowns | Epistemically rigorous partition | Surfaces uncertainty rather than hiding it |
+| Experiment modes | Exploratory vs confirmatory | Reduces researcher degrees of freedom |
+| Synthesis modes | Integrative vs evaluative | Subsumes "assess" step type |
+| Decision points | Deferred to v2 | Simplifies MVP |

 ---

@@ -698,21 +698,21 @@ interface ExecutionState {

 ### 10.1 Deterministic Scorers

-| Scorer | Measures |
-| ------ | -------- |
-| `plan-structure` | DAG valid, refs exist, no orphans |
-| `plan-coverage` | All requirements addressed by steps |
-| `experiment-rigor` | Confirmatory experiments have preregistration |
+| Scorer | Measures |
+| ------------------- | --------------------------------------------------------------- |
+| `plan-structure` | DAG valid, refs exist, no orphans |
+| `plan-coverage` | All requirements addressed by steps |
+| `experiment-rigor` | Confirmatory experiments have preregistration |
 | `unknowns-coverage` | All three unknown categories populated, community check present |

 ### 10.2 LLM Judge Scorers

-| Scorer | Measures |
-| ------ | -------- |
-| `plan-testability` | Hypotheses can actually be tested |
-| `plan-granularity` | Steps appropriately scoped (not too broad, not too narrow) |
-| `plan-coherence` | Steps make sense together |
-| `plan-completeness` | No obvious gaps for the goal |
+| Scorer | Measures |
+| ------------------- | ---------------------------------------------------------- |
+| `plan-testability` | Hypotheses can actually be tested |
+| `plan-granularity` | Steps appropriately scoped (not too broad, not too narrow) |
+| `plan-coherence` | Steps make sense together |
+| `plan-completeness` | No obvious gaps for the goal |

 ### 10.3 Phase 1 Approach

@@ -918,13 +918,13 @@ Validation is deterministic (DAG acyclic, refs exist). Scoring is evaluative (pl

 ### 15.1 Input Injection

-| Element | Location | Rationale |
-| ------- | -------- | --------- |
-| Goal | User message | Primary input, should be prominent |
-| Context | User message (after goal) | Supplements goal |
-| Available agents | System prompt | Static context the planner reasons about |
-| Constraints | User message (after context) | Goal-specific constraints |
-| Schema description | System prompt | Defines output structure |
+| Element | Location | Rationale |
+| ------------------ | ---------------------------- | ---------------------------------------- |
+| Goal | User message | Primary input, should be prominent |
+| Context | User message (after goal) | Supplements goal |
+| Available agents | System prompt | Static context the planner reasons about |
+| Constraints | User message (after context) | Goal-specific constraints |
+| Schema description | System prompt | Defines output structure |

 ### 15.2 Agent Presentation

@@ -980,26 +980,26 @@ See `docs/PROMPTS-meta-cognitive.md` for research-planning and experiment-design

 ### Completed (Phase 1-4)

-| Component | File | Status | Tests |
-|-----------|------|--------|-------|
-| PlanSpec schema | `schemas/plan-spec.ts` | ✅ Complete ||
-| Agent profiles | `constants.ts` | ✅ Complete ||
-| Plan validator | `tools/plan-validator.ts` | ✅ Complete | 25 tests |
-| Topology analyzer | `tools/topology-analyzer.ts` | ✅ Complete ||
-| Planner agent | `agents/planner-agent.ts` | ✅ Complete ||
-| Positive fixtures (4) | `fixtures/decomposition-prompts/` | ✅ Complete ||
-| Negative fixtures | `tools/plan-validator.test.ts` | ✅ Complete | 25 tests |
-| Deterministic scorers (4) | `scorers/plan-scorers.ts` | ✅ Complete | 23 tests |
-| LLM judge scorers (3) | `scorers/plan-llm-scorers.ts` | ✅ Complete | 6 tests |
+| Component | File | Status | Tests |
+| ------------------------- | --------------------------------- | ---------- | -------- |
+| PlanSpec schema | `schemas/plan-spec.ts` | ✅ Complete | |
+| Agent profiles | `constants.ts` | ✅ Complete | |
+| Plan validator | `tools/plan-validator.ts` | ✅ Complete | 25 tests |
+| Topology analyzer | `tools/topology-analyzer.ts` | ✅ Complete | |
+| Planner agent | `agents/planner-agent.ts` | ✅ Complete | |
+| Positive fixtures (4) | `fixtures/decomposition-prompts/` | ✅ Complete | |
+| Negative fixtures | `tools/plan-validator.test.ts` | ✅ Complete | 25 tests |
+| Deterministic scorers (4) | `scorers/plan-scorers.ts` | ✅ Complete | 23 tests |
+| LLM judge scorers (3) | `scorers/plan-llm-scorers.ts` | ✅ Complete | 6 tests |

 ### Not Yet Implemented

-| Component | File | Status | Notes |
-|-----------|------|--------|-------|
-| Planning workflow loop | `workflows/planning-workflow.ts` | ⚠️ Stub only | No `.dountil()` revision logic |
-| Supervisor agent | `agents/supervisor-agent.ts` | ❌ Not started | LLM approval gate |
-| End-to-end tests | `workflows/planning-workflow.test.ts` | ⚠️ Partial | Uses hand-crafted plans, not generated |
-| Stub execution (Level 1) || ❌ Not started | Low priority |
+| Component | File | Status | Notes |
+| ------------------------ | ------------------------------------- | ------------- | -------------------------------------- |
+| Planning workflow loop | `workflows/planning-workflow.ts` | ⚠️ Stub only | No `.dountil()` revision logic |
+| Supervisor agent | `agents/supervisor-agent.ts` | ❌ Not started | LLM approval gate |
+| End-to-end tests | `workflows/planning-workflow.test.ts` | ⚠️ Partial | Uses hand-crafted plans, not generated |
+| Stub execution (Level 1) | | ❌ Not started | Low priority |

 ---

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
# PR: Task Decomposition & Planning Framework

## 🌟 What is the purpose of this PR?

Introduces a framework for decomposing complex R&D goals into structured, executable plans using LLM-based planning agents. The core insight is treating LLM planning as a "compiler front-end" that produces an **Intermediate Representation (IR)** — the `PlanSpec` — which can be validated, scored, and eventually compiled into executable workflows.

This PR establishes the foundational infrastructure for plan generation and quality evaluation, with the goal of enabling autonomous research and development workflows.

## 🔗 Related links

- `agent/docs/PLAN-task-decomposition.md` — Full design document and implementation plan
- `agent/docs/E2E-test-results-2024-12-17.md` — Latest E2E test outputs

## 🚫 Blocked by

_None_

## 🔍 What does this change?

### Core Schema & Types

- **`schemas/plan-spec.ts`** — Full Zod schema for `PlanSpec` with 4 step types:
  - `research` — Parallelizable information gathering
  - `synthesize` — Combining findings (integrative) or evaluating results (evaluative)
  - `experiment` — Testing hypotheses (exploratory or confirmatory with preregistration)
  - `develop` — Building/implementing artifacts

- **`schemas/planning-fixture.ts`** — Types for test fixtures (`PlanningFixture`, `ExpectedPlanCharacteristics`)

- **`constants.ts`** — 12 agent capability profiles with `canHandle` mappings for executor assignment
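
To make the shapes above more concrete, here is a minimal sketch of the four step types as a plain TypeScript discriminated union, together with a `canHandle` lookup for executor assignment. The field names (`id`, `description`, `dependsOn`, `mode`), the two example agent ids, and the `findExecutors` helper are assumptions for illustration only. The real schema is a Zod schema in `schemas/plan-spec.ts` and may be structured differently.

```typescript
// Hypothetical shapes; the real PlanSpec is defined with Zod in schemas/plan-spec.ts.
type StepType = "research" | "synthesize" | "experiment" | "develop";

interface BaseStep {
  id: string;
  description: string;
  dependsOn: string[]; // ids of steps that must finish first (assumed field name)
}

// One variant per step type; `mode` only exists where the PR describes modes.
type PlanStep =
  | (BaseStep & { type: "research" })
  | (BaseStep & { type: "synthesize"; mode: "integrative" | "evaluative" })
  | (BaseStep & {
      type: "experiment";
      mode: "exploratory" | "confirmatory";
      preregisteredCommitments?: string[];
    })
  | (BaseStep & { type: "develop" });

// Assumed profile shape: each agent lists the step types it can execute.
interface AgentProfile {
  id: string;
  canHandle: StepType[];
}

// Return the ids of every agent whose profile covers the step's type.
function findExecutors(step: PlanStep, agents: AgentProfile[]): string[] {
  return agents
    .filter((agent) => agent.canHandle.some((t) => t === step.type))
    .map((agent) => agent.id);
}

// Example agents (made-up ids, not the 12 profiles from constants.ts).
const exampleAgents: AgentProfile[] = [
  { id: "literature-searcher", canHandle: ["research"] },
  { id: "generalist", canHandle: ["research", "synthesize", "develop"] },
];

const surveyStep: PlanStep = {
  id: "s1",
  type: "research",
  description: "Survey prior work",
  dependsOn: [],
};

// Both example agents can take a research step.
console.log(findExecutors(surveyStep, exampleAgents));
```

Keeping `mode` and `preregisteredCommitments` on the `experiment` variant only means the compiler rejects, for example, a `research` step that carries preregistration data.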

### Validation & Analysis

- **`tools/plan-validator.ts`** — 12 structural validation checks:
  - DAG validity (no cycles, valid references)
  - Executor compatibility
  - Preregistration requirements for confirmatory experiments
  - Input/output consistency

- **`tools/topology-analyzer.ts`** — DAG analysis utilities:
  - Entry/exit point detection
  - Critical path calculation
  - Parallel group identification
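
The DAG checks and topology utilities listed above can be illustrated with one small function. This is a sketch under assumptions, not the code in `tools/plan-validator.ts` or `tools/topology-analyzer.ts`: the `StepNode` shape, the `dependsOn` field, and the `analyzeTopology` name are all hypothetical. It uses Kahn-style layering, so a dangling reference or a cycle raises an error, and the layers double as parallel groups.

```typescript
// Minimal step shape for illustration; `dependsOn` is an assumed field name.
interface StepNode {
  id: string;
  dependsOn: string[];
}

interface Topology {
  entryPoints: string[]; // steps with no dependencies
  exitPoints: string[]; // steps that nothing else depends on
  parallelGroups: string[][]; // steps grouped by depth; each group can run concurrently
  criticalPathLength: number; // number of steps on the longest dependency chain
}

function analyzeTopology(steps: StepNode[]): Topology {
  const known = new Set<string>();
  steps.forEach((s) => known.add(s.id));

  // Reference check: every dependency must name an existing step.
  steps.forEach((s) =>
    s.dependsOn.forEach((dep) => {
      if (!known.has(dep)) {
        throw new Error(`Step "${s.id}" references unknown step "${dep}"`);
      }
    })
  );

  // Peel off layers of steps whose dependencies have all been placed already.
  const placed = new Set<string>();
  const parallelGroups: string[][] = [];
  let remaining = steps.slice();
  while (remaining.length > 0) {
    const ready = remaining.filter((s) => s.dependsOn.every((d) => placed.has(d)));
    // If nothing is ready while steps remain, the leftovers form a cycle.
    if (ready.length === 0) throw new Error("Cycle detected: plan is not a DAG");
    ready.forEach((s) => placed.add(s.id));
    parallelGroups.push(ready.map((s) => s.id));
    remaining = remaining.filter((s) => !placed.has(s.id));
  }

  const dependedOn = new Set<string>();
  steps.forEach((s) => s.dependsOn.forEach((d) => dependedOn.add(d)));

  return {
    entryPoints: steps.filter((s) => s.dependsOn.length === 0).map((s) => s.id),
    exitPoints: steps.filter((s) => !dependedOn.has(s.id)).map((s) => s.id),
    parallelGroups,
    // Each layer holds at least one step that depends on the previous layer,
    // so the layer count equals the longest chain length.
    criticalPathLength: parallelGroups.length,
  };
}

const examplePlan: StepNode[] = [
  { id: "research-a", dependsOn: [] },
  { id: "research-b", dependsOn: [] },
  { id: "synthesize", dependsOn: ["research-a", "research-b"] },
  { id: "develop", dependsOn: ["synthesize"] },
];

console.log(analyzeTopology(examplePlan).parallelGroups);
// [ [ 'research-a', 'research-b' ], [ 'synthesize' ], [ 'develop' ] ]
```

The critical path here is counted in steps. A version weighted by estimated step duration would need a cost per node, which the PR description does not specify.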

### Scoring System

- **`scorers/plan-scorers.ts`** — 4 deterministic scorers (no LLM, fast):
  - `scorePlanStructure` — DAG validity, parallelism, step type diversity
  - `scorePlanCoverage` — Requirement/hypothesis coverage
  - `scoreExperimentRigor` — Preregistration, success criteria
  - `scoreUnknownsCoverage` — Epistemic completeness

- **`scorers/plan-llm-scorers.ts`** — 3 LLM-based judges:
  - `goalAlignmentScorer` — Does plan address the goal?
  - `planGranularityScorer` — Are steps appropriately sized?
  - `hypothesisTestabilityScorer` — Are hypotheses testable?
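
As an illustration of what a deterministic scorer can look like, the sketch below computes a requirement coverage ratio. It is not the implementation of `scorePlanCoverage`: the `Requirement` and `CoverageStep` shapes, the `addresses` field, and the `coverageScore` name are assumptions, and the real scorer may weight requirements and hypotheses differently.

```typescript
// Hypothetical shapes for illustration only.
interface Requirement {
  id: string;
}

interface CoverageStep {
  id: string;
  addresses: string[]; // requirement ids this step claims to cover (assumed field)
}

interface CoverageResult {
  score: number; // fraction of requirements covered, between 0 and 1
  uncovered: string[]; // requirement ids that no step addresses
}

function coverageScore(requirements: Requirement[], steps: CoverageStep[]): CoverageResult {
  const covered = new Set<string>();
  steps.forEach((s) => s.addresses.forEach((r) => covered.add(r)));

  const uncovered = requirements.filter((r) => !covered.has(r.id)).map((r) => r.id);

  // An empty requirement list counts as fully covered, which avoids dividing by zero.
  const score =
    requirements.length === 0
      ? 1
      : (requirements.length - uncovered.length) / requirements.length;

  return { score, uncovered };
}

const exampleRequirements: Requirement[] = [{ id: "r1" }, { id: "r2" }, { id: "r3" }, { id: "r4" }];
const exampleSteps: CoverageStep[] = [
  { id: "s1", addresses: ["r1", "r2"] },
  { id: "s2", addresses: ["r2", "r3"] },
];

console.log(coverageScore(exampleRequirements, exampleSteps));
// { score: 0.75, uncovered: [ 'r4' ] }
```

Because the function is pure and makes no model call, the same plan always yields the same score, which is what makes this class of scorer cheap enough to run on every fixture.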

### Planning Agent

- **`agents/planner-agent.ts`** — `generatePlan(goal, context)` function that uses structured output to produce valid `PlanSpec` instances

### Test Fixtures

4 fixtures of increasing complexity in `fixtures/decomposition-prompts/`:

| Fixture | Complexity | Step Types |
|---------|------------|------------|
| `summarize-papers` | Simple linear | research → synthesize |
| `explore-and-recommend` | Parallel research | research (parallel) → synthesize (evaluative) |
| `hypothesis-validation` | With experiments | research → experiment → synthesize |
| `ct-database-goal` | Full R&D cycle | All 4 types, hypotheses, experiments |

### E2E Test Suite

- **`workflows/planning-workflow.test.ts`** — Comprehensive E2E tests:
  - Runs all 4 fixtures through the full pipeline
  - Validates generated plans
  - Runs deterministic scorers
  - Optional LLM scorers via `RUN_LLM_SCORERS=true`
  - Generates summary report with score table

## Pre-Merge Checklist 🚀

### 🚢 Has this modified a publishable library?

This PR:

- [x] does not modify any publishable blocks or libraries, or modifications do not need publishing

### 📜 Does this require a change to the docs?

The changes in this PR:

- [x] are internal and do not require a docs change

### 🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

- [x] do not affect the execution graph

## ⚠️ Known issues

1. **ct-database-goal fixture fails validation** — The LLM occasionally generates confirmatory experiments without `preregisteredCommitments`. This is a known prompt engineering issue that will be addressed in the revision workflow.

2. **explore-and-recommend generates unexpected content** — The LLM adds hypotheses and experiments not specified in the fixture expectations. This is valid behavior (more thorough than minimum), but indicates fixture expectations may need adjustment.
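
The first known issue comes down to a single structural rule: a confirmatory experiment must carry preregistered commitments. A minimal sketch of such a check follows. The step shapes, the `ValidationIssue` type, the error code string, and the `checkPreregistration` name are assumptions for illustration; only the `preregisteredCommitments` field name comes from this PR.

```typescript
// Hypothetical step shapes; only `preregisteredCommitments` is named in the PR.
interface ExperimentStep {
  id: string;
  type: "experiment";
  mode: "exploratory" | "confirmatory";
  preregisteredCommitments?: string[];
}

interface NonExperimentStep {
  id: string;
  type: "research" | "synthesize" | "develop";
}

type CheckedStep = ExperimentStep | NonExperimentStep;

interface ValidationIssue {
  stepId: string;
  code: string; // made-up error code for this sketch
  message: string;
}

// Flag every confirmatory experiment whose commitments are missing or empty.
function checkPreregistration(steps: CheckedStep[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (const step of steps) {
    if (step.type === "experiment" && step.mode === "confirmatory") {
      const commitments = step.preregisteredCommitments ?? [];
      if (commitments.length === 0) {
        issues.push({
          stepId: step.id,
          code: "MISSING_PREREGISTRATION",
          message: `Confirmatory experiment "${step.id}" has no preregisteredCommitments`,
        });
      }
    }
  }
  return issues;
}

const stepsUnderCheck: CheckedStep[] = [
  { id: "r1", type: "research" },
  { id: "e1", type: "experiment", mode: "exploratory" },
  { id: "e2", type: "experiment", mode: "confirmatory" },
  { id: "e3", type: "experiment", mode: "confirmatory", preregisteredCommitments: ["primary metric fixed"] },
];

// Only e2 is reported: e1 is exploratory and e3 has a commitment.
console.log(checkPreregistration(stepsUnderCheck).map((issue) => issue.stepId));
```

Returning structured issues rather than a boolean keeps the messages available as feedback for a later regeneration attempt.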

## 🐾 Next steps

Per `PLAN-task-decomposition.md` Section 18:

1. **Revision workflow loop** — Implement `dountil` loop: generate → validate → feedback → regenerate (max 3 attempts)
2. **Supervisor agent** — LLM approval gate before plan finalization
3. **Prompt improvements** — Strengthen preregisteredCommitments requirement
4. **Stub execution** — Low priority, deferred
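
The revision loop in step 1 (generate → validate → feedback → regenerate, at most 3 attempts) can be sketched as a plain async loop. This deliberately does not use the workflow engine's `dountil` construct. The `generateWithRevision` name, the callback signatures, and the stub generator and validator are all assumptions for illustration.

```typescript
interface RevisionResult<P> {
  plan: P | null; // null when no attempt produced a valid plan
  attempts: number;
  errors: string[]; // validation errors from the last failed attempt, if any
}

// Generic loop: `generate` receives the previous attempt's validation errors as feedback.
async function generateWithRevision<P>(
  generate: (feedback: string[]) => Promise<P>,
  validate: (plan: P) => string[],
  maxAttempts: number = 3
): Promise<RevisionResult<P>> {
  let feedback: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const plan = await generate(feedback);
    const errors = validate(plan);
    if (errors.length === 0) {
      return { plan, attempts: attempt, errors: [] };
    }
    feedback = errors; // feed validation errors into the next generation call
  }
  return { plan: null, attempts: maxAttempts, errors: feedback };
}

// Stub usage: the first draft lacks preregistration, the second one fixes it.
interface DraftPlan {
  hasPreregistration: boolean;
}

const stubGenerate = async (feedback: string[]): Promise<DraftPlan> => ({
  hasPreregistration: feedback.length > 0,
});

const stubValidate = (plan: DraftPlan): string[] =>
  plan.hasPreregistration ? [] : ["confirmatory experiment lacks preregisteredCommitments"];

generateWithRevision(stubGenerate, stubValidate).then((result) => {
  console.log(result.attempts); // 2
});
```

Passing the generator and validator in as callbacks keeps the loop testable without a model call, and the attempt cap bounds cost when the planner keeps producing invalid plans.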

## 🛡 What tests cover this?

- `plan-validator.test.ts` — 25 negative fixture tests for validation
- `plan-scorers.test.ts` — 23 unit tests for deterministic scorers
- `plan-llm-scorers.test.ts` — 6 tests for LLM judges
- `fixtures.test.ts` — 4 fixture validation tests
- `planning-workflow.test.ts` — E2E pipeline tests (3/4 passing)

## ❓ How to test this?

1. Checkout the branch
2. `cd apps/hash-ai-agent`
3. Run unit tests: `npx vitest run src/mastra/scorers/plan-scorers.test.ts`
4. Run E2E tests: `npx vitest run src/mastra/workflows/planning-workflow.test.ts`
5. (Optional) Run with LLM scorers: `RUN_LLM_SCORERS=true npx vitest run src/mastra/workflows/planning-workflow.test.ts`

## 📹 Demo

See `agent/docs/E2E-test-results-2024-12-17.md` for full test output, including:

```
Deterministic Scores:
Fixture | Overall | Structure | Coverage | Rigor | Unknowns
-------------------------------------------------------------------------------------
summarize-papers | 93% | 77% | 100% | 100% | 93%
explore-and-recommend | 92% | 86% | 93% | 93% | 100%
hypothesis-validation | 95% | 86% | 100% | 95% | 100%
```

apps/hash-ai-agent/agent/docs/PROMPTS-meta-cognitive.md renamed to apps/hash-ai-agent/agent/plans/PROMPTS-meta-cognitive.md

File renamed without changes.
