diff --git a/apps/hash-ai-agent/agent/docs/E2E-test-results-2024-12-17.md b/apps/hash-ai-agent/agent/plans/E2E-test-results-2024-12-17.md
rename from apps/hash-ai-agent/agent/docs/E2E-test-results-2024-12-17.md
rename to apps/hash-ai-agent/agent/plans/E2E-test-results-2024-12-17.md
diff --git a/apps/hash-ai-agent/agent/docs/PLAN-task-decomposition.md b/apps/hash-ai-agent/agent/plans/PLAN-task-decomposition.md
rename from apps/hash-ai-agent/agent/docs/PLAN-task-decomposition.md
rename to apps/hash-ai-agent/agent/plans/PLAN-task-decomposition.md
Lines changed: 56 additions & 56 deletions
diff --git a/apps/hash-ai-agent/agent/plans/PR-task-decomposition-framework.md b/apps/hash-ai-agent/agent/plans/PR-task-decomposition-framework.md
Lines changed: 144 additions & 0 deletions
diff --git a/apps/hash-ai-agent/agent/docs/PROMPTS-meta-cognitive.md b/apps/hash-ai-agent/agent/plans/PROMPTS-meta-cognitive.md
rename from apps/hash-ai-agent/agent/docs/PROMPTS-meta-cognitive.md
rename to apps/hash-ai-agent/agent/plans/PROMPTS-meta-cognitive.md
@@ -115,19 +115,19 @@ This enables bounded dynamism — the loop structure is static, but the step fun
 
 ### 3.1 MVP Step Types (v1)
 
-| Type        | Category  | Description                                      | Parallelizable |
-| ----------- | --------- | ------------------------------------------------ | -------------- |
-| `research`  | Strategic | Gather existing knowledge                        | Yes            |
-| `synthesize`| Strategic | Combine results OR evaluate against criteria     | No             |
-| `experiment`| Tactical  | Test hypothesis empirically                      | Yes            |
-| `develop`   | Tactical  | Build/implement something                        | Sometimes      |
+| Type         | Category  | Description                                  | Parallelizable |
+| ------------ | --------- | -------------------------------------------- | -------------- |
+| `research`   | Strategic | Gather existing knowledge                    | Yes            |
+| `synthesize` | Strategic | Combine results OR evaluate against criteria | No             |
+| `experiment` | Tactical  | Test hypothesis empirically                  | Yes            |
+| `develop`    | Tactical  | Build/implement something                    | Sometimes      |
 
 ### 3.2 Future Step Types
 
-| Type          | Category  | Description                                      | Notes                           |
-| ------------- | --------- | ------------------------------------------------ | ------------------------------- |
-| `hypothesize` | Strategic | Generate testable hypotheses from findings       | Currently implicit in workflow  |
-| `transform`   | Tactical  | Pure data manipulation                           | Can be done via `develop`       |
+| Type          | Category  | Description                                | Notes                          |
+| ------------- | --------- | ------------------------------------------ | ------------------------------ |
+| `hypothesize` | Strategic | Generate testable hypotheses from findings | Currently implicit in workflow |
+| `transform`   | Tactical  | Pure data manipulation                     | Can be done via `develop`      |
 
 ### 3.3 Subsumed Types
 
@@ -663,17 +663,17 @@ apps/hash-ai-agent/src/mastra/
 
 ## 8. Key Design Decisions
 
-| Decision | Choice | Rationale |
-| -------- | ------ | --------- |
-| Schema complexity | MVP-first (4 step types) | LLMs struggle with deeply nested discriminated unions |
-| Execution mode | Interpreted (deferred) | Supports dynamism, but plan quality is primary focus |
-| Stub executors | Low priority | Plan quality scoring is the real signal |
-| Plan evaluation | LLM judge + structural | No single "correct" decomposition |
-| Hypotheses | Explicit top-level + referenced | Makes scientific structure visible |
-| Unknowns | Epistemically rigorous partition | Surfaces uncertainty rather than hiding it |
-| Experiment modes | Exploratory vs confirmatory | Reduces researcher degrees of freedom |
-| Synthesis modes | Integrative vs evaluative | Subsumes "assess" step type |
-| Decision points | Deferred to v2 | Simplifies MVP |
+| Decision          | Choice                           | Rationale                                             |
+| ----------------- | -------------------------------- | ----------------------------------------------------- |
+| Schema complexity | MVP-first (4 step types)         | LLMs struggle with deeply nested discriminated unions |
+| Execution mode    | Interpreted (deferred)           | Supports dynamism, but plan quality is primary focus  |
+| Stub executors    | Low priority                     | Plan quality scoring is the real signal               |
+| Plan evaluation   | LLM judge + structural           | No single "correct" decomposition                     |
+| Hypotheses        | Explicit top-level + referenced  | Makes scientific structure visible                    |
+| Unknowns          | Epistemically rigorous partition | Surfaces uncertainty rather than hiding it            |
+| Experiment modes  | Exploratory vs confirmatory      | Reduces researcher degrees of freedom                 |
+| Synthesis modes   | Integrative vs evaluative        | Subsumes "assess" step type                           |
+| Decision points   | Deferred to v2                   | Simplifies MVP                                        |
 
 ---
 
@@ -698,21 +698,21 @@ interface ExecutionState {
 
 ### 10.1 Deterministic Scorers
 
-| Scorer | Measures |
-| ------ | -------- |
-| `plan-structure` | DAG valid, refs exist, no orphans |
-| `plan-coverage` | All requirements addressed by steps |
-| `experiment-rigor` | Confirmatory experiments have preregistration |
+| Scorer              | Measures                                                        |
+| ------------------- | --------------------------------------------------------------- |
+| `plan-structure`    | DAG valid, refs exist, no orphans                               |
+| `plan-coverage`     | All requirements addressed by steps                             |
+| `experiment-rigor`  | Confirmatory experiments have preregistration                   |
 | `unknowns-coverage` | All three unknown categories populated, community check present |
 
 ### 10.2 LLM Judge Scorers
 
-| Scorer | Measures |
-| ------ | -------- |
-| `plan-testability` | Hypotheses can actually be tested |
-| `plan-granularity` | Steps appropriately scoped (not too broad, not too narrow) |
-| `plan-coherence` | Steps make sense together |
-| `plan-completeness` | No obvious gaps for the goal |
+| Scorer              | Measures                                                   |
+| ------------------- | ---------------------------------------------------------- |
+| `plan-testability`  | Hypotheses can actually be tested                          |
+| `plan-granularity`  | Steps appropriately scoped (not too broad, not too narrow) |
+| `plan-coherence`    | Steps make sense together                                  |
+| `plan-completeness` | No obvious gaps for the goal                               |
 
 ### 10.3 Phase 1 Approach
 
@@ -918,13 +918,13 @@ Validation is deterministic (DAG acyclic, refs exist). Scoring is evaluative (pl
 
 ### 15.1 Input Injection
 
-| Element | Location | Rationale |
-| ------- | -------- | --------- |
-| Goal | User message | Primary input, should be prominent |
-| Context | User message (after goal) | Supplements goal |
-| Available agents | System prompt | Static context the planner reasons about |
-| Constraints | User message (after context) | Goal-specific constraints |
-| Schema description | System prompt | Defines output structure |
+| Element            | Location                     | Rationale                                |
+| ------------------ | ---------------------------- | ---------------------------------------- |
+| Goal               | User message                 | Primary input, should be prominent       |
+| Context            | User message (after goal)    | Supplements goal                         |
+| Available agents   | System prompt                | Static context the planner reasons about |
+| Constraints        | User message (after context) | Goal-specific constraints                |
+| Schema description | System prompt                | Defines output structure                 |
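
The placement rules in the table above could be sketched as a small message builder. This is only an illustration: the `PlannerInput` shape, the `buildPlannerMessages` name, and the label strings are hypothetical and are not taken from the planner agent's actual code.

```typescript
// Hypothetical input shape; field names are illustrative assumptions.
interface PlannerInput {
  goal: string;
  context?: string;
  constraints?: string[];
  agents: { name: string; description: string }[];
  schemaDescription: string;
}

// Static material (agents, schema) goes to the system prompt;
// goal, then context, then constraints go to the user message, in that order.
function buildPlannerMessages(input: PlannerInput): { system: string; user: string } {
  const system = [
    "Available agents:",
    ...input.agents.map((a) => `- ${a.name}: ${a.description}`),
    "",
    "Output schema:",
    input.schemaDescription,
  ].join("\n");

  const userParts = [`Goal: ${input.goal}`];
  if (input.context) {
    userParts.push(`Context: ${input.context}`);
  }
  if (input.constraints && input.constraints.length > 0) {
    userParts.push(["Constraints:", ...input.constraints.map((c) => `- ${c}`)].join("\n"));
  }

  return { system, user: userParts.join("\n\n") };
}
```

Keeping the goal first in the user message follows the table's rationale that the primary input should be prominent, while the agent list and schema stay in the system prompt because they do not change per goal.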
 
 ### 15.2 Agent Presentation
 
@@ -980,26 +980,26 @@ See `docs/PROMPTS-meta-cognitive.md` for research-planning and experiment-design
 
 ### Completed (Phase 1-4)
 
-| Component | File | Status | Tests |
-|-----------|------|--------|-------|
-| PlanSpec schema | `schemas/plan-spec.ts` | ✅ Complete | — |
-| Agent profiles | `constants.ts` | ✅ Complete | — |
-| Plan validator | `tools/plan-validator.ts` | ✅ Complete | 25 tests |
-| Topology analyzer | `tools/topology-analyzer.ts` | ✅ Complete | — |
-| Planner agent | `agents/planner-agent.ts` | ✅ Complete | — |
-| Positive fixtures (4) | `fixtures/decomposition-prompts/` | ✅ Complete | — |
-| Negative fixtures | `tools/plan-validator.test.ts` | ✅ Complete | 25 tests |
-| Deterministic scorers (4) | `scorers/plan-scorers.ts` | ✅ Complete | 23 tests |
-| LLM judge scorers (3) | `scorers/plan-llm-scorers.ts` | ✅ Complete | 6 tests |
+| Component                 | File                              | Status     | Tests    |
+| ------------------------- | --------------------------------- | ---------- | -------- |
+| PlanSpec schema           | `schemas/plan-spec.ts`            | ✅ Complete | —        |
+| Agent profiles            | `constants.ts`                    | ✅ Complete | —        |
+| Plan validator            | `tools/plan-validator.ts`         | ✅ Complete | 25 tests |
+| Topology analyzer         | `tools/topology-analyzer.ts`      | ✅ Complete | —        |
+| Planner agent             | `agents/planner-agent.ts`         | ✅ Complete | —        |
+| Positive fixtures (4)     | `fixtures/decomposition-prompts/` | ✅ Complete | —        |
+| Negative fixtures         | `tools/plan-validator.test.ts`    | ✅ Complete | 25 tests |
+| Deterministic scorers (4) | `scorers/plan-scorers.ts`         | ✅ Complete | 23 tests |
+| LLM judge scorers (3)     | `scorers/plan-llm-scorers.ts`     | ✅ Complete | 6 tests  |
 
 ### Not Yet Implemented
 
-| Component | File | Status | Notes |
-|-----------|------|--------|-------|
-| Planning workflow loop | `workflows/planning-workflow.ts` | ⚠️ Stub only | No `.dountil()` revision logic |
-| Supervisor agent | `agents/supervisor-agent.ts` | ❌ Not started | LLM approval gate |
-| End-to-end tests | `workflows/planning-workflow.test.ts` | ⚠️ Partial | Uses hand-crafted plans, not generated |
-| Stub execution (Level 1) | — | ❌ Not started | Low priority |
+| Component                | File                                  | Status        | Notes                                  |
+| ------------------------ | ------------------------------------- | ------------- | -------------------------------------- |
+| Planning workflow loop   | `workflows/planning-workflow.ts`      | ⚠️ Stub only   | No `.dountil()` revision logic         |
+| Supervisor agent         | `agents/supervisor-agent.ts`          | ❌ Not started | LLM approval gate                      |
+| End-to-end tests         | `workflows/planning-workflow.test.ts` | ⚠️ Partial     | Uses hand-crafted plans, not generated |
+| Stub execution (Level 1) | —                                     | ❌ Not started | Low priority                           |
 
 ---
 
 
@@ -0,0 +1,144 @@
+# PR: Task Decomposition & Planning Framework
+
+## 🌟 What is the purpose of this PR?
+
+Introduces a framework for decomposing complex R&D goals into structured, executable plans using LLM-based planning agents. The core insight is treating LLM planning as a "compiler front-end" that produces an **Intermediate Representation (IR)** — the `PlanSpec` — which can be validated, scored, and eventually compiled into executable workflows.
+
+This PR establishes the foundational infrastructure for plan generation and quality evaluation, with the goal of enabling autonomous research and development workflows.
+
+## 🔗 Related links
+
+- `agent/plans/PLAN-task-decomposition.md` — Full design document and implementation plan
+- `agent/plans/E2E-test-results-2024-12-17.md` — Latest E2E test outputs
+
+## 🚫 Blocked by
+
+_None_
+
+## 🔍 What does this change?
+
+### Core Schema & Types
+
+- **`schemas/plan-spec.ts`** — Full Zod schema for `PlanSpec` with 4 step types:
+  - `research` — Parallelizable information gathering
+  - `synthesize` — Combining findings (integrative) or evaluating results (evaluative)
+  - `experiment` — Testing hypotheses (exploratory or confirmatory with preregistration)
+  - `develop` — Building/implementing artifacts
+
+- **`schemas/planning-fixture.ts`** — Types for test fixtures (`PlanningFixture`, `ExpectedPlanCharacteristics`)
+
+- **`constants.ts`** — 12 agent capability profiles with `canHandle` mappings for executor assignment
+
+### Validation & Analysis
+
+- **`tools/plan-validator.ts`** — 12 structural validation checks:
+  - DAG validity (no cycles, valid references)
+  - Executor compatibility
+  - Preregistration requirements for confirmatory experiments
+  - Input/output consistency
+
+- **`tools/topology-analyzer.ts`** — DAG analysis utilities:
+  - Entry/exit point detection
+  - Critical path calculation
+  - Parallel group identification
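
As a rough illustration of the DAG checks and topology analysis listed above, the sketch below uses Kahn's algorithm to detect dangling references and cycles, and groups steps into levels that could run in parallel. It does not reproduce `plan-validator.ts` or `topology-analyzer.ts`; the `StepNode` shape, the `dependsOn` field, and the `analyzeDag` name are assumptions made for the example.

```typescript
// Hypothetical minimal step shape; the real PlanSpec schema is richer.
interface StepNode {
  id: string;
  dependsOn: string[];
}

interface DagReport {
  valid: boolean;
  errors: string[];
  entryPoints: string[];
  exitPoints: string[];
  parallelGroups: string[][];
}

function analyzeDag(steps: StepNode[]): DagReport {
  const errors: string[] = [];
  const ids = new Set(steps.map((s) => s.id));
  if (ids.size !== steps.length) {
    errors.push("Duplicate step ids");
  }

  const remainingDeps = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const step of steps) dependents.set(step.id, []);

  // Reference check: every dependency must name an existing step.
  for (const step of steps) {
    let count = 0;
    for (const dep of step.dependsOn) {
      if (!ids.has(dep)) {
        errors.push(`Step "${step.id}" references unknown step "${dep}"`);
        continue;
      }
      dependents.get(dep)!.push(step.id);
      count += 1;
    }
    remainingDeps.set(step.id, count);
  }

  const entryPoints = steps.filter((s) => remainingDeps.get(s.id) === 0).map((s) => s.id);
  const exitPoints = steps.filter((s) => dependents.get(s.id)!.length === 0).map((s) => s.id);

  // Kahn's algorithm, processed level by level: each level is a set of
  // steps whose dependencies are all satisfied, so they could run together.
  const parallelGroups: string[][] = [];
  let frontier = [...entryPoints];
  let visited = 0;
  while (frontier.length > 0) {
    parallelGroups.push(frontier);
    visited += frontier.length;
    const next: string[] = [];
    for (const id of frontier) {
      for (const child of dependents.get(id)!) {
        const left = remainingDeps.get(child)! - 1;
        remainingDeps.set(child, left);
        if (left === 0) next.push(child);
      }
    }
    frontier = next;
  }
  if (visited < steps.length) {
    errors.push("Cycle detected: some steps can never become ready");
  }

  return { valid: errors.length === 0, errors, entryPoints, exitPoints, parallelGroups };
}
```

For a plan shaped like the `explore-and-recommend` fixture (two independent research steps feeding one synthesize step), this would report two levels: the research steps together, then the synthesis step.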
+
+### Scoring System
+
+- **`scorers/plan-scorers.ts`** — 4 deterministic scorers (no LLM, fast):
+  - `scorePlanStructure` — DAG validity, parallelism, step type diversity
+  - `scorePlanCoverage` — Requirement/hypothesis coverage
+  - `scoreExperimentRigor` — Preregistration, success criteria
+  - `scoreUnknownsCoverage` — Epistemic completeness
+
+- **`scorers/plan-llm-scorers.ts`** — 3 LLM-based judges:
+  - `goalAlignmentScorer` — Does plan address the goal?
+  - `planGranularityScorer` — Are steps appropriately sized?
+  - `hypothesisTestabilityScorer` — Are hypotheses testable?
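
To make the idea of a deterministic scorer concrete, here is a minimal sketch of a preregistration check in the spirit of `scoreExperimentRigor`. It is not the scorer from `plan-scorers.ts`: the `ExperimentStep` shape, the `mode` field name, the `scorePreregistration` name, and the scoring formula are all assumptions; only the exploratory/confirmatory distinction and `preregisteredCommitments` come from this PR.

```typescript
// Hypothetical, simplified experiment step.
interface ExperimentStep {
  id: string;
  mode: "exploratory" | "confirmatory";
  preregisteredCommitments?: string[];
}

// Returns the fraction of confirmatory experiments that carry at least one
// preregistered commitment, plus the ids of the ones that do not.
function scorePreregistration(
  experiments: ExperimentStep[],
): { score: number; missing: string[] } {
  const confirmatory = experiments.filter((e) => e.mode === "confirmatory");
  const missing = confirmatory
    .filter((e) => (e.preregisteredCommitments ?? []).length === 0)
    .map((e) => e.id);

  // Assumption: a plan with no confirmatory experiments is not penalized.
  const score =
    confirmatory.length === 0 ? 1 : 1 - missing.length / confirmatory.length;

  return { score, missing };
}
```

Because the check involves no LLM call, it is cheap to run on every generated plan, and the `missing` list could double as feedback for a later regeneration attempt.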
+
+### Planning Agent
+
+- **`agents/planner-agent.ts`** — `generatePlan(goal, context)` function that uses structured output to produce valid `PlanSpec` instances
+
+### Test Fixtures
+
+4 fixtures of increasing complexity in `fixtures/decomposition-prompts/`:
+
+| Fixture                 | Complexity        | Step Types                                    |
+| ----------------------- | ----------------- | --------------------------------------------- |
+| `summarize-papers`      | Simple linear     | research → synthesize                         |
+| `explore-and-recommend` | Parallel research | research (parallel) → synthesize (evaluative) |
+| `hypothesis-validation` | With experiments  | research → experiment → synthesize            |
+| `ct-database-goal`      | Full R&D cycle    | All 4 types, hypotheses, experiments          |
+
+### E2E Test Suite
+
+- **`workflows/planning-workflow.test.ts`** — Comprehensive E2E tests:
+  - Runs all 4 fixtures through the full pipeline
+  - Validates generated plans
+  - Runs deterministic scorers
+  - Optional LLM scorers via `RUN_LLM_SCORERS=true`
+  - Generates summary report with score table
+
+## Pre-Merge Checklist 🚀
+
+### 🚢 Has this modified a publishable library?
+
+This PR:
+
+- [x] does not modify any publishable blocks or libraries, or modifications do not need publishing
+
+### 📜 Does this require a change to the docs?
+
+The changes in this PR:
+
+- [x] are internal and do not require a docs change
+
+### 🕸️ Does this require a change to the Turbo Graph?
+
+The changes in this PR:
+
+- [x] do not affect the execution graph
+
+## ⚠️ Known issues
+
+1. **ct-database-goal fixture fails validation** — The LLM occasionally generates confirmatory experiments without `preregisteredCommitments`. This is a known prompt engineering issue that will be addressed in the revision workflow.
+
+2. **explore-and-recommend generates unexpected content** — The LLM adds hypotheses and experiments not specified in the fixture expectations. This is valid behavior (more thorough than the minimum), but it indicates that the fixture expectations may need adjustment.
+
+## 🐾 Next steps
+
+Per `PLAN-task-decomposition.md` Section 18:
+
+1. **Revision workflow loop** — Implement `dountil` loop: generate → validate → feedback → regenerate (max 3 attempts)
+2. **Supervisor agent** — LLM approval gate before plan finalization
+3. **Prompt improvements** — Strengthen the `preregisteredCommitments` requirement
+4. **Stub execution** — Low priority, deferred
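
The generate, validate, feedback, regenerate cycle from step 1 can be sketched in plain TypeScript as a bounded retry loop. This is an illustration of the control flow only, not the planned Mastra `dountil` implementation; `reviseUntilValid`, `Validation`, and `RevisionResult` are hypothetical names, and the generator and validator are passed in as parameters rather than tied to `generatePlan` or the real validator.

```typescript
type Validation = { valid: boolean; errors: string[] };

interface RevisionResult<P> {
  plan: P;
  attempts: number;
  valid: boolean;
}

// Generates a plan, validates it, and regenerates with the validation errors
// as feedback until the plan is valid or maxAttempts is reached.
async function reviseUntilValid<P>(
  generate: (feedback: string[]) => Promise<P>,
  validate: (plan: P) => Validation,
  maxAttempts = 3,
): Promise<RevisionResult<P>> {
  let feedback: string[] = [];
  let plan = await generate(feedback);
  let attempts = 1;
  let result = validate(plan);

  while (!result.valid && attempts < maxAttempts) {
    feedback = result.errors; // feed validator errors into the next attempt
    plan = await generate(feedback);
    attempts += 1;
    result = validate(plan);
  }

  // The last plan is returned even if it is still invalid, so the caller
  // can decide whether to fail or escalate (for example to a supervisor).
  return { plan, attempts, valid: result.valid };
}
```

Capping the loop keeps cost predictable when the model repeatedly omits a required field such as `preregisteredCommitments`.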
+
+## 🛡 What tests cover this?
+
+- `plan-validator.test.ts` — 25 negative fixture tests for validation
+- `plan-scorers.test.ts` — 23 unit tests for deterministic scorers
+- `plan-llm-scorers.test.ts` — 6 tests for LLM judges
+- `fixtures.test.ts` — 4 fixture validation tests
+- `planning-workflow.test.ts` — E2E pipeline tests (3/4 passing)
+
+## ❓ How to test this?
+
+1. Checkout the branch
+2. `cd apps/hash-ai-agent`
+3. Run unit tests: `npx vitest run src/mastra/scorers/plan-scorers.test.ts`
+4. Run E2E tests: `npx vitest run src/mastra/workflows/planning-workflow.test.ts`
+5. (Optional) Run with LLM scorers: `RUN_LLM_SCORERS=true npx vitest run src/mastra/workflows/planning-workflow.test.ts`
+
+## 📹 Demo
+
+See `agent/plans/E2E-test-results-2024-12-17.md` for full test output, including:
+
+```
+Deterministic Scores:
+  Fixture                     | Overall | Structure | Coverage | Rigor | Unknowns
+  -------------------------------------------------------------------------------------
+  summarize-papers             |     93% |       77% |     100% |  100% |      93%
+  explore-and-recommend        |     92% |       86% |      93% |   93% |     100%
+  hypothesis-validation        |     95% |       86% |     100% |   95% |     100%
+```