An MCP server for recursive code refinement with LLM development tools, inspired by TRM (Test-time Recursive Memory).
This server implements a recursive improvement cycle where:
- The LLM (Claude Code, Cursor, Codex CLI) acts as the optimizer proposing code changes
- This MCP server acts as the critic/evaluator with stateful tracking
- Evaluations include: build, test, lint, and benchmark
- Scores are tracked using EMA (Exponential Moving Average)
- A halting policy (simplified ACT - Adaptive Computation Time) determines when to stop
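In code terms, the cycle looks roughly like this from the client's side (a sketch; `trm` stands for a hypothetical MCP client handle bound to this server's tools):

```ts
// Sketch of the refinement cycle; `trm` is a hypothetical MCP client handle.
const session = await trm.startSession({ repoPath: "/absolute/path/to/project" });

let result;
do {
  // The LLM (optimizer) proposes a candidate; this server (critic) evaluates it.
  result = await trm.submitCandidate({
    sessionId: session.sessionId,
    candidate: { mode: "modify", changes: [/* proposed edits */] },
    rationale: "What changed and why (reasoning carried across steps)",
  });
  // result.score, result.emaScore, and result.feedback guide the next proposal.
} while (!result.shouldHalt);
```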
- Multi-signal evaluation: Build, test, lint, and performance benchmarking
- Weighted scoring: Configurable weights for different evaluation signals
- EMA tracking: Smooth score tracking across iterations
- Intelligent halting: Stop when tests pass + score threshold, no improvement, or max steps
- Flexible candidate submission: Support for multiple modes (files, patch, diff, modify, create)
- Safe execution: Commands run in isolated directories with configurable timeouts
- Actionable feedback: Compact, LLM-friendly error messages with TypeScript parsing and correlation
- Advanced features: Quick undo, incremental file reading, AI-powered fix suggestions
```
npm install
npm run build
```

VS Code:
- Open VS Code settings
- Navigate to "Model Context Protocol"
- Add a new MCP Server:
  - Command: `node /absolute/path/to/code_trm_mcp/dist/server.js`
  - Args: (leave empty)

Cursor:
- Open Settings → MCP / "Custom MCP Servers"
- Add server:
  - Command: `node /absolute/path/to/code_trm_mcp/dist/server.js`

Or register the server via JSON configuration:

```json
{
  "command": "node",
  "args": ["/absolute/path/to/code_trm_mcp/dist/server.js"]
}
```

`startSession`: Initialize a TRM session on a local repository.
Parameters:
- `repoPath` (required): Absolute path to project
- `buildCmd`, `testCmd`, `lintCmd`, `benchCmd`: Evaluation commands
- `timeoutSec`: Timeout per command (default: 120)
- `weights`: Score weights (defaults: build 0.3, test 0.5, lint 0.1, perf 0.1)
- `halt`: Halting policy (`maxSteps`, `passThreshold`, `patienceNoImprove`, `minSteps`)
- `emaAlpha`: EMA smoothing factor (default: 0.9)
- `zNotes`: Optional initial reasoning notes
- `preflight`: Run validation checks (default: false)
Returns: sessionId, message, optional preflight results
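For example, a session weighted toward tests that halts on a plateau (a sketch; commands and numbers are illustrative):

```ts
const session = await trm.startSession({
  repoPath: "/absolute/path/to/project",
  buildCmd: "tsc -p . --noEmit",
  testCmd: "npm test --silent",
  lintCmd: "npx eslint src",
  timeoutSec: 120,
  weights: { build: 0.3, test: 0.5, lint: 0.1, perf: 0.1 }, // defaults shown
  halt: { maxSteps: 12, passThreshold: 0.97, patienceNoImprove: 3, minSteps: 1 },
  emaAlpha: 0.9,
  preflight: true,
});
console.log(session.sessionId, session.message);
```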
`submitCandidate`: Apply candidate changes, run evaluation, and return feedback.
Parameters:
- `sessionId` (required)
- `candidate` (required): One of these modes:
  - `files`: Complete file contents
  - `patch`: Unified diff format
  - `diff`: Per-file diffs
  - `modify`: Semantic edit operations
  - `create`: New files only
- `rationale`: LLM reasoning notes
Returns: step, score, emaScore, bestScore, tests, okBuild, okLint, shouldHalt, reasons, feedback, modeSuggestion
Key Features:
- Error correlation showing which iteration caused errors
- Intelligent mode suggestions based on change patterns
- TypeScript error parsing with actionable suggestions
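For example, a small `modify`-mode submission (a sketch; edit shapes follow the modify example later in this document):

```ts
const result = await trm.submitCandidate({
  sessionId: session.sessionId,
  candidate: {
    mode: "modify",
    changes: [{
      file: "src/index.ts",
      edits: [{ type: "replace", oldText: "err: any", newText: "err: unknown", all: true }],
    }],
  },
  rationale: "Tighten error typing flagged by the linter",
});
if (result.shouldHalt) console.log("Halting:", result.reasons);
```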
`getFileContent`: Read current file state with metadata.
Parameters:
- `sessionId`, `paths` (required)
- `offset`, `limit`: Optional line range
Returns: File contents with metadata (lineCount, sizeBytes, lastModified)
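For instance (a sketch mirroring the workflow example later in this document):

```ts
const { files } = await trm.getFileContent({
  sessionId: session.sessionId,
  paths: ["src/parser.ts"],
});
// Metadata helps avoid out-of-range line numbers in later edits.
console.log(files["src/parser.ts"].metadata.lineCount);
```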
Return current session state snapshot.
Returns: sessionId, step, emaScore, bestScore, noImproveStreak, last, zNotes
Check halting decision.
Returns: shouldHalt, reasons
Clean up session.
Returns: ok
`validateCandidate`: Dry-run validation with a detailed preview before applying changes.
Parameters: sessionId, candidate
Returns: valid, errors, warnings, preview (filesAffected, linesAdded/Removed/Modified, before/after previews)
Benefits:
- Catch errors before submission (invalid line numbers, duplicates)
- See exactly what will change with before/after context
- Significantly reduces failed iterations
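For example, catching an out-of-range edit before it costs an iteration (a sketch):

```ts
const validation = await trm.validateCandidate({
  sessionId: session.sessionId,
  candidate: {
    mode: "modify",
    changes: [{
      file: "src/parser.ts",
      edits: [{ type: "insertAfter", line: 9999, content: "// ..." }],
    }],
  },
});
if (!validation.valid) {
  // Errors surface here (e.g., an invalid line number) before anything is applied.
  console.log(validation.errors);
}
```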
Get AI-powered improvement suggestions based on evaluation results and code analysis.
Returns: Top 5 suggestions sorted by priority (critical → high → medium → low)
Save/restore session state for snapshot-based workflows.
Reset repository to initial git commit state.
`undoLastCandidate`: Quick undo with full state restoration.
Returns: message, currentStep, score, emaScore, filesRestored
How it works:
- Captures file contents before applying each candidate
- On undo: restores files, rolls back step counter, recalculates scores/EMA/streak
- No git commands needed - uses internal snapshots
Example:

```ts
// Submit fails badly (score drops from 0.85 to 0.25)
await trm.submitCandidate({ sessionId: "...", candidate: {...} });
// Immediately undo - back to previous state
await trm.undoLastCandidate({ sessionId: "..." });
// Session restored to previous step with score 0.85 ✅
```

`getFileLines`: Read a specific line range from a file with line numbers.
Parameters: sessionId, file, startLine, endLine
Returns: Lines with formatted line numbers, total lineCount
Benefits:
- 10-15% token savings on large files
- Line numbers included for easy reference
- Perfect for targeted fixes around error locations
Example:

```ts
// Error at line 50 - read context (lines 45-56)
const context = await trm.getFileLines({
sessionId: "...",
file: "src/parser.ts",
startLine: 45,
endLine: 56
});
// Returns: ["45: export function...", "46: try {", ...]
```

`suggestFix`: AI-powered fix candidate generation based on error analysis.
Supported errors: TS2304 (missing imports), TS7006 (implicit any), TS2339 (void property access)
Returns: Array of suggestions with priority, issue, candidateToFix, rationale
Example:

```ts
// Iteration fails with TypeScript errors
const result = await trm.submitCandidate({ /* ... */ });
// Get AI-generated fixes
const fixes = await trm.suggestFix({ sessionId: "..." });
// Apply suggested fix (or validate first)
await trm.submitCandidate({
sessionId: "...",
candidate: fixes.suggestions[0].candidateToFix,
rationale: fixes.suggestions[0].rationale
});
```

`reviewPR`: Perform detailed code review on pull requests from GitHub URLs or direct diffs.
Parameters:
- `prUrl`: GitHub PR URL (e.g., https://github.com/owner/repo/pull/123)
- `diff`: Direct unified diff content
- `files`: Array of files with `path`, `content`, and optional `originalContent`
- `focus`: Optional array to filter review categories
Focus categories:
- `type-safety`: Detect usage of `any` type
- `logging`: Flag console statements
- `todos`: Identify TODO/FIXME comments
- `code-quality`: Magic numbers, long lines
- `formatting`: Line length validation (>120 chars)
- `error-handling`: Missing try-catch in async functions
- `testing`: Suggest adding tests
- `size`: Flag large changesets
Returns:

```ts
{
summary: {
filesChanged: number,
linesAdded: number,
linesRemoved: number,
commentsCount: number,
criticalCount: number,
warningCount: number,
infoCount: number,
assessment: "approved" | "needs-changes" | "comments",
highlights: string[]
},
comments: [{
file: string,
line: number,
severity: "error" | "warning" | "info",
category: string,
message: string,
suggestion?: string
}],
issues: string[],
suggestions: string[],
prInfo?: { title?: string, url?: string }
}
```

Example:

```ts
// Review from GitHub URL
const review = await trm.reviewPR({
prUrl: "https://github.com/owner/repo/pull/123",
focus: ["type-safety", "error-handling", "code-quality"]
});
console.log(`Assessment: ${review.summary.assessment}`);
console.log(`Found ${review.comments.length} comments`);
// Review from direct diff
const review2 = await trm.reviewPR({
diff: "diff --git a/file.ts...",
focus: ["logging", "todos"]
});
```

```ts
const session = await trm.startSession({
repoPath: "/absolute/path/to/project",
buildCmd: "tsc -p . --noEmit",
testCmd: "npm test --silent -- --reporter=json",
preflight: true, // Validate setup before iterating
halt: { maxSteps: 12, passThreshold: 0.97, patienceNoImprove: 3 }
});
if (!session.preflight.initialBuild.success) {
console.log("Fix build before iterating");
return;
}
```

Key principles:
- Keep patches small and focused (one issue at a time)
- Maximize delta information per step (TRM philosophy)
- Use `rationale` to maintain context across steps
- Trust the score/feedback signals for guidance
Pattern:
- Get file metadata to avoid line number errors
- Validate changes before submitting
- Submit candidate with rationale
- If it fails: use `suggestFix` or `undoLastCandidate`
- Repeat until `shouldHalt` is true

```ts
// 1. Get file metadata
const { files } = await trm.getFileContent({
sessionId: session.sessionId,
paths: ["src/parser.ts"]
});
const lineCount = files["src/parser.ts"].metadata.lineCount;
// 2. Validate before submitting
const validation = await trm.validateCandidate({
sessionId: session.sessionId,
candidate: {
mode: "modify",
changes: [{
file: "src/parser.ts",
edits: [{ type: "insertAfter", line: lineCount, content: "..." }]
}]
}
});
if (!validation.valid) {
console.log("Fix errors:", validation.errors);
return;
}
// 3. Submit
const result = await trm.submitCandidate({
sessionId: session.sessionId,
candidate: validation.preview.candidate,
rationale: "Adding error handling"
});
// 4. Handle failures
if (!result.okBuild) {
// Try AI-generated fixes
const fixes = await trm.suggestFix({ sessionId: session.sessionId });
if (fixes.suggestions.length > 0) {
await trm.submitCandidate({
sessionId: session.sessionId,
candidate: fixes.suggestions[0].candidateToFix,
rationale: `Auto-fix: ${fixes.suggestions[0].rationale}`
});
} else {
// Or undo and try different approach
await trm.undoLastCandidate({ sessionId: session.sessionId });
}
}
// 5. For targeted fixes, read just relevant lines
if (result.feedback.includes("line 145")) {
const context = await trm.getFileLines({
sessionId: session.sessionId,
file: "src/parser.ts",
startLine: 135,
endLine: 155
});
// Use context with line numbers for precise fix
}
```

Recommended (new):
- `create`: New files only (validates that the file does not already exist)
- `modify`: Semantic edit operations (replace, insertBefore, insertAfter, replaceLine, deleteRange, etc.)
Example modify mode:

```ts
{
mode: "modify",
changes: [{
file: "src/server.ts",
edits: [
{ type: "replace", oldText: "err: any", newText: "err: unknown", all: true },
{ type: "insertAfter", line: 150, content: "const NEW_CONSTANT = 42;" }
]
}]
}
```

Legacy (still supported):
- `diff`: Per-file unified diffs (uses a custom fuzzy-matching patcher)
- `patch`: Single unified diff covering multiple files
- `files`: Complete file contents (for rewrites)
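A sketch of a legacy `patch`-mode submission (the `patch` field name on the candidate is an assumption, not confirmed by this document):

```ts
// Legacy patch-mode sketch; the `patch` field name is assumed.
await trm.submitCandidate({
  sessionId: session.sessionId,
  candidate: {
    mode: "patch",
    patch: [
      "--- a/src/index.ts",
      "+++ b/src/index.ts",
      "@@ -1,2 +1,2 @@",
      "-const retries = 3;",
      "+const retries = 5;",
      " export {};",
    ].join("\n"),
  },
  rationale: "Bump retry count",
});
```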
| Feature | Time Savings | Token Savings | Use Case |
|---|---|---|---|
| Quick Undo | 5-10% | - | Instant recovery from failed iterations |
| Incremental File Reading | 10-15% | 30-50% | Large files, focused edits |
| Auto-Suggest Fixes | 15-20% | - | TypeScript errors, common patterns |
| Pre-Apply Validation | 20-30% | - | Catch errors before submission |
| Error Correlation | 10-15% | - | Faster debugging with context |
| Combined Benefits | Up to 40% | 30-50% | Overall efficiency improvement |
Real-world impact:
- Significantly faster iteration sessions on error-heavy workloads
- Reduced token usage when working with large files
- Fewer wasted iterations due to validation and error correlation
The MCP tool schemas have been optimized to minimize token usage while preserving full functionality:
Optimization results:
- 4% reduction in total MCP token usage (384 tokens saved)
- 16 tools optimized from 9,618 tokens to 9,234 tokens
- No functionality loss - all parameters, types, and features unchanged
What was optimized:
- Concise tool descriptions without verbose explanations
- Removed redundant property descriptions
- Eliminated inline examples from schemas
- Streamlined text while maintaining clarity
Impact:
- Frees up 384 tokens in your context window
- Equivalent to ~100 additional lines of code context
- MCP protocol overhead (~7,200 tokens) remains the main bottleneck
Implementation:
Ultra-optimized schemas are active. For analysis and bigger token savings strategies, see TOKEN_OPTIMIZATION.md and token-comparison.md.
Score is a weighted average in [0, 1]:

```
score = (w.build * sBuild + w.test * sTests + w.lint * sLint + w.perf * sPerf) / sumWeights

where:
  sBuild = 1 if build succeeds, 0 otherwise
  sTests = passed / total (0 if tests fail to parse)
  sLint  = 1 if lint succeeds, 0 otherwise
  sPerf  = normalized performance score (best/current, lower is better)
```
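For instance, with the default weights, a passing build and lint, 8 of 10 tests green, and no `benchCmd` configured (assuming unconfigured signals drop out of `sumWeights`):

```
score = (0.3*1 + 0.5*0.8 + 0.1*1) / (0.3 + 0.5 + 0.1)
      = 0.8 / 0.9
      ≈ 0.889
```

With `emaAlpha = 0.9` and a previous EMA of 0.85, the smoothed score would move only slightly, to roughly 0.9 * 0.85 + 0.1 * 0.889 ≈ 0.854 (assuming the EMA weights `emaAlpha` toward the previous value).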
Iteration stops when:
- Success: `step >= minSteps` AND all tests pass AND `score >= passThreshold`
- Plateau: No improvement for `patienceNoImprove` consecutive steps
- Limit: `maxSteps` reached
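A sketch of this policy in TypeScript (a restatement of the documented rules, not the server's actual source):

```ts
interface HaltPolicy { maxSteps: number; passThreshold: number; patienceNoImprove: number; minSteps: number; }
interface StepState { step: number; score: number; allTestsPass: boolean; noImproveStreak: number; }

// Returns the same shape the halting check documents: shouldHalt plus reasons.
function checkHalt(s: StepState, h: HaltPolicy): { shouldHalt: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (s.step >= h.minSteps && s.allTestsPass && s.score >= h.passThreshold)
    reasons.push("success: tests pass and score >= passThreshold");
  if (s.noImproveStreak >= h.patienceNoImprove)
    reasons.push("plateau: no improvement for patienceNoImprove steps");
  if (s.step >= h.maxSteps)
    reasons.push("limit: maxSteps reached");
  return { shouldHalt: reasons.length > 0, reasons };
}
```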
- y (current solution): The repo state after each patch applied by the LLM
- z (latent reasoning): `rationale` and `zNotes` maintain context of how/why we reached the current state
- Deep supervision: Each `submitCandidate` is a refinement step; score/EMA provide objective feedback
- ACT halting: `shouldHalt` uses clear rules (tests pass + threshold, patience exhausted, maxSteps)
- Small patches: Maximize information per step (TRM principle)
- No training needed: Pure test-time refinement using existing dev tools
- Enable JSON test reporters (Jest/Vitest) for accurate score calculation
- Keep patches small to maximize information per step (TRM principle)
- Adjust `weights` based on your objective (e.g., more weight to `perf` when tests are green)
- Use a `benchCmd` that outputs a single number (e.g., milliseconds) for performance tracking
- For TypeScript: use `tsc --noEmit` as the `buildCmd` for fast type error detection
- Use preflight validation to catch setup issues before iterating
- Validate candidates before submitting to reduce failed iterations
- Use `getFileLines` on large files to save tokens
- Try `suggestFix` when stuck on TypeScript errors
- Use `undoLastCandidate` to quickly recover from bad changes
```
┌─────────────────────────────────────────────────────────────┐
│ LLM Client │
│ (Claude Code / Cursor / Codex CLI) │
│ │
│ • Proposes code changes (optimizer role) │
│ • Submits candidates via MCP tools │
│ • Interprets feedback and iterates │
└────────────────────┬────────────────────────────────────────┘
│ MCP Protocol
▼
┌─────────────────────────────────────────────────────────────┐
│ MCP TRM Server │
│ │
│ Session State: │
│ • Current score, EMA, best score │
│ • Test results, build status │
│ • Improvement streak tracking │
│ • History of evaluations │
│ • Candidate snapshots (for undo) │
│ │
│ Evaluation Pipeline: │
│ 1. Apply candidate changes │
│ 2. Run: build → test → lint → bench │
│ 3. Parse outputs, extract signals │
│ 4. Compute weighted score │
│ 5. Update EMA and improvement tracking │
│ 6. Check halting policy │
│ 7. Return structured feedback │
└────────────────────┬────────────────────────────────────────┘
│ Shell Commands
▼
┌─────────────────────────────────────────────────────────────┐
│ Target Repository │
│ │
│ • Source code files │
│ • Build system (tsc, webpack, etc.) │
│ • Test framework (jest, vitest, etc.) │
│ • Linter (eslint, etc.) │
│ • Benchmark scripts (optional) │
└─────────────────────────────────────────────────────────────┘
```
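The "Shell Commands" hop in the diagram could look like this minimal sketch (illustrative only; `runStep` is a hypothetical helper, not the server's internals), honoring the per-command `timeoutSec`:

```ts
import { execSync } from "node:child_process";

// Hypothetical sketch: run one evaluation command inside the target repo,
// enforcing a timeout and capturing success plus combined output for feedback.
function runStep(cmd: string, repoPath: string, timeoutSec = 120) {
  try {
    const output = execSync(cmd, { cwd: repoPath, timeout: timeoutSec * 1000, encoding: "utf8" });
    return { ok: true, output };
  } catch (err: unknown) {
    const e = err as { stdout?: string; stderr?: string };
    return { ok: false, output: `${e.stdout ?? ""}${e.stderr ?? ""}` };
  }
}
```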
This implementation is inspired by the Test-time Recursive Memory (TRM) approach from the paper:
"Recursive Introspection: Teaching Language Model Agents How to Self-Improve" (arXiv:2510.04871v1)
Key adaptations for MCP/LLM development:
- TRM's recursive refinement → Iterative code improvement with LLM proposals
- Latent reasoning (z) → Rationale/notes passed between iterations
- ACT halting → Configurable stopping policy based on score + improvement
- Deep supervision → Build/test/lint/perf signals as training-free feedback
MIT
Issues and pull requests welcome at the project repository.