
Enhance wiki generation to achieve DeepWiki-quality output#7

Open
GhostScientist wants to merge 12 commits into master from claude/document-jcl-jobs-qGVa9

Conversation

GhostScientist (Owner) commented Jan 16, 2026

Major improvements to the system prompt for generating architectural wikis:

  • Add hierarchical page structure guidance with domain-based organization
  • Introduce multi-phase "Deep Discovery" process before content generation
  • Add mandatory "Key Components" table format for all feature pages
  • Add "How It Works" sections with numbered workflow walkthroughs
  • Enhance source traceability requirements (2+ refs per concept)
  • Add comprehensive Mermaid diagram requirements by page type:
    • Architecture pages: 3+ diagrams (overview, data flow, integration)
    • Feature pages: 2+ diagrams (component interaction + workflow)
    • Data model pages: ER diagrams
    • Batch processing: job flow diagrams
    • State machine diagrams for stateful workflows
  • Add cross-reference diagram guidance for module interactions
  • Update quality checklist with content, diagram, and navigation sections
  • Add mainframe/COBOL-specific structure templates

This brings the wiki output closer to professional documentation
platforms like DeepWiki with better organization and traceability.


Note

Brings DeepWiki-quality docs, discovery, and new packaging.

  • Add src/discovery module to infer domains/relationships and generate hierarchical wiki plans, index, and component tables
  • Implement contextual retrieval (src/rag/contextual-retrieval.ts) with Claude/Ollama/local modes, caching, stats, and --contextual* CLI flags plus preview flow
  • Overhaul system prompt for hierarchical, traceable, diagram-rich documentation
  • Switch portable package format from .archiwiki to .semantics across CLI, format code, README/CHANGELOG, and logs; rename MCP server from archiwiki to semanticwiki
  • Extend LLM support: add completion family and Qwen2.5-Coder model; wire through provider/types
  • CLI: pass contextual options to generation; update pack/unpack defaults/messages
  • Dependencies: move faiss-node to optional deps; various version bumps

Written by Cursor Bugbot for commit 783eb50.

Update all references to the portable package format:
- Rename .archiwiki extension to .semantics
- Update CLI commands (pack/unpack) descriptions and defaults
- Update MCP server name from 'archiwiki' to 'semanticwiki'
- Update README and CHANGELOG documentation

This aligns the package format name with the project name.

Implements Anthropic's Contextual Retrieval technique to enhance chunk
understanding for better search results.

Features:
- New contextual-retrieval.ts module for context generation
- Support for both Claude API (with prompt caching) and local LLMs (Ollama)
- Context caching to avoid regeneration on subsequent runs
- Fallback to AST metadata when LLM unavailable

New CLI flags:
- --contextual: Enable contextual retrieval (uses Claude API)
- --contextual-local: Use local Ollama for context generation
- --contextual-model: Specify Claude model (default: claude-3-haiku)
- --contextual-ollama-model: Specify Ollama model (default: qwen2.5-coder:7b)

For each code chunk, generates a brief context explaining:
- What file/module the chunk belongs to
- What the specific code does in context
- Relationships to other parts of the codebase

This can reduce retrieval failures by up to 67% when combined with
existing hybrid search (BM25 + vectors) and reranking.

Reference: https://www.anthropic.com/news/contextual-retrieval
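The per-chunk context generation described above can be sketched in TypeScript. This is a minimal illustration, not the actual module code: the `Chunk` shape, function names, and prompt wording are assumptions; only the overall pattern (LLM prompt per chunk, with an AST-metadata fallback) comes from the commit message.

```typescript
// Hypothetical sketch of contextual retrieval's two paths:
// an LLM prompt per chunk, and a metadata fallback when no LLM is available.
interface Chunk {
  filePath: string;
  content: string;
  astSummary?: string; // e.g. "function parseConfig in src/config.ts"
}

// Build the prompt that asks the LLM to situate this chunk in the codebase.
function buildContextPrompt(chunk: Chunk): string {
  return [
    `<document file="${chunk.filePath}">`,
    chunk.content,
    `</document>`,
    "Give a short context situating this chunk within the overall file",
    "and codebase, for use in search retrieval. Answer with context only.",
  ].join("\n");
}

// Fallback when the LLM is unavailable: derive context from AST metadata.
function fallbackContext(chunk: Chunk): string {
  return chunk.astSummary ?? `Code from ${chunk.filePath}`;
}
```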

Adds a third mode for contextual retrieval that uses the bundled
node-llama-cpp inference engine, requiring no external services.

Three modes now available:
1. Claude API (--contextual): Uses Claude with prompt caching
2. Ollama (--contextual-local --use-ollama): Uses external Ollama server
3. Bundled local (--contextual-local): Uses node-llama-cpp, fully offline

The bundled local mode:
- Downloads and caches the model automatically (first run)
- Runs entirely offline after initial setup
- Uses the same model infrastructure as --full-local wiki generation
- Processes chunks sequentially to manage memory (concurrency=1)

Usage examples:
  # Fully local, no external services
  semanticwiki generate -r ./repo --full-local --contextual-local

  # Local with Ollama for faster inference
  semanticwiki generate -r ./repo --contextual-local --use-ollama

  # Cloud API (fastest, costs money)
  semanticwiki generate -r ./repo --contextual
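The three flag combinations above imply a simple resolution order. A hedged sketch (the option names mirror the CLI flags; the function itself is hypothetical, not the shipped code):

```typescript
type ContextualMode = "claude" | "ollama" | "bundled-local" | "disabled";

// Hypothetical flag resolution matching the three documented modes:
// Ollama wins when both local flags are set; --contextual alone means cloud.
function resolveMode(opts: {
  contextual?: boolean;
  contextualLocal?: boolean;
  useOllama?: boolean;
}): ContextualMode {
  if (opts.contextualLocal && opts.useOllama) return "ollama";
  if (opts.contextualLocal) return "bundled-local";
  if (opts.contextual) return "claude";
  return "disabled";
}
```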

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


- Add contextual retrieval properties to WikiGenerationOptions interface
- Fix this.apiKey -> this.config.apiKey reference
- Fix modelFamily type to use only 'gpt-oss' (only supported value)
- Fix LLMProvider.complete() -> LLMProvider.chat() method call

Don't pass localModel to bundled local provider - the qwen model name
was causing 'Model not found' errors. Let it use the default gpt-oss
model selection instead.

- Add 30 unit tests covering configuration, fallback context generation,
  content truncation, cache management, cost estimation, and mode selection
- Move faiss-node to optionalDependencies to fix npm install on systems
  without BLAS libraries (fallback similarity search is used when unavailable)
- Add system prompt for local LLM to guide context generation
- Fix "No sequences left" error by reinitializing provider after 15 chunks
- Detect and recover from sequence exhaustion automatically
- Track empty responses and use fallback context when LLM returns empty
- Add --contextual-preview flag to test enrichment on sample chunks
- Show detailed stats: success count, empty responses, errors, context resets

The preview command allows debugging contextual retrieval before full generation:
  semanticwiki generate -r ./repo --contextual-preview 10 --contextual-local

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


This major enhancement adds intelligent codebase analysis to generate
wiki documentation that mirrors the actual system architecture:

Discovery Module (src/discovery/index.ts):
- Detects project type (mainframe-cobol, web-app, api-service, etc.)
- Analyzes file patterns and content to identify logical domains
- Groups components by business function, not just file type
- Discovers relationships between domains for data flow documentation
- Generates hierarchical wiki structure with sections and subsections

Wiki Generation Integration:
- Runs discovery phase after indexing, before page generation
- Creates pages from discovered domains and sections
- Generates hierarchical index.md with domain-grouped navigation
- Adds relationship pages (Data Flow, Integration Points) for complex systems
- Maps domain categories to meaningful section names

Hierarchical Index Features:
- Project metadata header with technologies and type
- Domain-grouped sections (Core Application, Data Layer, Batch Processing, etc.)
- Quick reference table with all discovered domains
- Proper iconography for different page types
- Generation statistics in footer

This moves the wiki output from flat file-type lists to intelligent
domain-based organization that better represents how the system actually works.

- Add isDocumentationFile() to filter out README, LICENSE, etc. from component lists
- Add generateComponentTable() for DeepWiki-style summary tables (Job|Program|Function)
- Extract JCL program info (IDCAMS, IEBGENER, etc.) and infer job functions
- Add BMS screen and COBOL program function inference
- Integrate component tables into section overview and domain page contexts
- Only include relevant source code files in component listings

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.



console.log(chalk.green(` ✓ ${statusCounts.success} successful`));
if (statusCounts.empty > 0) console.log(chalk.yellow(` ⚠ ${statusCounts.empty} empty responses`));
if (statusCounts.error > 0) console.log(chalk.red(` ✗ ${statusCounts.error} errors`));


Fallback status never displayed in contextual preview

Low Severity

The statusCounts object includes a fallback key, and the previewEnrichment method returns samples with status 'success' | 'empty' | 'fallback' | 'error'. However, the display logic only outputs counts for success, empty, and error — the fallback count is never shown to users. Additionally, individual samples with fallback status are styled with the error icon () and red color since the ternary chain treats any non-success/empty status as an error.
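A sketch of the kind of fix this issue calls for: surface the fallback count and give fallback samples their own icon instead of falling through to the error styling. The function names, the `↩` icon, and the plain-string output (no chalk coloring) are illustrative assumptions.

```typescript
type SampleStatus = "success" | "empty" | "fallback" | "error";

// Hypothetical fix: every status gets an explicit icon, so fallback is no
// longer lumped in with errors by a catch-all ternary.
function formatStatusLine(status: SampleStatus): string {
  const icons: Record<SampleStatus, string> = {
    success: "✓",
    empty: "⚠",
    fallback: "↩",
    error: "✗",
  };
  return `${icons[status]} ${status}`;
}

// Summary now includes the previously-dropped fallback count.
function summarize(counts: Record<SampleStatus, number>): string[] {
  const lines = [`✓ ${counts.success} successful`];
  if (counts.empty > 0) lines.push(`⚠ ${counts.empty} empty responses`);
  if (counts.fallback > 0) lines.push(`↩ ${counts.fallback} fallback contexts`);
  if (counts.error > 0) lines.push(`✗ ${counts.error} errors`);
  return lines;
}
```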



console.log(chalk.gray('Cloning repository...'));
const git = simpleGit.simpleGit();
await git.clone(options.repo, repoDir, ['--depth', '1']);
}


Local paths with '@' incorrectly treated as remote URLs

Medium Severity

The check options.repo.includes('@') used to detect SSH-style Git URLs (like git@github.com:user/repo) is too broad. It incorrectly matches local filesystem paths containing '@' characters (e.g., /home/user@work/project). This causes the code to attempt git clone on what is actually a local path, resulting in either a confusing error if the path isn't a git repository, or unnecessary cloning if it is. The same condition is used for cleanup, potentially attempting to delete the wrong directory.
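One way to tighten the check, sketched under assumptions (the helper name is hypothetical; the regexes are one reasonable reading of "scp-like URL or explicit scheme", not the project's chosen fix):

```typescript
// Hypothetical stricter test: treat a repo string as remote only when it has
// an explicit scheme or scp-like "user@host:path" syntax, so local paths
// containing '@' (e.g. /home/user@work/project) are left alone.
function isRemoteGitUrl(repo: string): boolean {
  if (/^(https?|git|ssh):\/\//.test(repo)) return true;
  // scp-like syntax must start with user@host: — an absolute or relative
  // local path begins with '/' or '.', which the first character class rejects
  return /^[^/@\s]+@[^/@\s]+:/.test(repo);
}
```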



// Clean up temp dir if we cloned
if (options.repo.startsWith('http') || options.repo.includes('@')) {
fsModule.rmSync(repoDir, { recursive: true, force: true });
}


Cloned temp directory not cleaned up on error

Low Severity

When using --contextual-preview with a remote repository, the code clones to /tmp/semanticwiki-preview-* at line 345, but cleanup at lines 426-428 only runs if all intermediate operations succeed. There's no try/finally block, so if glob(), ContextualRetrieval initialization, initialize(), or previewEnrichment() throws an error, the cloned temp directory is never deleted. Repeated failures would leave orphaned directories in /tmp.


GPT-OSS has function-calling tokens (<|call|>, <|return|>) that cause
node-llama-cpp to emit stop tokens immediately, resulting in ~95% empty
responses for simple text completion tasks.

- Add 'completion' model family with Qwen2.5-Coder-1.5B-Instruct (1.68GB)
- Update contextual retrieval to use completion model instead of gpt-oss
- Keep gpt-oss for main wiki generation (works well with tool calling)
- Qwen2.5-Coder is optimized for code understanding without function tokens

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


spinner.stop();
console.log(chalk.cyan.bold('\n🔍 Contextual Retrieval Preview\n'));

const sampleSize = typeof options.contextualPreview === 'number' ? options.contextualPreview : 10;


NaN validation missing for contextual preview sample size

Low Severity

When a user provides a non-numeric value to --contextual-preview (e.g., --contextual-preview foo), parseInt returns NaN. The validation check typeof options.contextualPreview === 'number' passes because typeof NaN === 'number' is true in JavaScript. This causes sampleSize to be NaN, resulting in confusing output like "Generating context for NaN sample chunks..." and undefined behavior in previewEnrichment.
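Because `typeof NaN === "number"` is true, the guard has to check `Number.isFinite` rather than `typeof`. A sketch of the fix (the helper name and default of 10 are assumptions; the default matches the fallback in the quoted code):

```typescript
// Hypothetical fix: NaN passes a typeof check, so validate with
// Number.isFinite and fall back to a sane default sample size.
function resolveSampleSize(raw: unknown, fallback = 10): number {
  const n = typeof raw === "number" ? raw : Number.parseInt(String(raw), 10);
  return Number.isFinite(n) && n > 0 ? Math.floor(n) : fallback;
}
```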



const contextualRetrieval = new ContextualRetrieval({
enabled: true,
useLocal: options.contextualLocal || options.fullLocal,
useOllama: options.useOllama,


--full-local flag inconsistently affects contextual retrieval modes

Medium Severity

The --full-local flag behaves differently for contextual retrieval between preview and generation modes. In preview mode (line 375), useLocal is set to options.contextualLocal || options.fullLocal, so --full-local enables local LLM. In generation mode (line 316), contextualLocal only checks options.contextualLocal, ignoring --full-local. A user who tests with --full-local --contextual-preview would see local mode, but --full-local --contextual in actual generation would unexpectedly use the cloud API.
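The inconsistency disappears if both code paths resolve the flags through one shared helper. A sketch under assumptions (the interface and function name are hypothetical; the OR of the two flags is taken from the preview-mode behavior quoted above):

```typescript
interface CliOptions {
  contextual?: boolean;
  contextualLocal?: boolean;
  fullLocal?: boolean;
}

// Hypothetical normalization used by both preview and generation, so
// --full-local implies local contextual retrieval in both paths.
function wantsLocalContext(opts: CliOptions): boolean {
  return Boolean(opts.contextualLocal || opts.fullLocal);
}
```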



The local LLM provider allocates 20 sequences, but generating 70+ pages
exhausted them causing "No sequences left" errors after ~20 pages.

- Add provider reinitialization every 15 pages (before hitting limit)
- Add retry logic for sequence exhaustion errors
- Apply same fix to verification loop that generates missing pages
- Provider is now properly reset to release sequences
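The reinitialization loop described above could look like the following. This is a sketch only: the `Provider` interface, method names, and the reinit-every-15 constant come from the commit message's description, not from the actual source.

```typescript
// Hypothetical sketch of the fix: reset the provider periodically so the
// 20-sequence pool is never exhausted during a 70+ page run.
interface Provider {
  reset(): Promise<void>;
  generate(page: string): Promise<string>;
}

async function generatePages(
  pages: string[],
  provider: Provider,
  reinitEvery = 15, // reset before hitting the 20-sequence limit
): Promise<string[]> {
  const out: string[] = [];
  for (let i = 0; i < pages.length; i++) {
    if (i > 0 && i % reinitEvery === 0) await provider.reset();
    out.push(await provider.generate(pages[i]));
  }
  return out;
}
```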

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


} else if (file.includes('config')) {
type = 'config';
} else if (file.includes('test') || file.includes('spec')) {
type = 'test';


Case-sensitive component type detection misses common patterns

Medium Severity

The component type detection in extractComponents uses case-sensitive string matching (file.includes('controller'), file.includes('service'), etc.) without converting to lowercase first. This means common file naming patterns like UserController.ts, AuthService.ts, or DataRepository.ts won't be detected as their respective component types. This is inconsistent with other methods in the same class like isDocumentationFile which correctly uses fileName.toLowerCase() for comparison.
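The fix is to normalize once before matching, as `isDocumentationFile` already does. A sketch (the standalone function and the exact set of type strings are assumptions based on the patterns named in the issue):

```typescript
// Hypothetical case-insensitive version of the type detection: lowercase
// once, then match, so UserController.ts and AuthService.ts are detected.
function detectComponentType(file: string): string {
  const f = file.toLowerCase();
  if (f.includes("controller")) return "controller";
  if (f.includes("service")) return "service";
  if (f.includes("repository")) return "repository";
  if (f.includes("config")) return "config";
  if (f.includes("test") || f.includes("spec")) return "test";
  return "unknown";
}
```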


const lines: string[] = [];
const components = domain.components.filter(c =>
c.type !== 'unknown' && c.function
);

Component table filter excludes all web/modern components

Medium Severity

The generateComponentTable function filters components with c.type !== 'unknown' && c.function, but the function property is only populated for mainframe components (JCL jobs, BMS screens, copybooks, COBOL programs). Web components like controllers, services, and repositories have function set to undefined in extractComponents, so they're all filtered out. This contradicts the table generation code which already handles missing functions with defaults like program.function || 'Application Logic'.
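A sketch of the direction the issue points in: drop the `c.function` condition from the filter and let the row renderer supply the default, as it already does for programs. The `Component` shape and `tableRows` helper are illustrative, not the project's actual API.

```typescript
interface Component {
  name: string;
  type: string;
  function?: string; // only populated for mainframe components today
}

// Hypothetical fix: filter only on type, and default the missing function
// at render time so web components (controllers, services) keep their rows.
function tableRows(components: Component[]): string[] {
  return components
    .filter((c) => c.type !== "unknown")
    .map((c) => `| ${c.name} | ${c.type} | ${c.function ?? "Application Logic"} |`);
}
```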

