
Enhance wiki generation to achieve DeepWiki-quality output#7

Open
GhostScientist wants to merge 12 commits into master from claude/document-jcl-jobs-qGVa9

Conversation

GhostScientist (Owner) commented Jan 16, 2026

Major improvements to the system prompt for generating architectural wikis:

  • Add hierarchical page structure guidance with domain-based organization
  • Introduce multi-phase "Deep Discovery" process before content generation
  • Add mandatory "Key Components" table format for all feature pages
  • Add "How It Works" sections with numbered workflow walkthroughs
  • Enhance source traceability requirements (2+ refs per concept)
  • Add comprehensive Mermaid diagram requirements by page type:
    • Architecture pages: 3+ diagrams (overview, data flow, integration)
    • Feature pages: 2+ diagrams (component interaction + workflow)
    • Data model pages: ER diagrams
    • Batch processing: job flow diagrams
    • State machine diagrams for stateful workflows
  • Add cross-reference diagram guidance for module interactions
  • Update quality checklist with content, diagram, and navigation sections
  • Add mainframe/COBOL-specific structure templates

This brings the wiki output closer to professional documentation
platforms like DeepWiki with better organization and traceability.


Note

Brings DeepWiki-quality docs, discovery, and new packaging.

  • Add src/discovery module to infer domains/relationships and generate hierarchical wiki plans, index, and component tables
  • Implement contextual retrieval (src/rag/contextual-retrieval.ts) with Claude/Ollama/local modes, caching, stats, and --contextual* CLI flags plus preview flow
  • Overhaul system prompt for hierarchical, traceable, diagram-rich documentation
  • Switch portable package format from .archiwiki to .semantics across CLI, format code, README/CHANGELOG, and logs; rename MCP server from archiwiki to semanticwiki
  • Extend LLM support: add completion family and Qwen2.5-Coder model; wire through provider/types
  • CLI: pass contextual options to generation; update pack/unpack defaults/messages
  • Dependencies: move faiss-node to optional deps; various version bumps

Written by Cursor Bugbot for commit 783eb50.

Update all references to the portable package format:
- Rename .archiwiki extension to .semantics
- Update CLI commands (pack/unpack) descriptions and defaults
- Update MCP server name from 'archiwiki' to 'semanticwiki'
- Update README and CHANGELOG documentation

This aligns the package format name with the project name.

Implements Anthropic's Contextual Retrieval technique to enhance chunk
understanding for better search results.

Features:
- New contextual-retrieval.ts module for context generation
- Support for both Claude API (with prompt caching) and local LLMs (Ollama)
- Context caching to avoid regeneration on subsequent runs
- Fallback to AST metadata when LLM unavailable

New CLI flags:
- --contextual: Enable contextual retrieval (uses Claude API)
- --contextual-local: Use local Ollama for context generation
- --contextual-model: Specify Claude model (default: claude-3-haiku)
- --contextual-ollama-model: Specify Ollama model (default: qwen2.5-coder:7b)

For each code chunk, generates a brief context explaining:
- What file/module the chunk belongs to
- What the specific code does in context
- Relationships to other parts of the codebase

This can reduce retrieval failures by up to 67% when combined with
existing hybrid search (BM25 + vectors) and reranking.

Reference: https://www.anthropic.com/news/contextual-retrieval
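The per-chunk context generation described above can be sketched in TypeScript. This is a minimal illustration, not the actual module code: the `Chunk` shape, function names, and prompt wording are assumptions; only the overall pattern (LLM prompt per chunk, with an AST-metadata fallback) comes from the commit message.

```typescript
// Hypothetical sketch of contextual retrieval's two paths:
// an LLM prompt per chunk, and a metadata fallback when no LLM is available.
interface Chunk {
  filePath: string;
  content: string;
  astSummary?: string; // e.g. "function parseConfig in src/config.ts"
}

// Build the prompt that asks the LLM to situate this chunk in the codebase.
function buildContextPrompt(chunk: Chunk): string {
  return [
    `<document file="${chunk.filePath}">`,
    chunk.content,
    `</document>`,
    "Give a short context situating this chunk within the overall file",
    "and codebase, for use in search retrieval. Answer with context only.",
  ].join("\n");
}

// Fallback when the LLM is unavailable: derive context from AST metadata.
function fallbackContext(chunk: Chunk): string {
  return chunk.astSummary ?? `Code from ${chunk.filePath}`;
}
```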

Adds a third mode for contextual retrieval that uses the bundled
node-llama-cpp inference engine, requiring no external services.

Three modes now available:
1. Claude API (--contextual): Uses Claude with prompt caching
2. Ollama (--contextual-local --use-ollama): Uses external Ollama server
3. Bundled local (--contextual-local): Uses node-llama-cpp, fully offline

The bundled local mode:
- Downloads and caches the model automatically (first run)
- Runs entirely offline after initial setup
- Uses the same model infrastructure as --full-local wiki generation
- Processes chunks sequentially to manage memory (concurrency=1)

Usage examples:
  # Fully local, no external services
  semanticwiki generate -r ./repo --full-local --contextual-local

  # Local with Ollama for faster inference
  semanticwiki generate -r ./repo --contextual-local --use-ollama

  # Cloud API (fastest, costs money)
  semanticwiki generate -r ./repo --contextual
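The three flag combinations above imply a simple resolution order. A hedged sketch (the option names mirror the CLI flags; the function itself is hypothetical, not the shipped code):

```typescript
type ContextualMode = "claude" | "ollama" | "bundled-local" | "disabled";

// Hypothetical flag resolution matching the three documented modes:
// Ollama wins when both local flags are set; --contextual alone means cloud.
function resolveMode(opts: {
  contextual?: boolean;
  contextualLocal?: boolean;
  useOllama?: boolean;
}): ContextualMode {
  if (opts.contextualLocal && opts.useOllama) return "ollama";
  if (opts.contextualLocal) return "bundled-local";
  if (opts.contextual) return "claude";
  return "disabled";
}
```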

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


- Add contextual retrieval properties to WikiGenerationOptions interface
- Fix this.apiKey -> this.config.apiKey reference
- Fix modelFamily type to use only 'gpt-oss' (only supported value)
- Fix LLMProvider.complete() -> LLMProvider.chat() method call

Don't pass localModel to bundled local provider - the qwen model name
was causing 'Model not found' errors. Let it use the default gpt-oss
model selection instead.

- Add 30 unit tests covering configuration, fallback context generation,
  content truncation, cache management, cost estimation, and mode selection
- Move faiss-node to optionalDependencies to fix npm install on systems
  without BLAS libraries (fallback similarity search is used when unavailable)
- Add system prompt for local LLM to guide context generation
- Fix "No sequences left" error by reinitializing provider after 15 chunks
- Detect and recover from sequence exhaustion automatically
- Track empty responses and use fallback context when LLM returns empty
- Add --contextual-preview flag to test enrichment on sample chunks
- Show detailed stats: success count, empty responses, errors, context resets

The preview command allows debugging contextual retrieval before full generation:
  semanticwiki generate -r ./repo --contextual-preview 10 --contextual-local

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


This major enhancement adds intelligent codebase analysis to generate
wiki documentation that mirrors the actual system architecture:

Discovery Module (src/discovery/index.ts):
- Detects project type (mainframe-cobol, web-app, api-service, etc.)
- Analyzes file patterns and content to identify logical domains
- Groups components by business function, not just file type
- Discovers relationships between domains for data flow documentation
- Generates hierarchical wiki structure with sections and subsections

Wiki Generation Integration:
- Runs discovery phase after indexing, before page generation
- Creates pages from discovered domains and sections
- Generates hierarchical index.md with domain-grouped navigation
- Adds relationship pages (Data Flow, Integration Points) for complex systems
- Maps domain categories to meaningful section names

Hierarchical Index Features:
- Project metadata header with technologies and type
- Domain-grouped sections (Core Application, Data Layer, Batch Processing, etc.)
- Quick reference table with all discovered domains
- Proper iconography for different page types
- Generation statistics in footer

This moves the wiki output from flat file-type lists to intelligent
domain-based organization that better represents how the system actually works.

- Add isDocumentationFile() to filter out README, LICENSE, etc. from component lists
- Add generateComponentTable() for DeepWiki-style summary tables (Job|Program|Function)
- Extract JCL program info (IDCAMS, IEBGENER, etc.) and infer job functions
- Add BMS screen and COBOL program function inference
- Integrate component tables into section overview and domain page contexts
- Only include relevant source code files in component listings

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.



console.log(chalk.green(` ✓ ${statusCounts.success} successful`));
if (statusCounts.empty > 0) console.log(chalk.yellow(` ⚠ ${statusCounts.empty} empty responses`));
if (statusCounts.error > 0) console.log(chalk.red(` ✗ ${statusCounts.error} errors`));


Fallback status never displayed in contextual preview

Low Severity

The statusCounts object includes a fallback key, and the previewEnrichment method returns samples with status 'success' | 'empty' | 'fallback' | 'error'. However, the display logic only outputs counts for success, empty, and error — the fallback count is never shown to users. Additionally, individual samples with fallback status are styled with the error icon () and red color since the ternary chain treats any non-success/empty status as an error.
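A sketch of the kind of fix this issue calls for: surface the fallback count and give fallback samples their own icon instead of falling through to the error styling. The function names, the `↩` icon, and the plain-string output (no chalk coloring) are illustrative assumptions.

```typescript
type SampleStatus = "success" | "empty" | "fallback" | "error";

// Hypothetical fix: every status gets an explicit icon, so fallback is no
// longer lumped in with errors by a catch-all ternary.
function formatStatusLine(status: SampleStatus): string {
  const icons: Record<SampleStatus, string> = {
    success: "✓",
    empty: "⚠",
    fallback: "↩",
    error: "✗",
  };
  return `${icons[status]} ${status}`;
}

// Summary now includes the previously-dropped fallback count.
function summarize(counts: Record<SampleStatus, number>): string[] {
  const lines = [`✓ ${counts.success} successful`];
  if (counts.empty > 0) lines.push(`⚠ ${counts.empty} empty responses`);
  if (counts.fallback > 0) lines.push(`↩ ${counts.fallback} fallback contexts`);
  if (counts.error > 0) lines.push(`✗ ${counts.error} errors`);
  return lines;
}
```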



console.log(chalk.gray('Cloning repository...'));
const git = simpleGit.simpleGit();
await git.clone(options.repo, repoDir, ['--depth', '1']);
}


Local paths with '@' incorrectly treated as remote URLs

Medium Severity

The check options.repo.includes('@') used to detect SSH-style Git URLs (like git@github.com:user/repo) is too broad. It incorrectly matches local filesystem paths containing '@' characters (e.g., /home/user@work/project). This causes the code to attempt git clone on what is actually a local path, resulting in either a confusing error if the path isn't a git repository, or unnecessary cloning if it is. The same condition is used for cleanup, potentially attempting to delete the wrong directory.
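One way to tighten the check, sketched under assumptions (the helper name is hypothetical; the regexes are one reasonable reading of "scp-like URL or explicit scheme", not the project's chosen fix):

```typescript
// Hypothetical stricter test: treat a repo string as remote only when it has
// an explicit scheme or scp-like "user@host:path" syntax, so local paths
// containing '@' (e.g. /home/user@work/project) are left alone.
function isRemoteGitUrl(repo: string): boolean {
  if (/^(https?|git|ssh):\/\//.test(repo)) return true;
  // scp-like syntax must start with user@host: — an absolute or relative
  // local path begins with '/' or '.', which the first character class rejects
  return /^[^/@\s]+@[^/@\s]+:/.test(repo);
}
```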



// Clean up temp dir if we cloned
if (options.repo.startsWith('http') || options.repo.includes('@')) {
fsModule.rmSync(repoDir, { recursive: true, force: true });
}


Cloned temp directory not cleaned up on error

Low Severity

When using --contextual-preview with a remote repository, the code clones to /tmp/semanticwiki-preview-* at line 345, but cleanup at lines 426-428 only runs if all intermediate operations succeed. There's no try/finally block, so if glob(), ContextualRetrieval initialization, initialize(), or previewEnrichment() throws an error, the cloned temp directory is never deleted. Repeated failures would leave orphaned directories in /tmp.


GPT-OSS has function-calling tokens (<|call|>, <|return|>) that cause
node-llama-cpp to emit stop tokens immediately, resulting in ~95% empty
responses for simple text completion tasks.

- Add 'completion' model family with Qwen2.5-Coder-1.5B-Instruct (1.68GB)
- Update contextual retrieval to use completion model instead of gpt-oss
- Keep gpt-oss for main wiki generation (works well with tool calling)
- Qwen2.5-Coder is optimized for code understanding without function tokens

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


spinner.stop();
console.log(chalk.cyan.bold('\n🔍 Contextual Retrieval Preview\n'));

const sampleSize = typeof options.contextualPreview === 'number' ? options.contextualPreview : 10;


NaN validation missing for contextual preview sample size

Low Severity

When a user provides a non-numeric value to --contextual-preview (e.g., --contextual-preview foo), parseInt returns NaN. The validation check typeof options.contextualPreview === 'number' passes because typeof NaN === 'number' is true in JavaScript. This causes sampleSize to be NaN, resulting in confusing output like "Generating context for NaN sample chunks..." and undefined behavior in previewEnrichment.
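Because `typeof NaN === "number"` is true, the guard has to check `Number.isFinite` rather than `typeof`. A sketch of the fix (the helper name and default of 10 are assumptions; the default matches the fallback in the quoted code):

```typescript
// Hypothetical fix: NaN passes a typeof check, so validate with
// Number.isFinite and fall back to a sane default sample size.
function resolveSampleSize(raw: unknown, fallback = 10): number {
  const n = typeof raw === "number" ? raw : Number.parseInt(String(raw), 10);
  return Number.isFinite(n) && n > 0 ? Math.floor(n) : fallback;
}
```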



const contextualRetrieval = new ContextualRetrieval({
enabled: true,
useLocal: options.contextualLocal || options.fullLocal,
useOllama: options.useOllama,


--full-local flag inconsistently affects contextual retrieval modes

Medium Severity

The --full-local flag behaves differently for contextual retrieval between preview and generation modes. In preview mode (line 375), useLocal is set to options.contextualLocal || options.fullLocal, so --full-local enables local LLM. In generation mode (line 316), contextualLocal only checks options.contextualLocal, ignoring --full-local. A user who tests with --full-local --contextual-preview would see local mode, but --full-local --contextual in actual generation would unexpectedly use the cloud API.
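The inconsistency disappears if both code paths resolve the flags through one shared helper. A sketch under assumptions (the interface and function name are hypothetical; the OR of the two flags is taken from the preview-mode behavior quoted above):

```typescript
interface CliOptions {
  contextual?: boolean;
  contextualLocal?: boolean;
  fullLocal?: boolean;
}

// Hypothetical normalization used by both preview and generation, so
// --full-local implies local contextual retrieval in both paths.
function wantsLocalContext(opts: CliOptions): boolean {
  return Boolean(opts.contextualLocal || opts.fullLocal);
}
```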



The local LLM provider allocates 20 sequences, but generating 70+ pages
exhausted them causing "No sequences left" errors after ~20 pages.

- Add provider reinitialization every 15 pages (before hitting limit)
- Add retry logic for sequence exhaustion errors
- Apply same fix to verification loop that generates missing pages
- Provider is now properly reset to release sequences
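The reinitialization loop described above could look like the following. This is a sketch only: the `Provider` interface, method names, and the reinit-every-15 constant come from the commit message's description, not from the actual source.

```typescript
// Hypothetical sketch of the fix: reset the provider periodically so the
// 20-sequence pool is never exhausted during a 70+ page run.
interface Provider {
  reset(): Promise<void>;
  generate(page: string): Promise<string>;
}

async function generatePages(
  pages: string[],
  provider: Provider,
  reinitEvery = 15, // reset before hitting the 20-sequence limit
): Promise<string[]> {
  const out: string[] = [];
  for (let i = 0; i < pages.length; i++) {
    if (i > 0 && i % reinitEvery === 0) await provider.reset();
    out.push(await provider.generate(pages[i]));
  }
  return out;
}
```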

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


} else if (file.includes('config')) {
type = 'config';
} else if (file.includes('test') || file.includes('spec')) {
type = 'test';


Case-sensitive component type detection misses common patterns

Medium Severity

The component type detection in extractComponents uses case-sensitive string matching (file.includes('controller'), file.includes('service'), etc.) without converting to lowercase first. This means common file naming patterns like UserController.ts, AuthService.ts, or DataRepository.ts won't be detected as their respective component types. This is inconsistent with other methods in the same class like isDocumentationFile which correctly uses fileName.toLowerCase() for comparison.
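The fix is to normalize once before matching, as `isDocumentationFile` already does. A sketch (the standalone function and the exact set of type strings are assumptions based on the patterns named in the issue):

```typescript
// Hypothetical case-insensitive version of the type detection: lowercase
// once, then match, so UserController.ts and AuthService.ts are detected.
function detectComponentType(file: string): string {
  const f = file.toLowerCase();
  if (f.includes("controller")) return "controller";
  if (f.includes("service")) return "service";
  if (f.includes("repository")) return "repository";
  if (f.includes("config")) return "config";
  if (f.includes("test") || f.includes("spec")) return "test";
  return "unknown";
}
```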


const lines: string[] = [];
const components = domain.components.filter(c =>
c.type !== 'unknown' && c.function
);

Component table filter excludes all web/modern components

Medium Severity

The generateComponentTable function filters components with c.type !== 'unknown' && c.function, but the function property is only populated for mainframe components (JCL jobs, BMS screens, copybooks, COBOL programs). Web components like controllers, services, and repositories have function set to undefined in extractComponents, so they're all filtered out. This contradicts the table generation code which already handles missing functions with defaults like program.function || 'Application Logic'.
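A sketch of the direction the issue points in: drop the `c.function` condition from the filter and let the row renderer supply the default, as it already does for programs. The `Component` shape and `tableRows` helper are illustrative, not the project's actual API.

```typescript
interface Component {
  name: string;
  type: string;
  function?: string; // only populated for mainframe components today
}

// Hypothetical fix: filter only on type, and default the missing function
// at render time so web components (controllers, services) keep their rows.
function tableRows(components: Component[]): string[] {
  return components
    .filter((c) => c.type !== "unknown")
    .map((c) => `| ${c.name} | ${c.type} | ${c.function ?? "Application Logic"} |`);
}
```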

