Enhance wiki generation to achieve DeepWiki-quality output #7
GhostScientist wants to merge 12 commits into master
Conversation
Major improvements to the system prompt for generating architectural wikis:
- Add hierarchical page structure guidance with domain-based organization
- Introduce a multi-phase "Deep Discovery" process before content generation
- Add a mandatory "Key Components" table format for all feature pages
- Add "How It Works" sections with numbered workflow walkthroughs
- Strengthen source-traceability requirements (2+ refs per concept)
- Add comprehensive Mermaid diagram requirements by page type:
  - Architecture pages: 3+ diagrams (overview, data flow, integration)
  - Feature pages: 2+ diagrams (component interaction + workflow)
  - Data model pages: ER diagrams
  - Batch processing: job flow diagrams
  - State machine diagrams for stateful workflows
- Add cross-reference diagram guidance for module interactions
- Update the quality checklist with content, diagram, and navigation sections
- Add mainframe/COBOL-specific structure templates

This brings the wiki output closer to professional documentation platforms like DeepWiki, with better organization and traceability.
Update all references to the portable package format:
- Rename the .archiwiki extension to .semantics
- Update CLI command (pack/unpack) descriptions and defaults
- Update the MCP server name from 'archiwiki' to 'semanticwiki'
- Update README and CHANGELOG documentation

This aligns the package format name with the project name.
Implements Anthropic's Contextual Retrieval technique to enrich chunk understanding for better search results.

Features:
- New contextual-retrieval.ts module for context generation
- Support for both the Claude API (with prompt caching) and local LLMs (Ollama)
- Context caching to avoid regeneration on subsequent runs
- Fallback to AST metadata when no LLM is available

New CLI flags:
- --contextual: Enable contextual retrieval (uses the Claude API)
- --contextual-local: Use local Ollama for context generation
- --contextual-model: Specify the Claude model (default: claude-3-haiku)
- --contextual-ollama-model: Specify the Ollama model (default: qwen2.5-coder:7b)

For each code chunk, generates a brief context explaining:
- What file/module the chunk belongs to
- What the specific code does in context
- Relationships to other parts of the codebase

Combined with the existing hybrid search (BM25 + vectors) and reranking, this can reduce retrieval failures by up to 67%.

Reference: https://www.anthropic.com/news/contextual-retrieval
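A minimal sketch of the prompt shape Anthropic's post describes: the whole document plus one chunk, asking the model for a short situating context that is then prepended to the chunk before indexing. The function names here are illustrative, not the module's actual API.

```typescript
// Sketch of a contextual-retrieval prompt (per Anthropic's technique).
// buildContextPrompt/enrichChunk are hypothetical names for illustration.
function buildContextPrompt(wholeDocument: string, chunk: string): string {
  return [
    '<document>',
    wholeDocument,
    '</document>',
    'Here is the chunk we want to situate within the whole document:',
    '<chunk>',
    chunk,
    '</chunk>',
    'Give a short, succinct context to situate this chunk within the overall',
    'document for the purposes of improving search retrieval of the chunk.',
    'Answer only with the succinct context and nothing else.',
  ].join('\n');
}

// The generated context is prepended to the chunk before embedding/BM25 indexing.
function enrichChunk(chunk: string, context: string): string {
  return `${context}\n\n${chunk}`;
}
```

With prompt caching, the `<document>` prefix is cached across chunks of the same file, which is what keeps the per-chunk cost low.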
Adds a third mode for contextual retrieval that uses the bundled node-llama-cpp inference engine, requiring no external services.

Three modes are now available:
1. Claude API (--contextual): Uses Claude with prompt caching
2. Ollama (--contextual-local --use-ollama): Uses an external Ollama server
3. Bundled local (--contextual-local): Uses node-llama-cpp, fully offline

The bundled local mode:
- Downloads and caches the model automatically (first run)
- Runs entirely offline after initial setup
- Uses the same model infrastructure as --full-local wiki generation
- Processes chunks sequentially to manage memory (concurrency=1)

Usage examples:

# Fully local, no external services
semanticwiki generate -r ./repo --full-local --contextual-local

# Local with Ollama for faster inference
semanticwiki generate -r ./repo --contextual-local --use-ollama

# Cloud API (fastest, costs money)
semanticwiki generate -r ./repo --contextual
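The three modes above can be sketched as a single flag-resolution function; the flag names follow the commit message, while the function and type names are hypothetical.

```typescript
// Hypothetical sketch of how the three contextual-retrieval modes
// might be resolved from the CLI flags described above.
type ContextMode = 'claude-api' | 'ollama' | 'bundled-local' | 'disabled';

interface ContextFlags {
  contextual?: boolean;      // --contextual
  contextualLocal?: boolean; // --contextual-local
  useOllama?: boolean;       // --use-ollama
}

function resolveContextMode(flags: ContextFlags): ContextMode {
  if (flags.contextualLocal && flags.useOllama) return 'ollama';
  if (flags.contextualLocal) return 'bundled-local';
  if (flags.contextual) return 'claude-api';
  return 'disabled';
}
```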
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
- Add contextual retrieval properties to the WikiGenerationOptions interface
- Fix this.apiKey -> this.config.apiKey reference
- Fix the modelFamily type to use only 'gpt-oss' (the only supported value)
- Fix LLMProvider.complete() -> LLMProvider.chat() method call
Don't pass localModel to the bundled local provider: the qwen model name was causing 'Model not found' errors. Let it use the default gpt-oss model selection instead.
- Add 30 unit tests covering configuration, fallback context generation, content truncation, cache management, cost estimation, and mode selection
- Move faiss-node to optionalDependencies to fix npm install on systems without BLAS libraries (fallback similarity search is used when unavailable)
- Add a system prompt for the local LLM to guide context generation
- Fix the "No sequences left" error by reinitializing the provider after 15 chunks
- Detect and recover from sequence exhaustion automatically
- Track empty responses and use fallback context when the LLM returns empty
- Add a --contextual-preview flag to test enrichment on sample chunks
- Show detailed stats: success count, empty responses, errors, context resets

The preview command allows debugging contextual retrieval before full generation:

semanticwiki generate -r ./repo --contextual-preview 10 --contextual-local
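The reinit-every-15-chunks workaround described above can be sketched as follows; `resetProvider` is a stand-in for tearing down and recreating the node-llama-cpp provider to release its sequences, and the function name is hypothetical.

```typescript
// Sketch of pre-emptive provider resets to avoid "No sequences left":
// reinitialize the provider every `resetEvery` chunks, before the
// sequence pool is exhausted.
async function enrichChunks(
  chunks: string[],
  generate: (chunk: string) => Promise<string>,
  resetProvider: () => Promise<void>,
  resetEvery = 15
): Promise<string[]> {
  const out: string[] = [];
  for (let i = 0; i < chunks.length; i++) {
    // Reset before the limit is hit, not after the error occurs.
    if (i > 0 && i % resetEvery === 0) await resetProvider();
    out.push(await generate(chunks[i]));
  }
  return out;
}
```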
Cursor Bugbot has reviewed your changes and found 1 potential issue.
This major enhancement adds intelligent codebase analysis to generate wiki documentation that mirrors the actual system architecture.

Discovery Module (src/discovery/index.ts):
- Detects the project type (mainframe-cobol, web-app, api-service, etc.)
- Analyzes file patterns and content to identify logical domains
- Groups components by business function, not just file type
- Discovers relationships between domains for data flow documentation
- Generates a hierarchical wiki structure with sections and subsections

Wiki Generation Integration:
- Runs the discovery phase after indexing, before page generation
- Creates pages from discovered domains and sections
- Generates a hierarchical index.md with domain-grouped navigation
- Adds relationship pages (Data Flow, Integration Points) for complex systems
- Maps domain categories to meaningful section names

Hierarchical Index Features:
- Project metadata header with technologies and type
- Domain-grouped sections (Core Application, Data Layer, Batch Processing, etc.)
- Quick-reference table with all discovered domains
- Proper iconography for different page types
- Generation statistics in the footer

This moves the wiki output from flat file-type lists to intelligent domain-based organization that better represents how the system actually works.
- Add isDocumentationFile() to filter out README, LICENSE, etc. from component lists
- Add generateComponentTable() for DeepWiki-style summary tables (Job | Program | Function)
- Extract JCL program info (IDCAMS, IEBGENER, etc.) and infer job functions
- Add BMS screen and COBOL program function inference
- Integrate component tables into section overview and domain page contexts
- Only include relevant source code files in component listings
Cursor Bugbot has reviewed your changes and found 3 potential issues.
console.log(chalk.green(`  ✓ ${statusCounts.success} successful`));
if (statusCounts.empty > 0) console.log(chalk.yellow(`  ⚠ ${statusCounts.empty} empty responses`));
if (statusCounts.error > 0) console.log(chalk.red(`  ✗ ${statusCounts.error} errors`));
Fallback status never displayed in contextual preview
Low Severity
The statusCounts object includes a fallback key, and the previewEnrichment method returns samples with status 'success' | 'empty' | 'fallback' | 'error'. However, the display logic only outputs counts for success, empty, and error — the fallback count is never shown to users. Additionally, individual samples with fallback status are styled with the error icon (✗) and red color since the ternary chain treats any non-success/empty status as an error.
Additional Locations (1)
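One way to address this is to make the status-to-line mapping exhaustive so TypeScript flags any status that is never displayed. This is a sketch, not the PR's fix; `statusLine` is a hypothetical helper, and the chalk coloring is omitted for brevity.

```typescript
// Sketch: an exhaustive switch over the status union guarantees that
// adding a status (like 'fallback') without a display branch is a
// compile-time error rather than a silently hidden count.
type ChunkStatus = 'success' | 'empty' | 'fallback' | 'error';

function statusLine(status: ChunkStatus, count: number): string {
  switch (status) {
    case 'success':  return `✓ ${count} successful`;
    case 'empty':    return `⚠ ${count} empty responses`;
    case 'fallback': return `↻ ${count} fallback contexts`; // previously never shown
    case 'error':    return `✗ ${count} errors`;
  }
}
```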
console.log(chalk.gray('Cloning repository...'));
const git = simpleGit.simpleGit();
await git.clone(options.repo, repoDir, ['--depth', '1']);
}
Local paths with '@' incorrectly treated as remote URLs
Medium Severity
The check options.repo.includes('@') used to detect SSH-style Git URLs (like git@github.com:user/repo) is too broad. It incorrectly matches local filesystem paths containing '@' characters (e.g., /home/user@work/project). This causes the code to attempt git clone on what is actually a local path, resulting in either a confusing error if the path isn't a git repository, or unnecessary cloning if it is. The same condition is used for cleanup, potentially attempting to delete the wrong directory.
Additional Locations (1)
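A stricter remote-URL test avoids this false positive: scp-like SSH URLs have the shape `user@host:path`, so the check can require a `:` after the `@` and a non-slash start, instead of matching any path that merely contains `@`. This is a sketch of one possible fix, with a hypothetical function name.

```typescript
// Sketch: detect remote Git repos without misclassifying local paths
// that contain '@' (e.g. /home/user@work/project).
function isRemoteRepo(repo: string): boolean {
  // Explicit scheme-prefixed URLs.
  if (/^(https?|git|ssh):\/\//.test(repo)) return true;
  // scp-like syntax, e.g. git@github.com:user/repo.git:
  // non-slash prefix, '@', host, then a ':' separator.
  return /^[^\/]+@[^\/:]+:/.test(repo);
}
```

Using the same predicate for both the clone decision and the cleanup decision also removes the risk of the two checks drifting apart.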
// Clean up temp dir if we cloned
if (options.repo.startsWith('http') || options.repo.includes('@')) {
  fsModule.rmSync(repoDir, { recursive: true, force: true });
}
Cloned temp directory not cleaned up on error
Low Severity
When using --contextual-preview with a remote repository, the code clones to /tmp/semanticwiki-preview-* at line 345, but cleanup at lines 426-428 only runs if all intermediate operations succeed. There's no try/finally block, so if glob(), ContextualRetrieval initialization, initialize(), or previewEnrichment() throws an error, the cloned temp directory is never deleted. Repeated failures would leave orphaned directories in /tmp.
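The missing try/finally can be sketched as a small wrapper; `runPreview` stands in for the glob/initialize/previewEnrichment pipeline described above, and the cleanup callback is injectable here only to keep the sketch testable.

```typescript
import * as fs from 'node:fs';

// Sketch: guarantee the cloned temp dir is removed even when a later
// step throws, by putting cleanup in a finally block.
async function withClonedRepo<T>(
  repoDir: string,
  cloned: boolean,
  runPreview: () => Promise<T>,
  removeDir: (dir: string) => void = (dir) =>
    fs.rmSync(dir, { recursive: true, force: true })
): Promise<T> {
  try {
    return await runPreview();
  } finally {
    // Runs on both success and error paths.
    if (cloned) removeDir(repoDir);
  }
}
```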
GPT-OSS has function-calling tokens (<|call|>, <|return|>) that cause node-llama-cpp to emit stop tokens immediately, resulting in ~95% empty responses for simple text completion tasks.

- Add a 'completion' model family with Qwen2.5-Coder-1.5B-Instruct (1.68 GB)
- Update contextual retrieval to use the completion model instead of gpt-oss
- Keep gpt-oss for main wiki generation (it works well with tool calling)
- Qwen2.5-Coder is optimized for code understanding without function tokens
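The task-to-family routing described above can be sketched in a few lines; the family and task names follow the commit message, but the function itself is illustrative, not the project's actual API.

```typescript
// Sketch: route tool-calling wiki generation to gpt-oss, and plain-text
// context generation to the completion-tuned family, which has no
// function-calling tokens to trip node-llama-cpp's stop detection.
type ModelFamily = 'gpt-oss' | 'completion';
type Task = 'wiki-generation' | 'contextual-retrieval';

function familyForTask(task: Task): ModelFamily {
  return task === 'contextual-retrieval' ? 'completion' : 'gpt-oss';
}
```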
Cursor Bugbot has reviewed your changes and found 2 potential issues.
spinner.stop();
console.log(chalk.cyan.bold('\n🔍 Contextual Retrieval Preview\n'));
const sampleSize = typeof options.contextualPreview === 'number' ? options.contextualPreview : 10;
NaN validation missing for contextual preview sample size
Low Severity
When a user provides a non-numeric value to --contextual-preview (e.g., --contextual-preview foo), parseInt returns NaN. The validation check typeof options.contextualPreview === 'number' passes because typeof NaN === 'number' is true in JavaScript. This causes sampleSize to be NaN, resulting in confusing output like "Generating context for NaN sample chunks..." and undefined behavior in previewEnrichment.
Additional Locations (1)
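Because `typeof NaN === 'number'`, the fix is to validate with `Number.isFinite` rather than `typeof`. A sketch of one NaN-safe parser (hypothetical function name):

```typescript
// Sketch: parse --contextual-preview safely, falling back to a default
// when the value is non-numeric, NaN, or non-positive.
function parseSampleSize(raw: unknown, fallback = 10): number {
  const n = typeof raw === 'number' ? raw : parseInt(String(raw), 10);
  // Number.isFinite rejects NaN and Infinity, unlike the typeof check.
  return Number.isFinite(n) && n > 0 ? Math.floor(n) : fallback;
}
```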
const contextualRetrieval = new ContextualRetrieval({
  enabled: true,
  useLocal: options.contextualLocal || options.fullLocal,
  useOllama: options.useOllama,
--full-local flag inconsistently affects contextual retrieval modes
Medium Severity
The --full-local flag behaves differently for contextual retrieval between preview and generation modes. In preview mode (line 375), useLocal is set to options.contextualLocal || options.fullLocal, so --full-local enables local LLM. In generation mode (line 316), contextualLocal only checks options.contextualLocal, ignoring --full-local. A user who tests with --full-local --contextual-preview would see local mode, but --full-local --contextual in actual generation would unexpectedly use the cloud API.
Additional Locations (1)
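The usual remedy for this kind of drift is to compute the flag combination once and share it between both call sites. A sketch, with hypothetical names:

```typescript
// Sketch: derive useLocal in one place so preview and generation
// cannot disagree about what --full-local implies.
interface CliOptions {
  contextualLocal?: boolean; // --contextual-local
  fullLocal?: boolean;       // --full-local
}

const useLocalContext = (o: CliOptions): boolean =>
  Boolean(o.contextualLocal || o.fullLocal);
```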
The local LLM provider allocates 20 sequences, but generating 70+ pages exhausted them, causing "No sequences left" errors after ~20 pages.

- Add provider reinitialization every 15 pages (before hitting the limit)
- Add retry logic for sequence-exhaustion errors
- Apply the same fix to the verification loop that generates missing pages
- The provider is now properly reset to release sequences
Cursor Bugbot has reviewed your changes and found 2 potential issues.
} else if (file.includes('config')) {
  type = 'config';
} else if (file.includes('test') || file.includes('spec')) {
  type = 'test';
Case-sensitive component type detection misses common patterns
Medium Severity
The component type detection in extractComponents uses case-sensitive string matching (file.includes('controller'), file.includes('service'), etc.) without converting to lowercase first. This means common file naming patterns like UserController.ts, AuthService.ts, or DataRepository.ts won't be detected as their respective component types. This is inconsistent with other methods in the same class like isDocumentationFile which correctly uses fileName.toLowerCase() for comparison.
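The fix is to lowercase once before matching, mirroring what `isDocumentationFile` already does. A sketch (the function name is illustrative; the type keywords follow the issue description):

```typescript
// Sketch: case-insensitive component-type detection, so
// UserController.ts and AuthService.ts are classified correctly.
function detectComponentType(file: string): string {
  const f = file.toLowerCase();
  if (f.includes('controller')) return 'controller';
  if (f.includes('service')) return 'service';
  if (f.includes('repository')) return 'repository';
  if (f.includes('config')) return 'config';
  if (f.includes('test') || f.includes('spec')) return 'test';
  return 'unknown';
}
```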
const lines: string[] = [];
const components = domain.components.filter(c =>
  c.type !== 'unknown' && c.function
);
Component table filter excludes all web/modern components
Medium Severity
The generateComponentTable function filters components with c.type !== 'unknown' && c.function, but the function property is only populated for mainframe components (JCL jobs, BMS screens, copybooks, COBOL programs). Web components like controllers, services, and repositories have function set to undefined in extractComponents, so they're all filtered out. This contradicts the table generation code which already handles missing functions with defaults like program.function || 'Application Logic'.
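Since the table code already supplies defaults for missing functions, the filter only needs to drop truly unknown components. A sketch of that behavior (hypothetical names; the 'Application Logic' default follows the issue description):

```typescript
// Sketch: keep components without a populated `function` field and let
// the row builder fall back to a default label, so web components
// (controllers, services, repositories) are not filtered out.
interface Component {
  type: string;
  function?: string;
}

function tableRows(components: Component[]): Array<[string, string]> {
  return components
    .filter((c) => c.type !== 'unknown') // drop only unclassified files
    .map((c) => [c.type, c.function ?? 'Application Logic']);
}
```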
Note
Brings DeepWiki-quality docs, discovery, and new packaging.
- src/discovery module to infer domains/relationships and generate hierarchical wiki plans, index, and component tables
- Contextual retrieval (src/rag/contextual-retrieval.ts) with Claude/Ollama/local modes, caching, stats, and --contextual* CLI flags plus a preview flow
- Rename .archiwiki to .semantics across CLI, format code, README/CHANGELOG, and logs; rename the MCP server from archiwiki to semanticwiki
- Add a completion model family and Qwen2.5-Coder model; wire through provider/types
- Move faiss-node to optional deps; various version bumps

Written by Cursor Bugbot for commit 783eb50. This will update automatically on new commits.