An offline-first PDF and image text extraction tool built with SvelteKit.
- Extraction of selectable text layers from PDFs.
- OCR for scanned pages (image-only PDFs) using Tesseract.js.
- Support for PNG, JPEG, and WebP.
- On-device storage for document management (import, list, delete).
- Search across your entire "local library" (titles and extracted text).
- Best-effort detection of tables (CSV export) and AcroForm fields.
Concurrency Model
- Main Thread: Manages UI state and user interactions.
- Worker Threads: Dedicated workers for PDF parsing (PDF.js) and OCR (Tesseract).
- Orchestration: A central orchestrator sequences tasks, manages cancellation, and reports progress to the main thread.
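The sequencing, cancellation, and progress reporting described above can be sketched roughly as follows. This is a minimal illustration, not the project's orchestrator: the `Task` and `Progress` shapes and the `runSequentially` function are hypothetical names, and each task's `run` stands in for a round trip to a PDF.js or Tesseract worker.

```typescript
// Hypothetical shapes for illustration; none of these names come from the project.
type Progress = { step: string; done: number; total: number };

type Task = {
  name: string;
  // In a real app this would post a message to a worker and await its reply.
  run: (signal: AbortSignal) => Promise<void>;
};

// Runs tasks strictly in order. Cancellation is checked before each task
// starts, and progress is reported after each task finishes.
async function runSequentially(
  tasks: Task[],
  signal: AbortSignal,
  onProgress: (p: Progress) => void,
): Promise<"completed" | "cancelled"> {
  let done = 0;
  for (const task of tasks) {
    if (signal.aborted) return "cancelled";
    await task.run(signal);
    done += 1;
    onProgress({ step: task.name, done, total: tasks.length });
  }
  return "completed";
}
```

A caller would typically hold an `AbortController`, pass `controller.signal` in, and call `controller.abort()` when the user cancels. In this sketch a task that is already running finishes before the cancellation takes effect; a task can also inspect the signal itself to stop earlier.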
Offline Strategy
- PWA: Service worker caching for the application shell and OCR language data.
- Local Persistence: All document metadata, original blobs, extraction results, and derived artifacts are stored in IndexedDB.
- Search Index: A local-only index built from extracted content, rebuildable on startup.
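As an illustration of a local-only index that can be rebuilt from stored content, here is a small inverted-index sketch. The `IndexedDoc` shape, the tokenizer, and the AND-style matching are assumptions made for the example; the project's actual index may work differently.

```typescript
// Hypothetical record shape: one entry per document, with its title and extracted text.
type IndexedDoc = { id: string; title: string; text: string };

// Lowercases and splits on anything that is not an ASCII letter or digit.
// (ASCII-only for brevity; a real tokenizer would need to handle other scripts.)
function tokenize(input: string): string[] {
  return input.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Builds an inverted index: token -> set of ids of documents containing it.
// Because it only reads stored records, it can be rebuilt from scratch at startup.
function buildIndex(docs: IndexedDoc[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const token of tokenize(`${doc.title} ${doc.text}`)) {
      let ids = index.get(token);
      if (!ids) {
        ids = new Set<string>();
        index.set(token, ids);
      }
      ids.add(doc.id);
    }
  }
  return index;
}

// Returns ids of documents that contain every query token, sorted for stable output.
function search(index: Map<string, Set<string>>, query: string): string[] {
  const tokens = tokenize(query);
  if (tokens.length === 0) return [];
  let result: string[] | null = null;
  for (const token of tokens) {
    const ids = index.get(token) ?? new Set<string>();
    result = result === null ? Array.from(ids) : result.filter((id) => ids.has(id));
  }
  return (result ?? []).sort();
}
```

Calling `buildIndex` over all stored pages and titles on startup and keeping the resulting map in memory is one simple way to get the "rebuildable" property without persisting the index itself.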
Data Model
- documents: Metadata including kind (pdf/image), name, size, and contentHash.
- blobs: Original file content associated with a document ID.
- runs: Execution history for extraction modes (pdf_text, ocr_image, ocr_pdf).
- pages: Extracted text and stats per page index.
- tables: Structured row/column data and confidence scores.
- form_fields: AcroForm metadata (name, value, type).
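One way to picture these stores is as TypeScript record types plus a schema map. The field names `kind`, `name`, `size`, `contentHash`, and the run modes come from the list above; everything else (the `id`, `documentId`, `runId`, `pageIndex`, and `charCount` fields, and the choice of indexes) is an assumption for illustration. The schema strings follow Dexie's index syntax, where `++` marks an auto-incremented primary key, `&` a unique index, and `[a+b]` a compound key, and would be passed to `db.version(1).stores(...)`.

```typescript
type DocumentKind = "pdf" | "image";
type RunMode = "pdf_text" | "ocr_image" | "ocr_pdf";

interface DocumentRecord {
  id: string; // assumed primary key
  kind: DocumentKind;
  name: string;
  size: number; // bytes
  contentHash: string; // SHA-256 hex digest of the original file
}

interface BlobRecord {
  documentId: string;
  data: ArrayBuffer; // a browser build could store a Blob here instead
}

interface RunRecord {
  id: string;
  documentId: string;
  mode: RunMode;
}

interface PageRecord {
  runId: string;
  pageIndex: number;
  text: string;
  charCount: number; // example of a per-page stat (assumed)
}

interface TableRecord {
  id?: number;
  runId: string;
  rows: string[][];
  confidence: number;
}

interface FormFieldRecord {
  id?: number;
  runId: string;
  name: string;
  value: string;
  type: string;
}

// Hypothetical index definitions, one entry per store.
const schema = {
  documents: "id, &contentHash, name",
  blobs: "documentId",
  runs: "id, documentId, mode",
  pages: "[runId+pageIndex]",
  tables: "++id, runId",
  form_fields: "++id, runId, name",
} as const;
```

Keeping blobs in their own store means listing or searching documents does not need to load file contents.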
- Framework: SvelteKit in SPA mode.
- Parsing: PDF.js for document rendering and text extraction.
- OCR: Tesseract.js (WASM) for client-side optical character recognition.
- Storage: Dexie.js for IndexedDB management, providing schema versioning and migrations.
- Styling: Tailwind CSS v4.
- Local-First: All processing happens on your device. No data is ever uploaded to a server.
- Deterministic: Extraction logic (table clustering, text normalization) is designed to produce consistent results for identical inputs.
- Deduplication: Files are identified by SHA-256 hashes to prevent redundant storage and extraction runs.
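To make the determinism goal concrete, here is a sketch of row clustering for table detection that returns the same rows regardless of the order in which text fragments arrive. It is a simplified stand-in and not the project's algorithm: the `TextItem` shape, the fixed `tolerance`, and the `clusterRows` and `toCsv` names are all assumptions.

```typescript
// Hypothetical input: a positioned text fragment from a PDF text layer or OCR output.
type TextItem = { text: string; x: number; y: number };

// Total ordering on items, so sorting never depends on input order.
function compareItems(a: TextItem, b: TextItem): number {
  if (a.y !== b.y) return a.y - b.y;
  if (a.x !== b.x) return a.x - b.x;
  return a.text < b.text ? -1 : a.text > b.text ? 1 : 0;
}

// Groups items into rows: an item joins the current row when its y is within
// `tolerance` of the row's first item; otherwise it starts a new row.
function clusterRows(items: TextItem[], tolerance: number): string[][] {
  const sorted = items.slice().sort(compareItems);
  const rows: { y: number; items: TextItem[] }[] = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(item.y - last.y) <= tolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  // Within each row, order cells left to right (ties broken by text).
  return rows.map((row) =>
    row.items
      .slice()
      .sort((a, b) => (a.x !== b.x ? a.x - b.x : compareItems(a, b)))
      .map((i) => i.text),
  );
}

// Serializes rows as CSV, quoting fields that contain commas, quotes, or newlines.
function toCsv(rows: string[][]): string {
  const escape = (field: string): string =>
    /[",\n\r]/.test(field) ? `"${field.replace(/"/g, '""')}"` : field;
  return rows.map((r) => r.map(escape).join(",")).join("\n");
}
```

Sorting a copy of the input by a total order before clustering is what removes any dependence on arrival order, which matters when fragments come back from workers in varying sequence.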