stormlightlabs/garuda

Garuda

An offline-first text extraction tool for PDFs and images, built with SvelteKit.

Features

  • Extraction of selectable text layers from PDFs.
  • OCR for scanned pages (image-only PDFs) using Tesseract.js.
  • Support for PNG, JPEG, and WebP.
  • On-device storage for document management (import, list, delete).
  • Full-text search across your local library (titles and extracted text).
  • Best-effort detection of tables (CSV export) and AcroForm fields.

Architecture

Concurrency Model

  • Main Thread: Manages UI state and user interactions.
  • Worker Threads: Dedicated workers for PDF parsing (PDF.js) and OCR (Tesseract).
  • Orchestration: A central orchestrator sequences tasks, manages cancellation, and reports progress to the main thread.
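The orchestration role described above can be sketched roughly as follows. This is a minimal illustration, not Garuda's actual code: `Task`, `Progress`, and `runPipeline` are hypothetical names, and each task's `run` stands in for posting a message to a PDF.js or Tesseract worker and awaiting its reply.

```typescript
// Hypothetical types for illustration; not taken from the Garuda codebase.
type Progress = { task: string; completed: number; total: number };

interface Task<T> {
  name: string;
  // In the real app this would delegate to a dedicated worker thread.
  run(signal: AbortSignal): Promise<T>;
}

// Runs tasks strictly in sequence, reporting progress after each one and
// stopping as soon as the caller has requested cancellation.
async function runPipeline<T>(
  tasks: Task<T>[],
  onProgress: (p: Progress) => void,
  signal: AbortSignal,
): Promise<T[]> {
  const results: T[] = [];
  let completed = 0;
  for (const task of tasks) {
    // Check for cancellation before starting the next task.
    if (signal.aborted) throw new Error("cancelled");
    results.push(await task.run(signal));
    completed += 1;
    onProgress({ task: task.name, completed, total: tasks.length });
  }
  return results;
}
```

The main thread would hold the `AbortController`, call `abort()` when the user cancels, and update UI state from the `onProgress` callback.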

Offline Strategy

  • PWA: Service worker caching for the application shell and OCR language data.
  • Local Persistence: All document metadata, original blobs, extraction results, and derived artifacts are stored in IndexedDB.
  • Search Index: A local-only index built from extracted content, rebuildable on startup.
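One way a rebuildable, local-only search index could work is an in-memory inverted index rebuilt from stored text. The sketch below is an assumption about the approach, not the project's implementation: `IndexEntry`, `buildIndex`, and `search` are hypothetical names, and the tokenizer is deliberately simplified to ASCII letters and digits.

```typescript
// Hypothetical shapes; the real schema and tokenizer may differ.
interface IndexEntry {
  documentId: string;
  text: string; // a document title or the extracted text of one page
}

type SearchIndex = Map<string, Set<string>>; // token -> document IDs

// Simplified tokenizer: lowercases and splits on anything that is not a-z or 0-9.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Rebuilds the whole index from scratch, e.g. on application startup.
function buildIndex(entries: IndexEntry[]): SearchIndex {
  const index: SearchIndex = new Map();
  for (const entry of entries) {
    for (const token of tokenize(entry.text)) {
      let ids = index.get(token);
      if (!ids) {
        ids = new Set<string>();
        index.set(token, ids);
      }
      ids.add(entry.documentId);
    }
  }
  return index;
}

// Returns the sorted IDs of documents that contain every query token.
function search(index: SearchIndex, query: string): string[] {
  const sets = tokenize(query).map((t) => index.get(t) ?? new Set<string>());
  const candidates = new Set<string>();
  for (const s of sets) s.forEach((id) => candidates.add(id));
  return Array.from(candidates)
    .filter((id) => sets.every((s) => s.has(id)))
    .sort();
}
```

Because the index is derived entirely from persisted content, it does not need to be stored itself; discarding and rebuilding it is always safe.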

Data Model

Document Persistence (IndexedDB)

  • documents: Metadata including kind (pdf/image), name, size, and contentHash.
  • blobs: Original file content associated with a document ID.
  • runs: Execution history for extraction modes (pdf_text, ocr_image, ocr_pdf).
  • pages: Extracted text and stats per page index.
  • tables: Structured row/column data and confidence scores.
  • form_fields: AcroForm metadata (name, value, type).
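The stores listed above might map onto record types along these lines. Only the fields named in the list come from this README; the key and foreign-key names (`id`, `documentId`, `runId`) and the `fullText` helper are assumptions added for illustration.

```typescript
type DocumentKind = "pdf" | "image";
type ExtractionMode = "pdf_text" | "ocr_image" | "ocr_pdf";

// Key fields such as id, documentId, and runId are assumed, not documented.
interface DocumentRecord {
  id: string;
  kind: DocumentKind;
  name: string;
  size: number;
  contentHash: string;
}

interface BlobRecord {
  documentId: string;
  data: ArrayBuffer; // the app may well store a Blob instead
}

interface RunRecord {
  id: string;
  documentId: string;
  mode: ExtractionMode;
}

interface PageRecord {
  runId: string;
  pageIndex: number;
  text: string;
}

interface TableRecord {
  runId: string;
  pageIndex: number;
  rows: string[][];
  confidence: number;
}

interface FormFieldRecord {
  runId: string;
  name: string;
  value: string;
  type: string;
}

// Hypothetical helper: joins the pages of one run in page order.
function fullText(pages: PageRecord[], runId: string): string {
  return pages
    .filter((p) => p.runId === runId)
    .sort((a, b) => a.pageIndex - b.pageIndex)
    .map((p) => p.text)
    .join("\n\n");
}
```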

Tech Stack

  • Framework: SvelteKit in SPA mode.
  • Parsing: PDF.js for document rendering and text extraction.
  • OCR: Tesseract.js (WASM) for client-side optical character recognition.
  • Storage: Dexie.js for IndexedDB management, providing schema versioning and migrations.
  • Styling: Tailwind CSS v4.

Principles

  • Local-First: All processing happens on your device. No data is ever uploaded to a server.
  • Deterministic: Extraction logic (table clustering, text normalization) is designed to produce consistent results for identical inputs.
  • Deduplication: Files are identified by SHA-256 hashes to prevent redundant storage and extraction runs.
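Hash-based deduplication can be sketched with the standard Web Crypto API. `crypto.subtle.digest` is a real browser API; `sha256Hex`, `importIfNew`, and the in-memory `known` map (standing in for a lookup on `contentHash` in IndexedDB) are hypothetical names used only for this example.

```typescript
// Hex-encodes the SHA-256 digest of a file's bytes using Web Crypto.
async function sha256Hex(data: ArrayBuffer): Promise<string> {
  const digest = await globalThis.crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest), (b) =>
    b.toString(16).padStart(2, "0"),
  ).join("");
}

// Hypothetical import step: `known` maps content hash -> document name and
// stands in for a query against the stored contentHash values.
async function importIfNew(
  known: Map<string, string>,
  name: string,
  data: ArrayBuffer,
): Promise<{ hash: string; duplicate: boolean }> {
  const hash = await sha256Hex(data);
  if (known.has(hash)) {
    // Same bytes already imported: skip storage and extraction.
    return { hash, duplicate: true };
  }
  known.set(hash, name);
  return { hash, duplicate: false };
}
```

Keying on the content hash rather than the file name means a renamed copy of the same file is still recognized as a duplicate.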