Garuda

An offline-first PDF and image text extraction tool built with SvelteKit.

Features

Extraction of selectable text layers from PDFs.
OCR for scanned pages (image-only PDFs) using Tesseract.js.
Support for PNG, JPEG, and WebP.
On-device storage for document management (import, list, delete).
Search across your entire "local library" (titles and extracted text).
Best-effort detection of tables (CSV export) and AcroForm fields.

Architecture

Concurrency Model

Main Thread: Manages UI state and user interactions.
Worker Threads: Dedicated workers for PDF parsing (PDF.js) and OCR (Tesseract).
Orchestration: A central orchestrator sequences tasks, manages cancellation, and reports progress to the main thread.
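The orchestration role described above can be illustrated with a minimal sketch. It is not Garuda's actual orchestrator: `Step`, `Progress`, and `runSequence` are hypothetical names, and each step is modeled as a plain async function standing in for a request to a PDF or OCR worker. Cancellation uses the standard `AbortSignal`.

```typescript
// Hypothetical names throughout; none of these identifiers come from the Garuda source.
type Progress = { step: string; done: number; total: number };

// One unit of work, standing in for a message round-trip to a worker.
type Step = {
  name: string;
  run: (signal: AbortSignal) => Promise<void>;
};

type Outcome = "completed" | "cancelled";

// Runs steps strictly in order, reports progress after each one,
// and stops before the next step once cancellation has been requested.
async function runSequence(
  steps: Step[],
  signal: AbortSignal,
  onProgress: (p: Progress) => void,
): Promise<Outcome> {
  let done = 0;
  for (const step of steps) {
    // Checked between steps: a step that is already running is allowed to finish
    // unless it observes the signal itself.
    if (signal.aborted) return "cancelled";
    await step.run(signal);
    done += 1;
    onProgress({ step: step.name, done, total: steps.length });
  }
  return "completed";
}
```

A caller would create an `AbortController`, pass `controller.signal` to `runSequence`, and call `controller.abort()` when the user cancels. Passing the signal into each step lets long-running work (such as OCR on a large page) stop early as well.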

Offline Strategy

PWA: Service worker caching for the application shell and OCR language data.
Local Persistence: All document metadata, original blobs, extraction results, and derived artifacts are stored in IndexedDB.
Search Index: A local-only index built from extracted content, rebuildable on startup.
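As an illustration of a rebuildable local index, the sketch below builds an in-memory inverted index from stored text and answers queries against it. The names `IndexEntry`, `buildIndex`, and `search` are hypothetical, the tokenizer is deliberately ASCII-only, and the AND-matching rule is an assumption rather than Garuda's documented behavior.

```typescript
// Hypothetical shape: one entry per title or page of extracted text.
interface IndexEntry {
  documentId: string;
  text: string;
}

// token -> IDs of documents containing that token
type SearchIndex = Map<string, Set<string>>;

// Lowercase, then split on anything that is not an ASCII letter or digit.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Rebuilds the whole index from persisted entries, e.g. on startup.
function buildIndex(entries: IndexEntry[]): SearchIndex {
  const index: SearchIndex = new Map();
  for (const entry of entries) {
    for (const token of tokenize(entry.text)) {
      let ids = index.get(token);
      if (!ids) {
        ids = new Set<string>();
        index.set(token, ids);
      }
      ids.add(entry.documentId);
    }
  }
  return index;
}

// Returns sorted IDs of documents that contain every query token.
function search(index: SearchIndex, query: string): string[] {
  const tokens = tokenize(query);
  let result: string[] | null = null;
  for (const token of tokens) {
    const ids = index.get(token) ?? new Set<string>();
    result = result === null ? Array.from(ids) : result.filter((id) => ids.has(id));
  }
  return (result ?? []).sort();
}
```

Because the index is derived entirely from stored content, it never needs to be persisted itself: losing it costs only a rebuild.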

Data Model

Document Persistence (IndexedDB)

documents: Metadata including kind (pdf/image), name, size, and contentHash.
blobs: Original file content associated with a document ID.
runs: Execution history for extraction modes (pdf_text, ocr_image, ocr_pdf).
pages: Extracted text and stats per page index.
tables: Structured row/column data and confidence scores.
form_fields: AcroForm metadata (name, value, type).
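The stores listed above might be typed roughly as follows. This is a sketch under stated assumptions: only `kind`, `name`, `size`, `contentHash`, the three run modes, and the form field properties come from the list; every key and foreign-key name (`id`, `documentId`, `runId`, `charCount`, and so on) is hypothetical. The `storeDefinitions` object uses Dexie's schema string syntax and could be passed to `db.version(1).stores(...)`, but the chosen indexes are guesses. `tableToCsv` shows one way the row/column data could feed the CSV export.

```typescript
type DocumentKind = "pdf" | "image";
type RunMode = "pdf_text" | "ocr_image" | "ocr_pdf";

// Key and foreign-key field names below are assumptions, not Garuda's schema.
interface DocumentRecord { id: string; kind: DocumentKind; name: string; size: number; contentHash: string }
interface BlobRecord { documentId: string; data: Uint8Array } // could equally be a Blob
interface RunRecord { id: string; documentId: string; mode: RunMode }
interface PageRecord { runId: string; pageIndex: number; text: string; charCount: number }
interface TableRecord { id?: number; runId: string; rows: string[][]; confidence: number }
interface FormFieldRecord { id?: number; documentId: string; name: string; value: string; type: string }

// Dexie-style definitions: the first entry is the primary key, the rest are indexes.
// "++id" is an auto-incremented key; "[a+b]" is a compound key.
const storeDefinitions = {
  documents: "id, contentHash, name",
  blobs: "documentId",
  runs: "id, documentId, mode",
  pages: "[runId+pageIndex]",
  tables: "++id, runId",
  form_fields: "++id, documentId, name",
};

// Serializes a table's rows as CSV, quoting cells that contain
// commas, double quotes, or line breaks (RFC 4180 style).
function tableToCsv(table: TableRecord): string {
  const escapeCell = (cell: string): string =>
    /[",\r\n]/.test(cell) ? `"${cell.replace(/"/g, '""')}"` : cell;
  return table.rows.map((row) => row.map(escapeCell).join(",")).join("\r\n");
}
```

Keeping blobs in their own store means that listing documents does not require loading file contents.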

Tech Stack

Framework: SvelteKit in SPA mode.
Parsing: PDF.js for document rendering and text extraction.
OCR: Tesseract.js (WASM) for client-side optical character recognition.
Storage: Dexie.js for IndexedDB management, providing schema versioning and migrations.
Styling: Tailwind CSS v4.

Principles

Local-First: All processing happens on your device. No data is ever uploaded to a server.
Deterministic: Extraction logic (table clustering, text normalization) is designed to produce consistent results for identical inputs.
Deduplication: Files are identified by SHA-256 hashes to prevent redundant storage and extraction runs.
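Hash-based deduplication can be sketched as below. This is an illustration, not Garuda's code: `sha256Hex`, `StoredDocument`, and `DocumentRegistry` are hypothetical names, and an in-memory `Map` stands in for an IndexedDB lookup on `contentHash`. The hashing uses the standard Web Crypto `crypto.subtle.digest` call.

```typescript
// Hex-encoded SHA-256 of a file's bytes, e.g. from `await file.arrayBuffer()`.
async function sha256Hex(data: ArrayBuffer): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest))
    .map((b) => ("0" + b.toString(16)).slice(-2)) // two hex digits per byte
    .join("");
}

// Hypothetical record shape; only contentHash is named in the data model.
interface StoredDocument {
  id: string;
  name: string;
  contentHash: string;
}

// In-memory stand-in for a lookup on the documents store's contentHash index.
class DocumentRegistry {
  private byHash = new Map<string, StoredDocument>();

  // Returns the existing document when the hash is already known,
  // so the blob is stored once and extraction is not repeated.
  addIfNew(name: string, contentHash: string): { doc: StoredDocument; created: boolean } {
    const existing = this.byHash.get(contentHash);
    if (existing) return { doc: existing, created: false };
    const doc: StoredDocument = { id: `doc-${this.byHash.size + 1}`, name, contentHash };
    this.byHash.set(contentHash, doc);
    return { doc, created: true };
  }
}
```

On import, the caller would compute `sha256Hex` for the file and only store the blob and start an extraction run when `created` is `true`. Renamed copies of the same file therefore map to the same document.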
