A lightweight, offline Python tool that extracts text from PDF documents using Tesseract OCR and outputs structured JSON. No internet or cloud services required.

ObliviousK0t/tesseract-pdf-parser

🧠 Local Text Extraction from PDF using Tesseract OCR

This offline Python project extracts text from PDF documents with Tesseract OCR and pdf2image, without relying on a cloud API such as IBM Watsonx.


📦 Features

  • Convert PDFs to images
  • Use Tesseract to extract text from each page
  • Save output as a structured JSON file
  • Fully local: no cloud credentials required
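
The three steps listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of `text_extraction_local.py`: the function names `ocr_pdf`, `build_records`, and `save_records` and the `dpi=300` default are assumptions made for this example, and it assumes the `pdf2image` and `pytesseract` packages are installed.

```python
import json
from pathlib import Path


def ocr_pdf(pdf_path, dpi=300):
    """Render each PDF page to an image and OCR it with Tesseract.

    Hypothetical helper; dpi=300 is an assumed default, not a project setting.
    """
    # Imported lazily so the pure helpers below work without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(str(pdf_path), dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def build_records(page_texts):
    """Turn a list of per-page strings into page/text records (pages start at 1)."""
    return [
        {"page": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]


def save_records(records, json_path):
    """Write the records to disk as indented UTF-8 JSON."""
    payload = json.dumps(records, indent=2, ensure_ascii=False)
    Path(json_path).write_text(payload, encoding="utf-8")
```

With these helpers, `save_records(build_records(ocr_pdf("sample/input.pdf")), "sample/output.json")` would run the whole pipeline on the sample folder.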

📂 Folder Structure

.
├── text_extraction_local.py       # Main script
├── requirements_local.txt         # Local-only dependencies
├── README.md                      # Project documentation
└── sample/                        # Input/output folder
    ├── input.pdf
    └── output.json

🔧 Installation & Setup

  1. Install system tools: Tesseract OCR and Poppler (which pdf2image needs to render PDF pages)

  2. Clone the repo & install dependencies

    git clone https://github.com/yourusername/Local-PDF-Text-Extractor.git
    cd Local-PDF-Text-Extractor
    pip install -r requirements_local.txt

  3. Run the script

    python text_extraction_local.py
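
Before running the script, it can help to confirm that the system tools are reachable. The sketch below is a hypothetical preflight check, not part of the project: `missing_tools` is a name invented here, and it assumes the Tesseract binary is called `tesseract` and that Poppler provides `pdftoppm` on the PATH.

```python
import shutil


def missing_tools(required=("tesseract", "pdftoppm")):
    """Return the names from `required` that are not found on PATH."""
    return [name for name in required if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("All system tools found.")
```

A tool reported as missing may still work if its location is configured explicitly rather than through PATH, so treat the result as a hint.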

📈 Output Format (JSON)

Each PDF page becomes one object in a JSON array, holding the page number and the text extracted from that page:

[
  {
    "page": 1,
    "text": "Extracted text from page 1..."
  },
  ...
]
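
As an illustration of consuming this format, the sketch below parses the array, checks that each entry has the two documented keys, and joins the text in page order. The helper names `load_pages` and `full_text` and the inline sample string are assumptions for this example only.

```python
import json


def load_pages(json_text):
    """Parse the output JSON and check each entry has an int page and str text."""
    pages = json.loads(json_text)
    if not isinstance(pages, list):
        raise ValueError("expected a JSON array of page objects")
    for entry in pages:
        if not isinstance(entry, dict):
            raise ValueError(f"malformed page entry: {entry!r}")
        if not isinstance(entry.get("page"), int) or not isinstance(entry.get("text"), str):
            raise ValueError(f"malformed page entry: {entry!r}")
    return pages


def full_text(pages):
    """Join page texts in page order, separated by blank lines."""
    ordered = sorted(pages, key=lambda entry: entry["page"])
    return "\n\n".join(entry["text"] for entry in ordered)


# Made-up sample data in the documented shape, deliberately out of order.
sample = '[{"page": 2, "text": "second"}, {"page": 1, "text": "first"}]'
print(full_text(load_pages(sample)))
```

For a real run, the same functions could be applied to the contents of `sample/output.json`.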

🔒 License

MIT License. Free to use and modify.
