This offline Python project extracts text from PDF documents using Tesseract OCR and pdf2image, without relying on a cloud API such as IBM Watsonx.
- Convert PDFs to images
- Use Tesseract to extract text from each page
- Save output as a structured JSON file
- Fully local: no cloud credentials required
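The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the contents of text_extraction_local.py: the function names `extract_pages` and `save_pages` are hypothetical, and the use of the `pytesseract` package as the Python binding for Tesseract is an assumption. The converter and OCR callables are parameters so the flow can be tried without either library installed.

```python
import json
from pathlib import Path


def extract_pages(pdf_path, to_images=None, ocr=None):
    """Return [{"page": n, "text": ...}, ...], one entry per PDF page.

    Hypothetical helper: the real script may be organised differently.
    """
    if to_images is None:
        # pdf2image.convert_from_path returns one PIL image per page.
        from pdf2image import convert_from_path
        to_images = convert_from_path
    if ocr is None:
        # Assumption: pytesseract is the Tesseract binding in use.
        from pytesseract import image_to_string
        ocr = image_to_string

    images = to_images(str(pdf_path))
    return [
        {"page": number, "text": ocr(image)}
        for number, image in enumerate(images, start=1)
    ]


def save_pages(pages, json_path):
    """Write the page list to json_path as UTF-8 JSON."""
    Path(json_path).write_text(
        json.dumps(pages, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

With both libraries installed, `save_pages(extract_pages("sample/input.pdf"), "sample/output.json")` would run the whole flow on the sample file.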
.
├── text_extraction_local.py   # Main script
├── requirements_local.txt     # Local-only dependencies
├── README.md                  # Project documentation
└── sample/                    # Input/output folder
    ├── input.pdf
    └── output.json
- Install system tools
  - Install Tesseract OCR
  - Install poppler (required by pdf2image)
    - Windows: Poppler for Windows
    - macOS: brew install poppler
    - Linux: sudo apt install poppler-utils
- Clone the repo & install dependencies
    git clone https://github.com/yourusername/Local-PDF-Text-Extractor.git
    cd Local-PDF-Text-Extractor
    pip install -r requirements_local.txt
- Run the script
    python text_extraction_local.py
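If the script fails at startup, a likely cause is that one of the system tools is not on PATH. A small pre-flight check could look like the sketch below. The helper name `find_missing_tools` is hypothetical, and the binary names are assumptions based on the usual installs: `tesseract` for Tesseract, and `pdftoppm` and `pdfinfo` for the poppler utilities that pdf2image calls.

```python
import shutil

# Assumed executable names; adjust if your install uses different ones.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "pdfinfo")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
```

Tools installed outside PATH are reported as missing by this check, even though both libraries can be pointed at an explicit location.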
The output is a JSON array with one object per PDF page:
[
{
"page": 1,
"text": "Extracted text from page 1..."
},
...
]
MIT License. Free to use and modify.