A Rust CLI tool that converts Arabic PDFs to text using Google's Gemini API.
- ✅ Uploads full PDF once using resumable upload protocol
- ✅ Processes PDFs in page ranges with MapReduce pattern
- ✅ Concurrent processing with rate limit handling
- ✅ Preserves Arabic text formatting exactly
- ✅ Progress tracking for each page range
- ✅ Automatic retry on rate limit errors
- Single Upload: The entire PDF is uploaded once to the Gemini API
- Page Range Processing: The tool requests text extraction for specific page ranges (e.g., pages 1-5, 6-10, etc.)
- MapReduce Pattern: Multiple page ranges are processed concurrently
- Rate Limiting: Automatic delays and retries handle API rate limits
- Result Aggregation: All extracted text is combined in the correct order
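The steps above can be sketched with the standard library alone. This is a minimal illustration, not the tool's actual code: `page_ranges` and `extract_all` are hypothetical names, the `extract` closure stands in for the real Gemini request, and plain threads are used here instead of whatever async runtime the project relies on.

```rust
use std::thread;

/// Split `total_pages` into inclusive, 1-based ranges of at most `chunk` pages.
/// (Hypothetical helper; the real tool may compute ranges differently.)
fn page_ranges(total_pages: u32, chunk: u32) -> Vec<(u32, u32)> {
    assert!(chunk > 0, "chunk size must be positive");
    let mut ranges = Vec::new();
    let mut start = 1;
    while start <= total_pages {
        let end = (start + chunk - 1).min(total_pages);
        ranges.push((start, end));
        start = end + 1;
    }
    ranges
}

/// Map: run `extract` for every range on its own scoped thread.
/// Reduce: join the handles in range order, so the combined text keeps
/// page order no matter which request finishes first.
fn extract_all<F>(ranges: &[(u32, u32)], extract: F) -> String
where
    F: Fn(u32, u32) -> String + Sync,
{
    let extract = &extract;
    thread::scope(|s| {
        let handles: Vec<_> = ranges
            .iter()
            .map(|&(start, end)| s.spawn(move || extract(start, end)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<_>>()
            .join("\n")
    })
}
```

For a 12-page PDF and the default chunk of 5 pages, `page_ranges(12, 5)` yields the ranges 1-5, 6-10, and 11-12. Collecting results by handle order rather than completion order is what keeps the output in page order.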
The tool successfully:
- Uploads PDFs using Gemini's resumable upload API
- Processes page ranges concurrently (default: 5 pages per chunk)
- Handles rate limiting with 6-second delays between requests
- Retries failed requests up to 3 times with 30-second delays
- Combines results maintaining page order
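The retry behavior described above might look roughly like the following sketch. It assumes that "up to 3 times" means three retries after the first attempt, which the text does not state explicitly; `with_retry` is a hypothetical helper, and the blocking `thread::sleep` stands in for whatever delay mechanism the tool actually uses.

```rust
use std::{thread, time::Duration};

/// Call `request` until it succeeds, sleeping `delay` after each failure.
/// `max_retries` counts attempts made *after* the first one (an assumption).
fn with_retry<T, E>(
    max_retries: u32,
    delay: Duration,
    mut request: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut retries_left = max_retries;
    loop {
        match request() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if retries_left == 0 => return Err(err),
            Err(_) => {
                retries_left -= 1;
                thread::sleep(delay);
            }
        }
    }
}
```

With the values listed above, a call would pass `3` and `Duration::from_secs(30)`. The 6-second spacing between requests is a separate concern and is not modeled here.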
cargo install arabic_pdf_to_text
# Clone the repository
git clone https://github.com/RustSandbox/arabic_pdf_to_text.git
cd arabic_pdf_to_text
# Build the project
cargo build --release
# The binary will be at target/release/arabic_pdf_to_text
- Rust 1.70 or later
- Google Gemini API key (Get one here)
- Copy .env.example to .env:
cp .env.example .env
- Add your Gemini API key:
export GEMINI_API_KEY="your-api-key"
# Process a PDF
./arabic_pdf_to_text "path/to/arabic.pdf" -o output.txt
# With custom chunk size
./arabic_pdf_to_text "path/to/arabic.pdf" --chunk-size 524288 -o output.txt
# See all options
./arabic_pdf_to_text --help
For production use, consider:
- Using a PDF library to split PDFs at page boundaries instead of byte boundaries
- Implementing a queue system to respect rate limits
- Using a paid API tier for higher quotas
- Caching processed chunks to avoid reprocessing
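As one illustration of the caching suggestion, the sketch below stores each page range's text in its own file and skips extraction when that file already exists. Everything here is an assumption made for illustration: the file naming scheme, the `cache_path` and `cached_or_extract` helpers, and the `extract` closure standing in for the API call are not part of the tool.

```rust
use std::{
    fs, io,
    path::{Path, PathBuf},
};

/// Hypothetical cache file name for an inclusive page range.
fn cache_path(dir: &Path, start: u32, end: u32) -> PathBuf {
    dir.join(format!("pages_{start:05}-{end:05}.txt"))
}

/// Return cached text for the range if present; otherwise run `extract`,
/// store its result on disk, and return it.
fn cached_or_extract(
    dir: &Path,
    start: u32,
    end: u32,
    extract: impl FnOnce() -> io::Result<String>,
) -> io::Result<String> {
    let path = cache_path(dir, start, end);
    if let Ok(text) = fs::read_to_string(&path) {
        return Ok(text); // cache hit: no API request needed
    }
    let text = extract()?;
    fs::create_dir_all(dir)?;
    fs::write(&path, &text)?;
    Ok(text)
}
```

A real cache would probably also key on the input file (for example by hash) so that results from different PDFs never collide; that part is omitted here.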
cargo build --release
cargo test
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
While this software is free for any use including commercial, if you use it in a commercial product or service, we kindly request (but do not require) that you include the following attribution:
This product includes software developed by the arabic_pdf_to_text project
(https://github.com/RustSandbox/arabic_pdf_to_text)
- Google Gemini API for providing the PDF processing capabilities
- The Rust community for excellent libraries and tools