A full-stack PDF reading assistant with Text-to-Speech (TTS), Retrieval-Augmented Generation (RAG), and AI chat, all designed to run privately and locally on your own machine. Upload a PDF, have it read aloud with synchronized text highlighting, and chat with your document using AI. Everything works for free with open-source models served through Docker Model Runner, Ollama, or LMStudio, so no cloud services or subscriptions are required.
- Unified Experience: Seamlessly switch between reading the PDF and listening to chat responses
- Multiple PDF Tabs: Open and switch between multiple PDFs using a tabbed interface at the top of the viewer
- Intelligent Text Processing: Robust sentence segmentation with support for Markdown and non-punctuated text
- High-Quality TTS: Local speech synthesis using Kokoro-82M
- Visual Tracking: Synchronized sentence highlighting in PDF and message highlighting in Chat
- Interactive Navigation: Double-click any sentence in the PDF or any message in the Chat to start playback
- Centralized Controls: Unified player in the footer manages all audio sources (Speed 0.5x - 2.0x)
- Threaded Chat: Organize conversations into threads with persistent SQLite storage for messages and file associations
- Per-Thread Collections: Each thread has its own isolated vector collection in Qdrant, locked to a specific embedding model
- Dual-Search Retrieval: AI searches both document chunks AND past Q&A pairs (semantic memory) simultaneously
- Semantic Recollection: The UI highlights which past chat messages were "recalled" and used by the AI to answer the current question
- Internet Search (DuckDuckGo): Optionally augment answers with live web search results for up-to-date or external information
- Context Management: Intelligent token budgeting ensures the most relevant PDF chunks, recent history, and semantic memories fit the LLM context window (see the sketch below)
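The token budgeting mentioned in the last bullet can be pictured as greedy packing: take the most relevant items first and stop adding anything that no longer fits. The following is an illustrative sketch only, not the project's actual code; the 4-characters-per-token estimate, the budget size, and the helper names are assumptions.

```python
# Illustrative sketch of context budgeting (not the project's actual code).
# Assumes a rough 4-characters-per-token estimate; a real implementation
# would use the model's tokenizer.

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def build_context(pdf_chunks, chat_memories, recent_history, budget: int = 4000) -> str:
    """Greedily pack items (assumed sorted by relevance) until the budget is spent."""
    selected, used = [], 0
    for item in [*pdf_chunks, *chat_memories, *recent_history]:
        cost = count_tokens(item)
        if used + cost > budget:
            continue  # skip items that no longer fit
        selected.append(item)
        used += cost
    return "\n\n".join(selected)
```

A real implementation would likely reserve separate budgets for chunks, memories, and recent history, but the packing idea is the same.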
You can enable Internet Search in the chat panel to let the AI answer questions using both your PDF and live web results (via DuckDuckGo). This is useful for:
- Getting up-to-date facts, news, or background not present in your PDF
- Clarifying ambiguous or missing information
How it works:
- When enabled, the app performs a DuckDuckGo search for your question and injects the top results into the LLM's context window, along with PDF content.
- The LLM then answers using both sources.
Privacy:
- Queries are sent to DuckDuckGo only when Internet Search is enabled.
- No PDF content is sent to DuckDuckGo—only your question.
Rate Limits:
- DuckDuckGo and other free search APIs may rate limit requests if used too frequently.
- If rate limited, the app will notify you and fall back to PDF-only answers.
Model Compatibility:
- Any OpenAI-compatible LLM can use this feature. The search results are injected as plain text context, as sketched below, so no special model/tool-calling support is required.
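Because the results are injected as plain text, the step can be approximated as simple prompt assembly. This is a minimal sketch under stated assumptions: `search_web` is a stand-in for the app's DuckDuckGo call, and the prompt wording is illustrative.

```python
# Illustrative sketch: fold web results into the LLM prompt as plain text.
# search_web() is a stand-in for the app's DuckDuckGo call; swap in a real
# search client in practice.

def search_web(query: str, k: int = 3) -> list[dict]:
    # Placeholder results so the sketch runs without a network call.
    return [{"title": "Example result", "snippet": "…", "url": "https://example.org"}][:k]

def build_prompt(question: str, pdf_context: str, use_web_search: bool) -> str:
    parts = [f"PDF context:\n{pdf_context}"]
    if use_web_search:
        web_text = "\n".join(
            f"- {r['title']}: {r['snippet']} ({r['url']})"
            for r in search_web(question)
        )
        parts.append(f"Web search results:\n{web_text}")
    parts.append(f"Question: {question}")
    # Plain text, so no tool-calling support is required from the model.
    return "\n\n".join(parts)
```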
- Unified Navigation: Double-click sentences or chat bubbles to start reading immediately
- Dynamic Visual Feedback: PDF sentence highlighting and Chat bubble illumination during playback
- Resizable Chat Panel: Drag to adjust the chat interface width (300-800px)
- Auto-Scroll: Both the PDF and the Chat automatically keep the content currently being read in view
- Model Selection: Centralized embedding model selection and dynamic LLM discovery
All features of this app are designed to run entirely on your own machine or laptop, using only local resources by default. Document processing, AI chat, TTS, and chat/thread management all happen locally—no data is sent to external servers unless you explicitly enable Internet Search.
Privacy Note:
- When Internet Search is enabled, only your question (not your PDF content or chat history) is sent to DuckDuckGo for web search. All other processing, including PDF parsing, vector search, LLM inference, and chat/thread/message storage, remains local and private.
- If Internet Search is disabled, no data ever leaves your machine.
You can use free, open-source models with Docker Model Runner, Ollama, or LMStudio, so there are no required cloud costs or subscriptions.
┌─────────────────────────────────────────────────────────────────────────────┐
│ Docker Compose │
├─────────────────┬─────────────────┬─────────────────┬───────────────────────┤
│ Frontend │ Backend │ RAG Service │ Qdrant │
│ (Next.js) │ (FastAPI) │ (FastAPI) │ (Vector DB) │
│ Port: 3000 │ Port: 8000 │ Port: 8001 │ Port: 6333 │
└─────────────────┴─────────────────┴─────────────────┴───────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ DMR / Ollama / LMStudio / LLM │
│ (OpenAI-compatible) │
│ Port: 12434 (default) │
└──────────────────────────────────────────────┘
| Service | Port | Description |
|---|---|---|
| Frontend | 3000 | Next.js React app with PDF viewer, chat UI, and thread management |
| Backend | 8000 | FastAPI server for PDF processing and TTS |
| RAG Service | 8001 | FastAPI server for document indexing, AI chat, thread/message/file management |
| Qdrant | 6333 | Vector database for semantic and memory search |
| DMR/Ollama/LMStudio | 12434 | Local LLM server (external, user-provided) |
- Docker and Docker Compose
- Local LLM Server: The app is configured to use an OpenAI-compatible API on port `12434` by default.
  - Option A: DMR (Default) - Built into Docker Desktop.
  - Option B: Ollama - Requires running on port `12434` or updating the configuration.
  - Option C: LMStudio - Desktop app, exposes an OpenAI-compatible API (default: `http://localhost:1234/v1`).
- LLM Model: e.g., `ai/qwen3:latest` (DMR), `llama3` (Ollama), or any chat model supported by LMStudio
- Embedding Model: e.g., `ai/nomic-embed-text-v1.5:latest` (DMR), `nomic-embed-text` (Ollama), or any embedding model supported by LMStudio
```
git clone https://github.com/raghu13590/askpdf.git
cd askpdf
```

At the root of the project directory (the same folder as `docker-compose.yml`), create a file named `.env` with the following content:

```
LLM_API_URL=http://host.docker.internal:12434
```

This variable configures the LLM server endpoint. If you are using Ollama on its default port, set:

```
LLM_API_URL=http://host.docker.internal:11434
```

Note: After editing `.env`, restart your containers for changes to take effect.
The application requires an OpenAI-compatible API for LLM and embeddings. You can use Docker Model Runner (DMR), Ollama, or LMStudio as your local LLM server.
- Ensure Docker Desktop is running and the DMR extension is installed.
- Set `LLM_API_URL` in your `.env` file to:

  ```
  LLM_API_URL=http://host.docker.internal:12434
  ```

- Download the required models:
  - LLM Model (e.g., `ai/qwen3:latest`): Qwen3 on Hugging Face
  - Embedding Model (e.g., `ai/nomic-embed-text-v1.5:latest`): nomic-embed-text-v1.5 on Hugging Face
- Import models into DMR:
  - Open Docker Desktop, go to the DMR extension, and use the "Import Model" button to add the downloaded models.
  - Or, use the DMR CLI: `dmr import <path-to-model-directory>`
- Verify both models are listed as Ready in the DMR UI.
1. Download and install LMStudio on your machine.
2. Launch LMStudio and load your desired LLM and embedding models.
3. LMStudio exposes an OpenAI-compatible API at `http://localhost:1234/v1` by default.
4. Edit your `.env` file and set:

   ```
   LLM_API_URL=http://host.docker.internal:1234/v1
   ```

   (If running outside Docker, use `http://localhost:1234/v1`.)
5. Restart your containers for changes to take effect.
Note: LMStudio supports a wide range of models and provides a user-friendly interface for model management. Ensure the models you want to use are loaded and available in LMStudio.
Ollama runs on port 11434 by default. The easiest way to use Ollama with this app is to update your .env file (recommended):
Note: Ollama supports the OpenAI-compatible API (used by this app) starting from version 0.1.34 and above. Ensure your Ollama installation is up to date.
Option 1 (Recommended): Change the API endpoint in your .env file
Edit your `.env` file at the project root and set:

```
LLM_API_URL=http://host.docker.internal:11434
```

This will direct the app to use Ollama's default port. (If running outside Docker, you can use `http://localhost:11434`.)

Note: After changing `.env`, restart your containers for the new value to take effect.
Option 2: Change Ollama's port to 12434
If you prefer, you can start Ollama on port 12434 to match the default expected by the app:
```
# Start Ollama on the expected port
OLLAMA_HOST=0.0.0.0:12434 ollama serve

# In a new terminal, pull the models
ollama pull llama3
ollama pull nomic-embed-text
```

Once your LLM server is configured, build and start all services:

```
docker-compose up --build
```

- Main App: http://localhost:3000
- Backend API: http://localhost:8000
- RAG API: http://localhost:8001
- Qdrant Dashboard: http://localhost:6333/dashboard
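To confirm that all four services came up after `docker-compose up`, a quick check from the host is enough. This sketch only tests that the published ports listed above accept TCP connections; it does not exercise any specific API route.

```python
# Quick reachability check for the published service ports.
import socket

SERVICES = {
    "Frontend": 3000,
    "Backend": 8000,
    "RAG Service": 8001,
    "Qdrant": 6333,
}

for name, port in SERVICES.items():
    with socket.socket() as s:
        s.settimeout(2)
        status = "up" if s.connect_ex(("localhost", port)) == 0 else "DOWN"
    print(f"{name:12} (port {port}): {status}")
```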
- Manage Threads: Use the Sidebar to create new threads or select existing ones.
- Select Embedding Model: When creating a new thread, choose the embedding model. This model is locked to the thread for consistency.
- Upload PDFs: Within a thread, click "Upload PDF". You can upload multiple PDFs to the same thread.
- Switch Tabs: Different PDFs in the same thread appear as tabs at the top of the viewer.
- PDF Processing: Each uploaded PDF is parsed, sentences extracted, and indexed for RAG within that thread's collection.
- Play Audio: Click "Play" at the top to start text-to-speech.
- Navigate: Use playback controls or double-click any sentence in the PDF or any chat bubble to jump audio to that point.
- Adjust Voice: Select different voice styles and adjust playback speed (0.5x to 2.0x).
- Auto-Scroll: The app automatically keeps the current sentence in view.
- Select LLM Model: Choose an LLM from the chat panel dropdown.
- (Optional) Enable Internet Search: Toggle the "Use Internet Search" switch above the chat input to allow the AI to use live web results.
- Ask Questions: Type your question. The AI will search both the current PDFs and past conversations in the current thread.
- Semantic Identification: If the AI uses past conversations to answer, the relevant messages will glow with a purple border in the chat history.
- Follow-up: The system maintains context for follow-up questions within the thread.
- Read AI Answers: Double-click any assistant chat bubble to have the response read aloud.
| Technology | Purpose |
|---|---|
| FastAPI | Web framework for REST APIs |
| PyMuPDF (fitz) | PDF parsing with character-level coordinates |
| spaCy | NLP for sentence segmentation |
| Kokoro | Neural TTS with 82M parameters |
| Technology | Purpose |
|---|---|
| FastAPI | Web framework |
| LangChain | LLM/Embedding integration |
| LangGraph | Stateful RAG workflow |
| Qdrant Client | Vector database operations |
| Technology | Purpose |
|---|---|
| Next.js | React framework |
| Material-UI (MUI) | UI components |
| react-pdf | PDF rendering |
| react-markdown | Chat message rendering |
askpdf/
├── docker-compose.yml # Multi-service orchestration
├── backend/
│ ├── Dockerfile
│ ├── requirements.txt
│ └── app/
│ ├── main.py # FastAPI app, upload & TTS endpoints
│ ├── pdf_parser.py # PyMuPDF text extraction with coordinates
│ ├── nlp.py # spaCy sentence segmentation
│ └── tts.py # Kokoro TTS synthesis
├── rag_service/
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── main.py # FastAPI app, index, chat, thread, file, and message endpoints
│ ├── rag.py # Document chunking & indexing (thread-aware)
│ ├── agent.py # LangGraph RAG workflow
│ ├── models.py # LLM/Embedding model clients
│ ├── database.py # SQLite thread/message/file management
│ └── vectordb/
│ ├── base.py # Abstract vector DB interface
│ └── qdrant.py # Qdrant adapter implementation (threaded collections)
└── frontend/
├── Dockerfile
├── package.json
└── src/
├── pages/
│ └── index.tsx # Main application page
├── components/
│ ├── PdfUploader.tsx # File upload with model selection
│ ├── PdfViewer.tsx # PDF rendering with overlays
│ ├── PlayerControls.tsx # Audio playback controls
│ ├── ChatInterface.tsx # RAG chat UI (thread-aware)
│ ├── ThreadSidebar.tsx # Thread management UI
│ └── TextViewer.tsx # Alternative text display
└── lib/
├── api.ts # Backend & RAG API client (thread/message/file)
└── tts-api.ts # TTS API client
The application expects an OpenAI-compatible API at the URL specified by LLM_API_URL in your .env file (default: http://host.docker.internal:12434).
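Any OpenAI-compatible client can be pointed at this server for a quick sanity check. The sketch below uses the official `openai` Python package with the LMStudio default base URL as an example; adjust `base_url` and the model name for DMR or Ollama, whose base paths differ.

```python
# Minimal check that the local LLM server answers OpenAI-style requests.
# base_url here is the LMStudio default; change it to match your server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

# List the models the server exposes.
print([m.id for m in client.models.list().data])

# Send a one-off chat completion.
reply = client.chat.completions.create(
    model="llama3",  # any chat model your server has loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(reply.choices[0].message.content)
```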
Upload a PDF and extract sentences with bounding boxes.
Request: multipart/form-data
file: PDF file
embedding_model: Model name for RAG indexing
Response:
```
{
  "sentences": [
    {
      "id": 0,
      "bboxes": [
        {"page": 1, "x": 72, "y": 700, "width": 50, "height": 12, "page_height": 792, "page_width": 612}
      ]
    }
  ],
  "pdfUrl": "/abc123.pdf"
}
```

List available TTS voice styles.
Response:
```
{
  "voices": ["M1.json", "F1.json", "M2.json"]
}
```

Synthesize speech for text.
Request:
```
{
  "text": "Text to synthesize",
  "voice": "M1.json",
  "speed": 1.0
}
```

Response:

```
{
  "audioUrl": "/data/audio/tmp_xyz.wav"
}
```

Index document text into vector database (legacy, single collection).
Create, list, update, and delete chat threads. Each thread has its own context, files, and messages.
Add a file to a thread and trigger background indexing. Associates PDFs with threads for context-aware chat.
Chat with a thread using semantic memory (retrieves both PDF chunks and previous chat answers for context).
Request:
```
{
  "thread_id": "abc123",
  "question": "What is this document about?",
  "llm_model": "ai/qwen3:latest",
  "use_web_search": false
}
```

Response:

```
{
  "answer": "This document discusses...",
  "used_chat_ids": ["msg1", "msg2"],
  "pdf_sources": [ ... ]
}
```

List and delete messages in a thread. Supports per-thread chat history management.
Fetch available models from LLM server.
Health check endpoint.
All environment variables, including `LLM_API_URL`, are managed via a `.env` file at the project root. This file is loaded by both Docker Compose and the Python services.

| Variable | Service | Default | Description |
|---|---|---|---|
| `NEXT_PUBLIC_API_URL` | Frontend | `http://localhost:8000` | Backend API URL |
| `NEXT_PUBLIC_RAG_API_URL` | Frontend | `http://localhost:8001` | RAG API URL |
| `RAG_SERVICE_URL` | Backend | `http://rag-service:8000` | Internal RAG service URL |
| `QDRANT_HOST` | RAG Service | `qdrant` | Qdrant hostname |
| `QDRANT_PORT` | RAG Service | `6333` | Qdrant port |
| `LLM_API_URL` | RAG Service | `http://host.docker.internal:12434` | LLM server URL (change to `...:11434` for default Ollama) |
Voice styles (voices) are handled by the Kokoro engine. Available options are discovered dynamically from the system and populated in the UI dropdown.
In `backend/app/tts.py`:

- `total_step`: Diffusion steps (default: 5) - higher = better quality, slower
- `speed`: Playback speed (0.5 - 2.0)
User uploads PDF
↓
Backend: Save PDF → Extract text + coordinates (PyMuPDF)
↓
Backend: Split into sentences (spaCy)
↓
Backend: Map sentences to bounding boxes
↓
Backend: Trigger async RAG indexing (per-thread if using threads)
↓
RAG Service: Chunk text → Generate embeddings → Store in Qdrant (threaded collections)
↓
Frontend: Display PDF with clickable sentence overlays
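Conceptually, the indexing step is chunk, embed, upsert. The sketch below shows that shape with `qdrant-client`; the collection name, vector size, payload fields, and the `embed` stub are illustrative assumptions, not the RAG service's actual code.

```python
# Rough sketch of the indexing step (chunk -> embed -> upsert into the
# thread's collection). Not the project's actual implementation.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed(text: str) -> list[float]:
    # Stand-in for the configured embedding model (e.g. nomic-embed-text);
    # returns a fixed-size dummy vector so the sketch runs end to end.
    return [0.0] * 768

client = QdrantClient(host="localhost", port=6333)
collection = "thread_abc123"  # one collection per thread (name is illustrative)

client.recreate_collection(
    collection_name=collection,
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

chunks = ["First chunk of PDF text...", "Second chunk..."]
client.upsert(
    collection_name=collection,
    points=[
        PointStruct(id=i, vector=embed(c), payload={"text": c, "type": "pdf_chunk"})
        for i, c in enumerate(chunks)
    ],
)
```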
User creates/selects thread
↓
User asks question in thread
↓
RAG Service: Embed question
↓
RAG Service: Search Qdrant for top-5 relevant PDF chunks (thread collection)
↓
RAG Service: Search Qdrant for relevant previous chat answers (semantic memory)
↓
RAG Service: Build prompt (system + context + history + question + semantic memory)
↓
RAG Service: Call LLM via OpenAI-compatible API
↓
Frontend: Display markdown-rendered answer in thread
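The dual retrieval in the middle of this flow (PDF chunks plus past answers) amounts to two filtered vector searches against the thread's collection. A hedged sketch with `qdrant-client`; the payload field names and filter layout are assumptions about how points might be tagged, not the service's actual schema.

```python
# Sketch of the dual-search step: one query vector, two filtered searches.
# Payload tags ("pdf_chunk" / "chat_answer") are illustrative only.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(host="localhost", port=6333)
collection = "thread_abc123"
query_vector = [0.0] * 768  # embedding of the user's question

def search(kind: str, limit: int):
    return client.search(
        collection_name=collection,
        query_vector=query_vector,
        query_filter=Filter(must=[FieldCondition(key="type", match=MatchValue(value=kind))]),
        limit=limit,
    )

pdf_hits = search("pdf_chunk", limit=5)       # top-5 document chunks
memory_hits = search("chat_answer", limit=3)  # semantically similar past answers
context = [hit.payload["text"] for hit in pdf_hits + memory_hits]
```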
User clicks Play or double-clicks sentence
↓
Frontend: Request /api/tts with sentence text
↓
Backend: Kokoro synthesizes audio → WAV file
↓
Frontend: Play audio, highlight current sentence
↓
On audio end: Auto-advance to next sentence
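The same TTS route can be exercised outside the UI with the request and response shapes documented in the API section. This is a sketch only: the base URL (backend on port 8000) and the exact route prefix are assumptions; the flow above shows the path as `/api/tts`.

```python
# Call the TTS route directly with the documented request shape.
import requests

resp = requests.post(
    "http://localhost:8000/api/tts",  # assumed base URL; adjust for your setup
    json={"text": "Hello from the PDF reader.", "voice": "M1.json", "speed": 1.0},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["audioUrl"])  # e.g. /data/audio/tmp_xyz.wav
```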
The application uses Docker Compose with four services:
- frontend: Next.js dev server with hot reload
- backend: FastAPI with Kokoro TTS models mounted (cloned from Hugging Face at build)
- rag-service: FastAPI with LangChain/LangGraph
- qdrant: Official Qdrant image with persistent storage
- `qdrant_data`: Persistent vector storage
- Source directories mounted for development hot-reload
Contributions are welcome! Please feel free to submit a Pull Request.
This project uses the following third-party technologies:
- Kokoro - Text-to-speech model
- spaCy - Natural language processing
- LangChain - LLM framework
- LangGraph - Stateful AI workflows
- Qdrant - Vector database
- FastAPI - Web framework
- Next.js - React framework
- hexgrad for the amazing Kokoro-82M model
- spaCy for robust NLP capabilities
- LangChain team for the excellent LLM framework
- Qdrant for the powerful vector database
- The open-source community for all the amazing tools
For questions, issues, or suggestions, please open an issue on the GitHub repository.