Advanced semantic document search using multilingual neural embeddings and vector database technology.
Search any PDF in any language.
DocVision-AI is a state-of-the-art document intelligence system that leverages neural semantic embeddings and vector databases to enable contextual, cross-lingual document search. The system processes PDF documents through advanced chunking algorithms and generates high-dimensional vector representations using transformer-based neural networks, enabling semantic understanding beyond traditional keyword matching.
🧠 Multilingual semantic embeddings using paraphrase-multilingual-MiniLM-L12-v2 from SentenceTransformers
⚡ Asynchronous processing with multi-threaded page embedding generation
🔄 Advanced chunking with configurable overlap for context preservation
🗄️ Vector database integration with Milvus for high-performance similarity search
🔍 Cross-lingual semantic search - query in any language, find relevant content regardless of document language
🎯 Context-aware retrieval with document highlighting and page-level precision
📊 Real-time performance - average search latency ~0.85 seconds on RTX 3060 (8-page documents)
🧪 Comprehensive unit test coverage for all neural processing pipelines
📄 Interactive PDF viewer with semantic highlighting and result visualization
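The chunking with configurable overlap mentioned above can be illustrated with a minimal sketch. This is not the project's actual chunker: the function name `chunk_text`, its parameters, and the character-based window are assumptions made for illustration (a real implementation might split on tokens or sentences instead).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows; neighbours share `overlap` characters.

    Hypothetical helper for illustration, not the project's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


page = "abcdefghijklmnopqrstuvwxyz"
for piece in chunk_text(page, chunk_size=10, overlap=3):
    print(piece)
# prints:
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.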
- Document ingestion - PDF upload and page-level content extraction
- Asynchronous text processing - Multi-threaded chunking with overlap preservation
- Neural embedding generation - Transformer-based semantic vector encoding using multilingual models
- Vector database storage - High-dimensional embeddings indexed in Milvus vector database
- Semantic query processing - Real-time query embedding and similarity matching
- Contextual result retrieval - Chunk-level text extraction with document highlighting
- Interactive visualization - PDF viewer with highlighted semantic matches
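The embedding, storage, and query steps above can be sketched end to end using only the standard library. Everything here is a stand-in chosen so the example runs without extra dependencies: `toy_embed` (hashed character trigrams) replaces the multilingual transformer model, a plain Python list replaces the Milvus collection, and all function names are hypothetical. The toy embedder has no semantic or cross-lingual ability; only the control flow is meant to carry over.

```python
import asyncio
import math
import zlib
from typing import List, Tuple

DIM = 64  # toy dimensionality, chosen arbitrarily for this sketch


def toy_embed(text: str) -> List[float]:
    """Placeholder encoder: hashed character-trigram counts, L2-normalized."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 is deterministic across runs, unlike the built-in hash() for str
        vec[zlib.crc32(trigram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


async def embed_pages(pages: List[str]) -> List[List[float]]:
    """Embed all pages concurrently by running the blocking encoder in worker threads."""
    return await asyncio.gather(*(asyncio.to_thread(toy_embed, p) for p in pages))


def search(query: str, index: List[Tuple[int, List[float]]], top_k: int = 3) -> List[Tuple[int, float]]:
    """Rank stored vectors by cosine similarity (dot product of unit vectors)."""
    q = toy_embed(query)
    scored = [(page_no, sum(a * b for a, b in zip(q, vec))) for page_no, vec in index]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


pages = [
    "Milvus stores high-dimensional vectors for similarity search.",
    "The viewer highlights matching passages on each page.",
    "Chunks overlap so that context is preserved across boundaries.",
]
vectors = asyncio.run(embed_pages(pages))
index = list(enumerate(vectors, start=1))  # (page number, embedding) pairs

for page_no, score in search("vector similarity search", index, top_k=2):
    print(page_no, round(score, 3))
```

Because the stored vectors are unit length, the dot product equals cosine similarity, which is one of the metrics a vector database such as Milvus can compute at scale. `asyncio.to_thread` requires Python 3.9 or newer.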
- Backend.api Module: API built with FastAPI
- Backend.core Module: PDF ingestion; the core of the backend, which ties every other module together
- Backend.db Module: Milvus integration for high-performance vector storage and similarity search operations
- Backend.embeddings Module: Embedding creation and chunking algorithms
- Frontend/index.html: Interactive PDF viewer with highlighting and search result visualization
🛠️ Pro tip: You can put your own file in test_files for custom unit testing.
- SentenceTransformers - Neural embedding models
- Milvus - Vector database engine
- FastAPI - Python web framework for the API
- Asyncio - Asynchronous processing
- Transformers - Hugging Face neural networks
DocVision-AI includes extensive unit testing for all modules. Run the complete test suite with:
    python -m module_name.unit_test

Replace module_name with a valid module name. All embedding generation and vector database operations are covered by these tests, which also include performance benchmarks and accuracy metrics.
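The project's own test files are not reproduced here. As a rough sketch of what a `unit_test` module runnable with `python -m` might look like, assuming the standard library `unittest` framework and a hypothetical `l2_normalize` helper that is not taken from the codebase:

```python
import math
import unittest
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Hypothetical helper: scale a vector to unit length (zero vectors pass through)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


class NormalizeTests(unittest.TestCase):
    def test_unit_length(self):
        out = l2_normalize([3.0, 4.0])
        self.assertAlmostEqual(out[0], 0.6)
        self.assertAlmostEqual(out[1], 0.8)

    def test_zero_vector_is_unchanged(self):
        self.assertEqual(l2_normalize([0.0, 0.0]), [0.0, 0.0])


if __name__ == "__main__":
    # argv is pinned so stray command-line arguments are not parsed as test names
    unittest.main(argv=["unit_test"], exit=False)
```

Saved as `unit_test.py` inside a package, a file like this would be picked up by the `python -m` command shown above.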
The file Main.js is not covered by this license; it was generated by Claude (I don't like coding in .js).
This project is licensed under CC BY-NC 4.0. That means:
- ✅ Free for non-commercial use
- ✅ Attribution required
- ❌ Commercial use is not allowed
Author: Mikołaj Jaros (aka MixDevv)
- SentenceTransformers - Multilingual neural embedding models
- Milvus - High-performance vector database
- Hugging Face - Transformer neural networks
- Python Software Foundation
- FastAPI - Fast API development