Intelligent PDF Document Understanding System with semantic document search using multilingual neural embeddings and vector database technology. Search any PDF in any language.

Mielone2Good/DocVision-AI


DocVision-AI - Intelligent PDF Document Understanding System

Advanced semantic document search using multilingual neural embeddings and vector database technology.
Search any PDF in any language.
Explore the code »

View Demo · Report Bug · Request Feature


📌 About The Project

DocVision-AI is a document intelligence system that uses neural semantic embeddings and a vector database to enable contextual, cross-lingual document search. It splits PDF documents into overlapping chunks and encodes them as high-dimensional vectors with a transformer-based model, so search matches on meaning rather than on exact keywords.

💡Final Goal: You can search any PDF in any language

e.g. search a Polish document with a query written in English


Version 1.0

🧠 Multilingual semantic embeddings using paraphrase-multilingual-MiniLM-L12-v2 from SentenceTransformers
Asynchronous processing with multi-threaded page embedding generation
🔄 Advanced chunking with configurable overlap for context preservation
🗄️ Vector database integration with Milvus for high-performance similarity search
🔍 Cross-lingual semantic search - query in any language, find relevant content regardless of document language
🎯 Context-aware retrieval with document highlighting and page-level precision
📊 Real-time performance - average search latency ~0.85 seconds on RTX 3060 (8-page documents)
🧪 Comprehensive unit test coverage for all neural processing pipelines
📄 Interactive PDF viewer with semantic highlighting and result visualization
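The multi-threaded page embedding mentioned in the feature list could look roughly like the sketch below. It is not the project's code: `embed_page` is a hypothetical stand-in that derives a deterministic pseudo-vector from a hash so the example runs without a model, and the thread pool size is an assumed parameter. In the real system the call would go to the SentenceTransformers model named above.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def embed_page(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for a real embedding model call.

    Builds a deterministic pseudo-vector from a SHA-256 digest so the
    sketch runs without downloading a model. It carries no semantics.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def embed_pages(pages: list[str], max_workers: int = 4) -> list[list[float]]:
    """Embed every page concurrently, one task per page."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so vector i belongs to page i.
        return list(pool.map(embed_page, pages))


pages = ["Strona pierwsza", "Page two", "Seite drei"]
vectors = embed_pages(pages)
print(len(vectors), len(vectors[0]))  # prints "3 8"
```

Because `pool.map` keeps results in input order, each vector can be stored alongside its page number without extra bookkeeping.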


🔍 How It Works

  1. Document ingestion - PDF upload and page-level content extraction
  2. Asynchronous text processing - Multi-threaded chunking with overlap preservation
  3. Neural embedding generation - Transformer-based semantic vector encoding using multilingual models
  4. Vector database storage - High-dimensional embeddings indexed in Milvus vector database
  5. Semantic query processing - Real-time query embedding and similarity matching
  6. Contextual result retrieval - Chunk-level text extraction with document highlighting
  7. Interactive visualization - PDF viewer with highlighted semantic matches
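Step 2 above, chunking with overlap, can be illustrated with a minimal character-based sketch. The function name `chunk_text`, the character-level splitting, and the default sizes are assumptions made for illustration; the project may chunk by sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters.

    Hypothetical helper: sizes are measured in characters for simplicity.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # otherwise the tail would be emitted again as a redundant chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last `overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.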

⚙️ Architecture

  • Backend.api Module:
    API built with FastAPI

  • Backend.core Module:
    PDF ingestion; the core of the backend, which ties the other modules together

  • Backend.db Module:
    Milvus integration for high-performance vector storage and similarity search operations

  • Backend.embeddings Module:
    Embedding generation and chunking algorithms

  • Frontend/index.html:
    Interactive PDF viewer with highlighting and search result visualization
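To make the similarity search handled by the Backend.db module more concrete, here is a brute-force cosine similarity search in plain Python. This is a deliberately simplified substitute for Milvus, not its API: a vector database answers the same kind of question with dedicated indexes. The chunk IDs and toy two-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: hypothetical chunk IDs mapped to hand-written 2-D vectors.
index = {
    "page1-chunk0": [1.0, 0.0],
    "page1-chunk1": [0.0, 1.0],
    "page2-chunk0": [0.7, 0.7],
}
results = top_k([1.0, 0.1], index, k=2)
print([chunk_id for chunk_id, _ in results])
# prints ['page1-chunk0', 'page2-chunk0']
```

Storing a page and chunk identifier next to each vector is what allows a hit to be mapped back to a highlighted passage in the viewer.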

🛠️ Pro tip: You can put your own file in test_files for custom unit testing.


🚀 Built With


🧪 Unit Tests

DocVision-AI includes extensive unit testing for all modules. Run the complete test suite with:

python -m module_name.unit_test

Replace module_name with the name of the module you want to test. Embedding generation and vector database operations are all covered by the tests, which also include performance benchmarks and accuracy metrics.

Milvus must be installed and running in the background before you run the tests!


🚫 License

The file Main.js is not covered by this license; it was generated by Claude (I don't like coding in JavaScript).

This project is licensed under CC BY-NC 4.0. That means:

  • ✅ Free for non-commercial use
  • ✅ Attribution required
  • ❌ Commercial use is not allowed

Author: Mikołaj Jaros (aka MixDevv) LinkedIn


🙏 Acknowledgments


📷 Screenshots

Main Interface

interface

Full App Video

0807.mp4

Search Test - Multilingual Query Processing (How to do a factory reset, en/pl)

pl_eng_test1

Search Test - Multilingual Query Processing (What username should I use for login, en/pl)

pl_eng_test2

Uploading Menu

MENU

Document processing unit_test (8 pages)

image
