manyasharma1008/Rag-pipeline-pdf

Rag-pipeline-pdf

Rag-pipeline-pdf is a Python project that implements a Retrieval-Augmented Generation (RAG) pipeline for PDF documents. It retrieves the passages most relevant to a question and passes them to a generative AI model, so users can query large collections of PDFs and get context-aware answers grounded in the source text.

Key Features

  • PDF Text Extraction: Automatically extracts and preprocesses text from PDF documents.
  • Vector Database Storage: Uses ChromaDB to store embeddings of the extracted text for fast and efficient retrieval.
  • RAG Pipeline: Integrates retrieval with AI models to provide context-aware answers from the documents.
  • Interactive Notebooks: Jupyter notebooks included for testing, exploring, and demonstrating the pipeline.
  • Scalable & Modular: Can be extended to larger datasets or integrated with other AI applications.
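
To make the retrieval step listed above concrete, here is a minimal sketch of chunking text and ranking chunks against a query. It is not the repository's code: it assumes a toy bag-of-words "embedding" and an in-memory list in place of ChromaDB and a real embedding model, and it starts from text that has already been extracted from a PDF. The names `chunk_text`, `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (stand-in for a vector DB query)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    docs = [
        "ChromaDB stores embeddings for retrieval",
        "Jupyter notebooks demonstrate the pipeline",
        "PDF text is extracted and cleaned",
    ]
    print(retrieve("where are embeddings stored", docs, k=1))
```

In a real pipeline the `embed` function would presumably be replaced by a learned embedding model and `retrieve` by a query against the vector database; the overlap between chunks is a common way to avoid splitting an answer across a chunk boundary.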

What I Did

  • Built the end-to-end pipeline from PDF ingestion to AI query.
  • Implemented vectorization and storage for quick retrieval.
  • Enabled AI-assisted querying to fetch precise answers from PDFs.
  • Organized the project for easy experimentation and extension.
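
The AI-assisted querying step can be illustrated by how retrieved chunks are typically combined with the user's question before being sent to a model. The sketch below is an assumption about one reasonable prompt layout, not the template this project uses; `build_prompt` and its wording are hypothetical, and the call to the model itself is left out.

```python
def build_prompt(question, contexts):
    """Assemble a grounded prompt from retrieved chunks (hypothetical template)."""
    # Number each chunk so the model's answer can refer back to its sources.
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question.strip()}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt(
        "What stores the embeddings?",
        ["ChromaDB stores embeddings.", "Notebooks are included."],
    ))
```

Instructing the model to rely only on the supplied context is a common way to keep answers tied to the PDFs rather than to the model's general knowledge.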

This project serves as a base for building intelligent document search systems and can be extended to handle multiple document types or integrated into web applications.
