This project deploys a private Retrieval-Augmented Generation (RAG) API using LLaMA 3.2 and vLLM.
✅ Serverless (scale to zero) ✅ Private API ✅ Your own infrastructure ✅ Multi-GPU support
- Clone this repository:

  ```bash
  git clone <your-repo-url>
  cd <your-repo-directory>
  ```

- Install required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure these modules are in your project directory:
  - ingestion.py
  - retriever.py
  - prompt_template.py
- Download LLaMA model weights from [appropriate source].
- Place weights in [appropriate directory].
- Update `model_name` in `rag.py` if necessary (see the sketch after this list).
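If you point the server at different weights, `model_name` is typically just the string handed to vLLM's `LLM` constructor. Below is a minimal sketch of what that might look like inside `rag.py` — the model ID and surrounding code are illustrative, not the project's exact source:

```python
# Illustrative only -- rag.py's actual structure may differ.
from vllm import LLM, SamplingParams

# Either a Hugging Face model ID or a local path to the downloaded weights.
model_name = "meta-llama/Llama-3.2-3B-Instruct"

llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Quick smoke test that the weights load and generate.
outputs = llm.generate(["What is retrieval-augmented generation?"], sampling_params)
print(outputs[0].outputs[0].text)
```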
- Add documents to chat with in the `./docs` folder.
- Start the server (a minimal LitServe sketch follows these steps):

  ```bash
  python server.py
  ```
- Use the API:

  ```bash
  python client.py --query "Your question here"
  ```
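For orientation, a LitServe server is a small `LitAPI` subclass plus a `LitServer`. The skeleton below is a hedged sketch, not this project's actual `server.py`; the `query` request key and the placeholder method bodies are assumptions:

```python
# Hedged skeleton of a LitServe app; the real server.py wires in the
# vLLM model and the Qdrant-backed retriever where the placeholders are.
import litserve as ls

class RAGAPI(ls.LitAPI):
    def setup(self, device):
        # Load the vLLM model and retriever once per worker here.
        self.model = None  # placeholder

    def decode_request(self, request):
        # Assumed request shape: {"query": "..."}
        return request["query"]

    def predict(self, query):
        # Real code would retrieve context, build the prompt, and generate.
        return f"echo: {query}"  # placeholder

    def encode_response(self, output):
        return {"output": output}

if __name__ == "__main__":
    server = ls.LitServer(RAGAPI(), accelerator="auto")
    server.run(port=8000)
```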
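Under the hood, `client.py` presumably just POSTs JSON to LitServe's default `/predict` route, so you can also call the API directly. The payload shape here mirrors the assumed `decode_request` above:

```python
# Direct HTTP call; the {"query": ...} body is an assumption that must
# match whatever server.py's decode_request expects.
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={"query": "Your question here"},
    timeout=60,
)
print(resp.json())
```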
- Expose the server to the internet (authentication optional)
- Enable "auto start" for serverless operation
- Optimize performance with LitServe features (batching, multi-GPU, etc.)
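Batching and multi-GPU serving are constructor arguments on `LitServer`. A self-contained toy example follows — the values are illustrative, not recommendations:

```python
# Toy example of LitServe tuning knobs; EchoAPI stands in for the real API.
import litserve as ls

class EchoAPI(ls.LitAPI):
    def setup(self, device):
        pass

    def predict(self, inputs):
        # With max_batch_size > 1 and no batch()/unbatch() hooks,
        # LitServe passes a list of decoded inputs to predict().
        return [f"echo: {x}" for x in inputs]

if __name__ == "__main__":
    server = ls.LitServer(
        EchoAPI(),
        accelerator="gpu",
        devices=2,           # replicate across two GPUs
        max_batch_size=8,    # serve up to 8 requests per batch
        batch_timeout=0.05,  # wait up to 50 ms to fill a batch
    )
    server.run(port=8000)
```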
This project utilizes:
- RAG (Retrieval-Augmented Generation)
- vLLM for efficient LLM serving
- Vector database (self-hosted Qdrant; see the lookup sketch after this list)
- LitServe for scalable inference
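As a taste of what `retriever.py` does against the self-hosted Qdrant instance, a lookup is roughly the following — the URL, collection name, and vector size are made up for the example:

```python
# Illustrative Qdrant query; the real retriever.py owns collection
# setup, embeddings, and result handling.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Stand-in for a real query embedding; the dimension must match
# the collection's vector size.
query_vector = [0.0] * 384

hits = client.search(
    collection_name="docs",
    query_vector=query_vector,
    limit=3,
)
for hit in hits:
    print(hit.score, hit.payload)
```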
For more details on these components, refer to the full documentation.