A powerful, modular RAG system that processes documents from URLs and provides intelligent question-answering capabilities using GPU-accelerated Llama 3.3-70B-Instruct via vLLM.

Live API Endpoint: `https://4145182fdfba.ngrok-free.app/api/v1/hackrx/run`

```bash
# Test the live endpoint
curl -X POST "https://4145182fdfba.ngrok-free.app/api/v1/hackrx/run" \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": "https://example.com/document.pdf",
    "questions": ["What is this document about?"]
  }'
```
Docker-based setup:

```bash
# Clone the repository and run the automated setup
git clone <repository-url>
cd Bajaj-Hackrx
./setup.sh
```

```bash
# 1. Environment setup
cp env.example .env
nano .env  # Add your HuggingFace token

# 2. Start services
docker-compose up -d

# 3. Monitor startup
docker-compose logs -f vllm-server
```

Local setup instructions:

```bash
# Clone and setup
git clone <repository-url>
cd Bajaj-Hackrx
python -m venv venv

# Activate environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Configure and run
cp env.example .env
python main.py
```

All API requests require authentication using a Bearer token:

```
Authorization: Bearer 82e98b40bb2546d8eea6db9bed3c61ef6cafdf3b2a22c0d16edcf3f795e679cf
```

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check and system status |
| `/api/v1/hackrx/run` | POST | Process document and answer questions |
| `/api/v1/validate-file` | GET | Validate URL before processing |
| `/api/v1/documents` | GET | List processed documents |
| `/api/v1/llama/status` | GET | Check vLLM server status |
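
Before sending a large document for processing, `/api/v1/validate-file` can be used to check that a URL is reachable and supported. A minimal sketch, assuming a local deployment and that the endpoint takes the target URL as a `url` query parameter (the parameter name is an assumption, not confirmed by the endpoint table above):

```python
import requests

BASE_URL = "http://localhost:8000"   # assumed local deployment
TOKEN = "<your-token>"

# "url" as the query parameter name is an assumption for illustration only.
resp = requests.get(
    f"{BASE_URL}/api/v1/validate-file",
    params={"url": "https://example.com/document.pdf"},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
print(resp.status_code, resp.json())
```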

Example request to `/api/v1/hackrx/run`:

```bash
curl -X POST "http://localhost:8000/api/v1/hackrx/run" \
  -H "Authorization: Bearer <your-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "documents": "https://example.com/document.pdf",
    "questions": [
      "What is the main topic?",
      "What are the key findings?"
    ]
  }'
```

Example response:

```json
{
  "answers": [
    "The document discusses advanced AI techniques for document processing...",
    "Key findings include improved accuracy and reduced processing time..."
  ]
}
```
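
The same request can also be issued from Python. A minimal client sketch using `requests`, assuming a local deployment on port 8000 and the Bearer token from your `.env`:

```python
import requests

BASE_URL = "http://localhost:8000"   # assumed local deployment
API_KEY = "<your-token>"             # the API_KEY value from .env

payload = {
    "documents": "https://example.com/document.pdf",
    "questions": [
        "What is the main topic?",
        "What are the key findings?",
    ],
}

resp = requests.post(
    f"{BASE_URL}/api/v1/hackrx/run",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=300,  # document download + inference can take several seconds
)
resp.raise_for_status()

for question, answer in zip(payload["questions"], resp.json()["answers"]):
    print(f"Q: {question}\nA: {answer}\n")
```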

Configuration is controlled through environment variables in `.env`:

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (required) | HuggingFace token for model access |
| `API_KEY` | Generated key | Authentication key for RAG API |
| `LLAMA_API_URL` | `localhost:8001` | vLLM server endpoint |
| `USE_EXTERNAL_LLAMA` | `true` | Enable vLLM integration |
| `MAX_CONCURRENT_REQUESTS` | `10` | Parallel request limit |
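
Inside the application these variables are typically read once at startup. A rough sketch of how the documented variables could be loaded; the real configuration code lives in `src/core` and may be structured differently:

```python
import os

# Illustrative mirror of the documented variables; the actual settings
# objects in src/core are not shown in this README.
HF_TOKEN = os.environ["HF_TOKEN"]                                    # required, no default
API_KEY = os.getenv("API_KEY", "")                                   # generated key if unset
LLAMA_API_URL = os.getenv("LLAMA_API_URL", "http://localhost:8001")  # vLLM endpoint
USE_EXTERNAL_LLAMA = os.getenv("USE_EXTERNAL_LLAMA", "true").lower() == "true"
MAX_CONCURRENT_REQUESTS = int(os.getenv("MAX_CONCURRENT_REQUESTS", "10"))
```
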
| Service | Port | Purpose |
|---|---|---|
| RAG API | 8000 | Main application interface |
| vLLM Server | 8001 | Llama model inference |
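
After `docker-compose up -d`, the status endpoints from the table above can be polled to confirm both services are ready. A small sketch, assuming the default ports and the Bearer token from `.env`:

```python
import requests

HEADERS = {"Authorization": "Bearer <your-token>"}

# RAG API health check (port 8000)
health = requests.get("http://localhost:8000/", headers=HEADERS, timeout=10)
print("RAG API:", health.json())

# vLLM server status, reported through the RAG API
status = requests.get(
    "http://localhost:8000/api/v1/llama/status", headers=HEADERS, timeout=10
)
print("vLLM:", status.json())
```
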
Performance highlights:

- 2-10x faster inference vs. standard implementations
- 3-8 sec to process a PDF and answer questions (with GPU)
- 192 GB HBM3 memory on MI300X for optimal performance
- 2-3 min initial model loading time

Minimum requirements and recommended setup:

| GPU Model | Memory | Status |
|---|---|---|
| AMD MI210 | 64GB HBM2e | |
| AMD MI250X | 128GB HBM2e | Supported |
| AMD MI300A | 128GB HBM3 | Good |
| AMD MI300X | 192GB HBM3 | Optimal |

Common Issues & Solutions

- Won't start: Check AMD GPU memory and HuggingFace token
- Slow loading: Monitor with `docker-compose logs -f vllm-server`
- Out of memory: Reduce `tensor-parallel-size` in docker-compose.yml
- ROCm issues: Verify ROCm installation and GPU visibility
- Connection refused: Wait for health checks to pass
- API errors: Verify authentication token
- Network issues: Check service status with `docker-compose ps`

```bash
# Service management
docker-compose ps                             # Check status
docker-compose logs -f                        # View logs
docker-compose restart rag-api                # Restart service
docker-compose down && docker-compose up -d   # Clean restart

# System monitoring
rocm-smi                                      # AMD GPU usage
docker info | grep rocm                       # ROCm runtime check
```

Project structure:

```
src/
├── core/       # Configuration and models
├── api/        # FastAPI application
├── rag/        # RAG system components
├── document/   # Document processing
└── utils/      # Utility functions
```
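
As an illustration of how these modules might fit together, an entry point would wire the FastAPI app from `src/api` and serve it on the documented port 8000; the function and module names below are hypothetical, not the repository's actual identifiers:

```python
# Hypothetical wiring sketch; actual names in main.py and src/ may differ.
import uvicorn

from src.api import create_app   # assumed FastAPI app factory in src/api

app = create_app()

if __name__ == "__main__":
    # Serve the RAG API on the documented default port.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```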

To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Make your changes following the modular structure
- Test your changes thoroughly
- Submit a pull request
