A professional, production-ready document search service that demonstrates core information retrieval concepts using inverted indexes. Built with FastAPI and designed for learning modern search engine principles.
The project covers the following concepts:
- Inverted Index Data Structures - The foundation of search engines like Google
- Term Frequency Ranking - How documents are scored and ranked
- FastAPI REST API Design - Professional API development practices
- Data Persistence - Saving and loading search indexes
- Test-Driven Development - Comprehensive testing strategies
- Python 3.10 or higher
- pip (Python package installer)
- Clone the repository
  git clone https://github.com/mojitote/python-inverted-search-api.git
  cd python-inverted-search-api
- Create a virtual environment
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies
  pip install -r requirements.txt
- Run the API
  python -m app.main
- Access the API
  - API Documentation: http://localhost:8000/docs
  - Health Check: http://localhost:8000/health
  - Root Endpoint: http://localhost:8000/
Add documents to the search index:
curl -X POST "http://localhost:8000/upload" \
-H "Content-Type: application/json" \
-d '{
"doc_id": "python_intro",
"content": "Python is a high-level programming language known for its simplicity and readability.",
"title": "Python Introduction",
"author": "John Doe"
}'
Search for documents by keywords:
curl "http://localhost:8000/search?query=python&limit=5"
Response:
{
"query": "python",
"results": [
{
"doc_id": "python_intro",
"score": 0.125,
"title": "Python Introduction",
"author": "John Doe",
"snippet": "Python is a high-level programming language..."
}
],
"total_results": 1,
"search_time_ms": 2.45
}
Check the current state of the inverted index:
curl "http://localhost:8000/index"
Retrieve a document by its ID:
curl "http://localhost:8000/documents/python_intro"
Remove a document from the index:
curl -X DELETE "http://localhost:8000/documents/python_intro"
Run the tests:
pytest
pytest tests/test_index.py -v
pytest tests/test_api.py -v
pytest --cov=app --cov-report=html
| Component | File | Purpose |
|---|---|---|
| FastAPI App | app/main.py | REST API endpoints and request handling |
| Inverted Index | app/index.py | Core search logic and data structures |
| Storage Layer | app/storage.py | Persistence using pickle with backup |
| Data Models | app/models.py | Pydantic schemas for validation |
| Tests | tests/ | Comprehensive test suite |
- Document Upload: Text → Tokenization → Inverted Index Update
- Search Query: Query → Tokenization → Index Lookup → Ranking → Results
- Persistence: Index → Pickle Serialization → Disk Storage
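The persistence step in the flow above can be sketched as follows. This is a minimal illustration, not necessarily what app/storage.py does: the file name `index.pkl`, the numbered backup files, and the helper names `save_index` and `load_index` are all assumptions made for the example. The `backup_count` parameter plays the role of a retention limit for older copies.

```python
import pickle
import shutil
import tempfile
from pathlib import Path


def save_index(index: dict, data_dir: Path, backup_count: int = 5) -> Path:
    """Pickle `index` to data_dir/index.pkl, keeping up to `backup_count` older copies."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "index.pkl"  # hypothetical file name

    # Shift existing backups up by one: index.pkl.1 -> index.pkl.2, and so on.
    # The oldest backup is overwritten once the limit is reached.
    for n in range(backup_count - 1, 0, -1):
        older = target.with_name(f"{target.name}.{n}")
        if older.exists():
            older.replace(target.with_name(f"{target.name}.{n + 1}"))
    if backup_count > 0 and target.exists():
        shutil.copy2(target, target.with_name(f"{target.name}.1"))

    # Write to a temporary file, then rename it over the target, so an
    # interrupted write does not leave a truncated index.pkl behind.
    tmp = target.with_suffix(".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(index, fh, protocol=pickle.HIGHEST_PROTOCOL)
    tmp.replace(target)
    return target


def load_index(data_dir: Path) -> dict:
    """Return the stored index, or an empty dict if nothing was saved yet."""
    target = data_dir / "index.pkl"
    if not target.exists():
        return {}
    with target.open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        data_dir = Path(tmp_dir) / "data"
        save_index({"python": {"doc1": 3, "doc3": 1}}, data_dir)
        print(load_index(data_dir))
```

Note that pickle can execute arbitrary code when loading, so files like this should only be read from a trusted location.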
{
"python": {
"doc1": 3, // Term appears 3 times in doc1
"doc3": 1 // Term appears 1 time in doc3
},
"programming": {
"doc1": 2,
"doc2": 1
}
}
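A structure like the one above can be built and queried in a few lines of Python. The sketch below is a simplified stand-in, not the project's `InvertedIndex`: the class name `TinyInvertedIndex`, the regex tokenizer, and the scoring rule (sum of raw term counts) are assumptions for illustration. The `score` field in the sample API response appears to be normalized in some way, which this sketch does not try to reproduce.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyInvertedIndex:
    """Hypothetical minimal index mapping term -> {doc_id: term count}."""

    def __init__(self) -> None:
        self.postings: dict[str, dict[str, int]] = defaultdict(dict)

    def add(self, doc_id: str, content: str) -> None:
        # Count each term once per document and record it in the postings.
        for term, count in Counter(tokenize(content)).items():
            self.postings[term][doc_id] = count

    def remove(self, doc_id: str) -> None:
        # Drop the document from every posting list; delete terms left empty.
        for term in list(self.postings):
            self.postings[term].pop(doc_id, None)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query: str, limit: int = 10) -> list[tuple[str, int]]:
        # Score = sum of the counts of each query term in the document.
        scores: Counter[str] = Counter()
        for term in tokenize(query):
            for doc_id, count in self.postings.get(term, {}).items():
                scores[doc_id] += count
        # Highest score first; ties are broken by doc_id for stable output.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:limit]


index = TinyInvertedIndex()
index.add("doc1", "Python programming: python tips, Python programming tricks")
index.add("doc2", "Functional programming basics")
index.add("doc3", "Why python")

print(index.postings["python"])            # {'doc1': 3, 'doc3': 1}
print(index.search("python programming"))  # [('doc1', 5), ('doc2', 1), ('doc3', 1)]
```

Storing counts per document keeps lookups cheap: a query only touches the posting lists of its own terms, never the full document set.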
- Indexing Speed: ~1000 documents/second (varies by content)
- Search Speed: <10ms for typical queries
- Memory Usage: ~2MB per 1000 documents
- Storage: ~1MB per 1000 documents (pickle format)
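These figures depend on hardware and document content, so it can be useful to measure on your own machine. The sketch below times a stand-in dictionary index on synthetic documents; it does not exercise the project's `InvertedIndex` class or the HTTP layer, so treat its output as a rough baseline only. The helper names `make_documents` and `measure` and the `term{i}` vocabulary are made up for illustration.

```python
import random
import time
from collections import Counter, defaultdict


def make_documents(n_docs: int, words_per_doc: int = 50, seed: int = 0) -> dict[str, str]:
    """Generate reproducible synthetic documents from a small made-up vocabulary."""
    rng = random.Random(seed)
    vocabulary = [f"term{i}" for i in range(500)]
    return {
        f"doc{i}": " ".join(rng.choices(vocabulary, k=words_per_doc))
        for i in range(n_docs)
    }


def measure(n_docs: int = 1000) -> dict[str, float]:
    """Time indexing of `n_docs` documents and a single one-term lookup."""
    docs = make_documents(n_docs)
    postings: dict[str, dict[str, int]] = defaultdict(dict)

    # Indexing: count terms per document and fill the posting lists.
    start = time.perf_counter()
    for doc_id, content in docs.items():
        for term, count in Counter(content.split()).items():
            postings[term][doc_id] = count
    index_seconds = max(time.perf_counter() - start, 1e-9)

    # Search: look up one term and rank its documents by term count.
    start = time.perf_counter()
    top = sorted(postings.get("term42", {}).items(), key=lambda kv: -kv[1])[:10]
    search_ms = (time.perf_counter() - start) * 1000

    return {
        "docs_per_second": n_docs / index_seconds,
        "search_ms": search_ms,
        "hits": len(top),
    }


if __name__ == "__main__":
    for name, value in measure().items():
        print(f"{name}: {value:.2f}")
```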
Create a .env file for custom configuration:
# API Configuration
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=INFO
# Storage Configuration
DATA_DIR=data
BACKUP_COUNT=5
# Search Configuration
DEFAULT_SEARCH_LIMIT=10
MAX_SEARCH_LIMIT=100
Modify app/main.py for custom behavior:
# Change default search limit
@app.get("/search")
async def search_documents(
query: str = Query(..., min_length=1, max_length=200),
limit: int = Query(20, ge=1, le=100), # Changed from 10 to 20
index_instance: InvertedIndex = Depends(get_index)
):
    ...  # handler body unchanged
Run locally with auto-reload:
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
Dockerfile:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Render:
- Connect your GitHub repository
- Set build command: pip install -r requirements.txt
- Set start command: uvicorn app.main:app --host 0.0.0.0 --port $PORT
Fly.io:
fly launch
fly deploy
To contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Install development dependencies
pip install -r requirements.txt
pip install pytest-cov black flake8
# Format code
black app/ tests/
# Lint code
flake8 app/ tests/
# Run tests with coverage
pytest --cov=app --cov-report=html
This project is licensed under the MIT License. See the LICENSE file for details.
- FastAPI for the excellent web framework
- Pydantic for data validation
- pytest for the testing framework
- The information retrieval community for foundational concepts
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: yuanxinghao2005@gmail.com
Happy Learning! 🎓
This project is designed to help you understand how search engines work under the hood. Start with the basic concepts and gradually explore the more advanced features.