🔍 Inverted Index Search API

A professional, production-ready document search service that demonstrates core information retrieval concepts using inverted indexes. Built with FastAPI and designed for learning modern search engine principles.

🎯 What You'll Learn

This project demonstrates the core concepts behind modern search engines:

- Inverted Index Data Structures: the foundation of search engines like Google
- Term Frequency Ranking: how documents are scored and ranked
- FastAPI REST API Design: professional API development practices
- Data Persistence: saving and loading search indexes
- Test-Driven Development: comprehensive testing strategies

🚀 Quick Start

Prerequisites

Python 3.10 or higher
pip (Python package installer)

Installation

Clone the repository

git clone https://github.com/mojitote/python-inverted-search-api.git
cd python-inverted-search-api

Create a virtual environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies
```
pip install -r requirements.txt
```
Run the API
```
python -m app.main
```
Access the API
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
- Root Endpoint: http://localhost:8000/

📚 API Usage

1. Upload Documents

Add documents to the search index:

curl -X POST "http://localhost:8000/upload" \
  -H "Content-Type: application/json" \
  -d '{
    "doc_id": "python_intro",
    "content": "Python is a high-level programming language known for its simplicity and readability.",
    "title": "Python Introduction",
    "author": "John Doe"
  }'

2. Search Documents

Search for documents by keywords:

curl "http://localhost:8000/search?query=python&limit=5"

Response:

{
  "query": "python",
  "results": [
    {
      "doc_id": "python_intro",
      "score": 0.125,
      "title": "Python Introduction",
      "author": "John Doe",
      "snippet": "Python is a high-level programming language..."
    }
  ],
  "total_results": 1,
  "search_time_ms": 2.45
}

3. View Index Statistics

Check the current state of the inverted index:

curl "http://localhost:8000/index"

4. Get Specific Document

Retrieve a document by its ID:

curl "http://localhost:8000/documents/python_intro"

5. Delete Document

Remove a document from the index:

curl -X DELETE "http://localhost:8000/documents/python_intro"
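
The same calls can be made from Python. The following is a minimal client sketch using only the standard library. It assumes the API is running on `localhost:8000`, that `title` and `author` are optional on upload, and that every endpoint returns a JSON body; the helper names (`upload_request`, `search_request`, `document_request`, `send`) are illustrative and not part of this project.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: API running locally


def upload_request(doc_id, content, title=None, author=None):
    """Build the POST /upload request with the JSON fields shown above."""
    payload = {"doc_id": doc_id, "content": content, "title": title, "author": author}
    # Assumption: title and author are optional, so omit them when not given.
    body = {key: value for key, value in payload.items() if value is not None}
    return Request(
        f"{BASE_URL}/upload",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_request(query, limit=5):
    """Build the GET /search request; urlencode escapes the query string."""
    return Request(f"{BASE_URL}/search?{urlencode({'query': query, 'limit': limit})}")


def document_request(doc_id, delete=False):
    """Build a GET or DELETE request for /documents/{doc_id}."""
    method = "DELETE" if delete else "GET"
    return Request(f"{BASE_URL}/documents/{quote(doc_id, safe='')}", method=method)


def send(request):
    """Send a request and decode the response body (assumed to be JSON)."""
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(search_request("python"))` should return a dictionary shaped like the search response shown earlier.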

🧪 Running Tests

Run All Tests

pytest

Run Unit Tests Only

pytest tests/test_index.py -v

Run Integration Tests Only

pytest tests/test_api.py -v

Run with Coverage

pytest --cov=app --cov-report=html

🏗️ Architecture

Core Components

| Component | File | Purpose |
| --- | --- | --- |
| FastAPI App | `app/main.py` | REST API endpoints and request handling |
| Inverted Index | `app/index.py` | Core search logic and data structures |
| Storage Layer | `app/storage.py` | Persistence using pickle with backup |
| Data Models | `app/models.py` | Pydantic schemas for validation |
| Tests | `tests/` | Comprehensive test suite |
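
To make "persistence using pickle with backup" concrete, here is a rough sketch of what such a storage layer might look like. It is not the code in `app/storage.py`: the function names, the numbered backup scheme (`index.pkl.1`, `index.pkl.2`, ...) and the write-to-temp-then-rename step are all assumptions chosen for illustration.

```python
import os
import pickle
import shutil
import tempfile


def save_index(index, path, backup_count=5):
    """Pickle `index` to `path`, keeping up to `backup_count` older copies."""
    # Shift existing backups up by one: path.4 -> path.5, ..., path.1 -> path.2.
    for n in range(backup_count - 1, 0, -1):
        older = f"{path}.{n}"
        if os.path.exists(older):
            os.replace(older, f"{path}.{n + 1}")
    # The current file becomes the newest backup.
    if backup_count > 0 and os.path.exists(path):
        shutil.copy2(path, f"{path}.1")
    # Write to a temporary file, then rename it over the target, so a crash
    # in the middle of a write does not corrupt the existing index file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(index, tmp, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)


def load_index(path):
    """Load a pickled index, or return an empty one if nothing was saved."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)
```

Note that unpickling can execute arbitrary code, so a pickle-based store should only ever load files the service wrote itself.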

Data Flow

- Document Upload: Text → Tokenization → Inverted Index Update
- Search Query: Query → Tokenization → Index Lookup → Ranking → Results
- Persistence: Index → Pickle Serialization → Disk Storage
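
The upload and search flows can be sketched in a few lines of Python. This is an illustration only, not the code in `app/index.py`: the regex tokenizer and the scoring rule (sum of raw term counts across query terms) are assumptions. The `0.125` score in the sample search response suggests the real service normalizes term frequency in some way that this sketch does not attempt to reproduce.

```python
import re
from collections import Counter, defaultdict

index = defaultdict(dict)  # term -> {doc_id: term frequency}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def add_document(doc_id, content):
    """Upload flow: text -> tokenization -> inverted index update."""
    for term, count in Counter(tokenize(content)).items():
        index[term][doc_id] = count


def search(query, limit=10):
    """Search flow: query -> tokenization -> index lookup -> ranking."""
    scores = Counter()
    for term in tokenize(query):
        # .get() avoids creating empty entries for unknown terms.
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return scores.most_common(limit)  # highest total count first


add_document("doc1", "Python programming: Python tips for Python programming")
add_document("doc2", "Programming in Go")
add_document("doc3", "Why Python?")
print(search("python programming"))  # doc1 ranks first with a score of 5
```

Because lookups touch only the postings of the query terms, search cost depends on how many documents contain those terms rather than on the total size of the collection.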

Inverted Index Structure

{
  "python": {
    "doc1": 3,    // Term appears 3 times in doc1
    "doc3": 1     // Term appears 1 time in doc3
  },
  "programming": {
    "doc1": 2,
    "doc2": 1
  }
}
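
As a rough illustration of how this nested mapping supports the statistics and delete endpoints, the sketch below computes summary counts and removes one document's postings. The function names and the keys of the returned statistics are hypothetical; the actual `/index` response may report different fields.

```python
index = {
    "python": {"doc1": 3, "doc3": 1},
    "programming": {"doc1": 2, "doc2": 1},
}


def index_stats(index):
    """Count distinct terms and distinct documents in the index."""
    doc_ids = {doc_id for postings in index.values() for doc_id in postings}
    return {"total_terms": len(index), "total_documents": len(doc_ids)}


def remove_document(index, doc_id):
    """Delete a document's postings and drop terms left with no documents."""
    for term in list(index):  # iterate over a copy of the keys while mutating
        index[term].pop(doc_id, None)
        if not index[term]:
            del index[term]


print(index_stats(index))  # {'total_terms': 2, 'total_documents': 3}
remove_document(index, "doc1")
print(index)  # {'python': {'doc3': 1}, 'programming': {'doc2': 1}}
```

Deletion in this layout has to visit every term, which is why some implementations also keep a forward map from each document to its terms.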

📊 Performance

- Indexing Speed: ~1000 documents/second (varies by content)
- Search Speed: <10ms for typical queries
- Memory Usage: ~2MB per 1000 documents
- Storage: ~1MB per 1000 documents (pickle format)
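
Figures like these depend heavily on hardware and document content, so it is worth measuring on your own data. The sketch below times a simplified in-memory index built from synthetic documents. It does not exercise this project's code or the HTTP layer, and the helper names, vocabulary size and document length are arbitrary assumptions.

```python
import random
import time
from collections import Counter, defaultdict


def build_index(docs):
    """Build term -> {doc_id: count} from a {doc_id: text} mapping."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def measure(num_docs=1000, words_per_doc=50, seed=0):
    """Time indexing and a single one-term lookup on synthetic documents."""
    rng = random.Random(seed)
    vocab = [f"term{i}" for i in range(500)]
    docs = {
        f"doc{i}": " ".join(rng.choices(vocab, k=words_per_doc))
        for i in range(num_docs)
    }

    start = time.perf_counter()
    index = build_index(docs)
    index_seconds = time.perf_counter() - start

    start = time.perf_counter()
    postings = index.get("term42", {})
    hits = sorted(postings.items(), key=lambda item: item[1], reverse=True)[:10]
    search_ms = (time.perf_counter() - start) * 1000

    return {
        "docs_per_second": num_docs / index_seconds,
        "search_ms": search_ms,
        "hits": len(hits),
    }


print(measure())
```

For end-to-end numbers, the same timing approach can be wrapped around real requests to `/upload` and `/search`.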

🔧 Configuration

Environment Variables

Create a .env file for custom configuration:

# API Configuration
HOST=0.0.0.0
PORT=8000
LOG_LEVEL=INFO

# Storage Configuration
DATA_DIR=data
BACKUP_COUNT=5

# Search Configuration
DEFAULT_SEARCH_LIMIT=10
MAX_SEARCH_LIMIT=100
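
One way an application could consume these variables is sketched below. It treats the values shown above as defaults, which is an assumption, and the `load_settings` helper is hypothetical rather than part of this project. It also assumes the variables are already present in the process environment: Python does not read `.env` files on its own, so a loader such as python-dotenv or an `env_file` entry in docker-compose would be needed.

```python
import os


def load_settings(env=os.environ):
    """Read the variables listed above, using the listed values as defaults."""
    settings = {
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "8000")),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
        "data_dir": env.get("DATA_DIR", "data"),
        "backup_count": int(env.get("BACKUP_COUNT", "5")),
        "default_search_limit": int(env.get("DEFAULT_SEARCH_LIMIT", "10")),
        "max_search_limit": int(env.get("MAX_SEARCH_LIMIT", "100")),
    }
    # Fail early on inconsistent limits instead of at the first search request.
    if not 1 <= settings["default_search_limit"] <= settings["max_search_limit"]:
        raise ValueError("DEFAULT_SEARCH_LIMIT must be between 1 and MAX_SEARCH_LIMIT")
    return settings


print(load_settings({"PORT": "9000"})["port"])  # 9000
```

Passing a plain dictionary instead of `os.environ`, as in the last line, makes the function easy to exercise in tests.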

Custom Settings

Modify app/main.py for custom behavior:

# Change default search limit
@app.get("/search")
async def search_documents(
    query: str = Query(..., min_length=1, max_length=200),
    limit: int = Query(20, ge=1, le=100),  # Changed from 10 to 20
    index_instance: InvertedIndex = Depends(get_index)
):
    ...  # existing handler body stays unchanged

🚀 Deployment

Local Development

uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Production (Docker)

FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Cloud Deployment

Render:

Connect your GitHub repository
Set build command: pip install -r requirements.txt
Set start command: uvicorn app.main:app --host 0.0.0.0 --port $PORT

Fly.io:

fly launch
fly deploy

📖 Learning Resources

Core Concepts

FastAPI Resources

Search Engine Concepts

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Setup

# Install development dependencies
pip install -r requirements.txt
pip install pytest-cov black flake8

# Format code
black app/ tests/

# Lint code
flake8 app/ tests/

# Run tests with coverage
pytest --cov=app --cov-report=html

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

FastAPI for the excellent web framework
Pydantic for data validation
pytest for testing framework
The information retrieval community for foundational concepts

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: yuanxinghao2005@gmail.com

Happy Learning! 🎓

This project is designed to help you understand how search engines work under the hood. Start with the basic concepts and gradually explore the more advanced features.
