A production-ready Python FastAPI server for LLM chat interactions built on LangChain. It supports multiple providers (OpenAI, Anthropic, Google), streaming and batched responses, and intelligent in-memory caching of identical requests.
- Multi-Provider Support: OpenAI, Anthropic (Claude), Google (Gemini)
- OpenAI-Compatible API: Drop-in replacement for OpenAI API endpoints
- Streaming & Batched Responses: Real-time streaming or complete responses
- Smart Caching: In-memory cache with TTL for identical prompts/history
- LangChain Integration: Leverages LangChain for robust LLM interactions
- Type-Safe: Full Pydantic validation for requests and responses (see the model sketch after this list)
- Docker Ready: Containerized for easy cloud deployment
- Production-Ready: Health checks, CORS, proper error handling
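Request and response schemas are validated with Pydantic in app/models/chat.py. The following is a minimal, hypothetical sketch of what those models might look like, with field names inferred from the request/response examples further below; the real module may differ.

```python
# Hypothetical sketch of app/models/chat.py; field names inferred from the README examples.
from typing import Literal, Optional

from pydantic import BaseModel, Field


class ChatMessage(BaseModel):
    role: Literal["system", "user", "assistant"]
    content: str


class ChatRequest(BaseModel):
    messages: list[ChatMessage]
    provider: str = "openai"                      # "openai" | "anthropic" | "google"
    model: Optional[str] = None                   # falls back to DEFAULT_MODEL when omitted
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    max_tokens: Optional[int] = None
    stream: bool = False


class ChatResponse(BaseModel):
    message: ChatMessage                          # the assistant's reply
    cached: bool = False                          # True when served from the in-memory cache
```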
Project structure:

```
py-demo-api/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ __init__.py          # Package initialization
β”‚   β”œβ”€β”€ main.py              # FastAPI application entry point
β”‚   β”œβ”€β”€ config.py            # Configuration settings
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── routes.py        # API endpoints (chat, health, cache)
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── chat.py          # Request/response models
β”‚   β”œβ”€β”€ services/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── llm_service.py   # LLM provider integration
β”‚   └── utils/
β”‚       β”œβ”€β”€ __init__.py
β”‚       └── cache.py         # In-memory caching implementation
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ Dockerfile               # Docker image definition
β”œβ”€β”€ docker-compose.yml       # Docker Compose configuration
β”œβ”€β”€ .gitignore               # Git ignore rules
└── README.md                # This file
```
Prerequisites:

- Python 3.11+
- pip or poetry
- API keys for desired providers (OpenAI, Anthropic, or Google)
Local setup:

1. Clone and navigate to the project directory:

   ```bash
   cd py-demo-api
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate   # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables by creating a `.env` file in the root directory:

   ```
   # Required: At least one provider API key
   OPENAI_API_KEY=sk-your-openai-key-here
   ANTHROPIC_API_KEY=sk-ant-your-anthropic-key-here
   GOOGLE_API_KEY=your-google-key-here

   # Optional: Configuration
   DEFAULT_PROVIDER=openai
   DEFAULT_MODEL=gpt-4o-mini
   CACHE_MAX_SIZE=1000
   CACHE_TTL_SECONDS=3600
   ```

5. Run the server (see the main.py sketch after these steps):

   ```bash
   uvicorn app.main:app --reload
   ```

   Or using the Python module directly:

   ```bash
   python -m app.main
   ```

6. Access the API:

   - API Documentation: http://localhost:8000/docs
   - ReDoc: http://localhost:8000/redoc
   - Health Check: http://localhost:8000/api/health
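For `python -m app.main` to work, app/main.py presumably ends with a small programmatic uvicorn runner. A minimal sketch under that assumption; the `title` and hard-coded host/port here are illustrative, and the real module likely reads them from app/config.py.

```python
# Hypothetical tail of app/main.py: allow "python -m app.main" to start the server.
import uvicorn
from fastapi import FastAPI

app = FastAPI(title="py-demo-api")

if __name__ == "__main__":
    # In the real app, host/port would come from the HOST/PORT settings in app/config.py.
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)
```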
Run with Docker Compose:

1. Create a `.env` file with your API keys (see above).

2. Start the service:

   ```bash
   docker-compose up -d
   ```

3. View logs:

   ```bash
   docker-compose logs -f
   ```

4. Stop the service:

   ```bash
   docker-compose down
   ```
Or build and run the image manually:

1. Build the image:

   ```bash
   docker build -t py-demo-api .
   ```

2. Run the container:

   ```bash
   docker run -d \
     -p 8000:8000 \
     -e OPENAI_API_KEY=your-key \
     -e ANTHROPIC_API_KEY=your-key \
     --name py-demo-api \
     py-demo-api
   ```
Batched Response:

```bash
curl -X POST "http://localhost:8000/api/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "provider": "openai",
    "model": "gpt-3.5-turbo",
    "temperature": 0.7
  }'
```

Streaming Response:
```bash
curl -X POST "http://localhost:8000/api/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true,
    "provider": "openai"
  }'
```

Python example (using `requests`):

```python
import requests

# Batched response
response = requests.post(
    "http://localhost:8000/api/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is machine learning?"}
        ],
        "provider": "openai",
        "temperature": 0.7
    }
)
data = response.json()
print(data["message"]["content"])
print(f"Cached: {data['cached']}")

# Streaming response
response = requests.post(
    "http://localhost:8000/api/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Count to 10"}],
        "stream": True
    },
    stream=True
)
for line in response.iter_lines():
    if line:
        print(line.decode())
```

Supported models:

OpenAI:
- `gpt-4o` - Latest GPT-4 Optimized model β˜…
- `gpt-4o-mini` - Faster, cheaper GPT-4 (recommended for most use cases) β˜…
- `gpt-4-turbo` - Previous generation GPT-4 Turbo
- `gpt-4` - Standard GPT-4
- `gpt-3.5-turbo` - Legacy, cheaper option
Anthropic:
- `claude-3-5-sonnet-20241022` - Latest Claude 3.5 Sonnet β˜…
- `claude-3-5-sonnet` - Alias for latest 3.5 Sonnet
- `claude-3-opus-20240229` - Most capable Claude 3 model
- `claude-3-opus` - Alias for Claude 3 Opus
- `claude-3-sonnet-20240229` - Balanced Claude 3 model
- `claude-3-sonnet` - Alias for Claude 3 Sonnet
- `claude-3-haiku-20240307` - Fastest, most affordable Claude 3
- `claude-3-haiku` - Alias for Claude 3 Haiku
Google:
- `gemini-1.5-pro` - Most capable Gemini model β˜…
- `gemini-1.5-flash` - Faster, more affordable Gemini β˜…
- `gemini-pro` - Previous generation Gemini
- `gemini-pro-vision` - Previous generation with vision support
β˜… = Recommended models
The API implements intelligent in-memory caching:
- Cache Key: Hash of messages + model + provider + temperature + max_tokens
- TTL: Configurable (default: 1 hour)
- Max Size: Configurable (default: 1000 entries)
- Behavior: Identical requests return cached responses instantly
Clear the cache manually:

```bash
curl -X POST "http://localhost:8000/api/cache/clear"
```

Request flow (a code sketch follows this list):

- Client sends a chat request
- Cache lookup by request hash
- If cached β†’ return immediately
- If not β†’ LangChain β†’ LLM provider
- Cache the response for future use
- Return to the client
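A minimal sketch of how the cache key and lookup described above could be implemented with cachetools; the module and function names are illustrative, and the maxsize/ttl values mirror the CACHE_MAX_SIZE/CACHE_TTL_SECONDS defaults rather than the exact contents of app/utils/cache.py.

```python
# Hypothetical sketch of app/utils/cache.py: TTL-bounded in-memory cache keyed by request content.
import hashlib
import json

from cachetools import TTLCache

# Bounded cache: evicts old entries beyond maxsize and expires entries after the TTL.
_cache: TTLCache = TTLCache(maxsize=1000, ttl=3600)


def make_cache_key(messages, model, provider, temperature, max_tokens) -> str:
    """Hash every request parameter that can change the response (messages assumed to be dicts)."""
    payload = json.dumps(
        {
            "messages": messages,
            "model": model,
            "provider": provider,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def get_cached(key: str):
    """Return the cached response, or None if missing or expired."""
    return _cache.get(key)


def set_cached(key: str, response) -> None:
    """Store a response for future identical requests."""
    _cache[key] = response
```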
Core dependencies:

- FastAPI: Modern async web framework
- LangChain: LLM orchestration and provider abstraction (see the sketch after this list)
- Pydantic: Data validation and settings management
- cachetools: TTL-based in-memory caching
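The provider abstraction lives in app/services/llm_service.py. A minimal sketch of how such a service might select a LangChain chat model per provider and support both batched and streaming calls; the function names and structure here are assumptions, not the actual module contents.

```python
# Hypothetical sketch of app/services/llm_service.py: dispatch to a LangChain chat model per provider.
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import ChatOpenAI


def build_chat_model(provider: str, model: str, temperature: float = 0.7):
    """Return a LangChain chat model for the requested provider."""
    if provider == "openai":
        return ChatOpenAI(model=model, temperature=temperature)
    if provider == "anthropic":
        return ChatAnthropic(model=model, temperature=temperature)
    if provider == "google":
        return ChatGoogleGenerativeAI(model=model, temperature=temperature)
    raise ValueError(f"Unsupported provider: {provider}")


async def complete(provider: str, model: str, messages: list) -> str:
    """Batched completion: one call, full response text."""
    llm = build_chat_model(provider, model)
    result = await llm.ainvoke(messages)
    return result.content


async def stream(provider: str, model: str, messages: list):
    """Streaming completion: yield content chunks as the provider produces them."""
    llm = build_chat_model(provider, model)
    async for chunk in llm.astream(messages):
        if chunk.content:
            yield chunk.content
```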
All configuration is via environment variables (see `.env`):
| Variable | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | None |
| `ANTHROPIC_API_KEY` | Anthropic API key | None |
| `GOOGLE_API_KEY` | Google API key | None |
| `DEFAULT_PROVIDER` | Default LLM provider | openai |
| `DEFAULT_MODEL` | Default model name | gpt-4o-mini |
| `CACHE_MAX_SIZE` | Max cached responses | 1000 |
| `CACHE_TTL_SECONDS` | Cache entry lifetime (seconds) | 3600 |
| `HOST` | Server host | 0.0.0.0 |
| `PORT` | Server port | 8000 |
| `ENVIRONMENT` | Environment mode | development |
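These variables are typically loaded into a single settings object in app/config.py. A minimal sketch, assuming pydantic-settings; the actual field names and structure of config.py may differ.

```python
# Hypothetical sketch of app/config.py using pydantic-settings; fields mirror the table above.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    google_api_key: str | None = None

    default_provider: str = "openai"
    default_model: str = "gpt-4o-mini"

    cache_max_size: int = 1000
    cache_ttl_seconds: int = 3600

    host: str = "0.0.0.0"
    port: int = 8000
    environment: str = "development"


settings = Settings()
```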
Check the health endpoint:

```bash
curl http://localhost:8000/api/health
```

Expected response:

```json
{
  "status": "healthy",
  "cache_stats": {
    "size": 0,
    "max_size": 1000,
    "ttl_seconds": 3600
  }
}
```

This API is ready for cloud deployment on:
- AWS: ECS, EKS, or EC2 with Docker
- Google Cloud: Cloud Run, GKE, or Compute Engine
- Azure: Container Instances, AKS, or App Service
- DigitalOcean: App Platform or Droplets
- Railway, Render, Fly.io: Direct Docker deployment
Production checklist:

- Set production environment variables
- Configure CORS for your domain (see the sketch after this checklist)
- Set up monitoring and logging
- Enable HTTPS/TLS
- Implement rate limiting (if needed)
- Set appropriate cache sizes
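For the CORS item above, FastAPI's built-in CORSMiddleware is the usual mechanism. A minimal sketch; the allowed origin is a placeholder for your own frontend domain, and the exact wiring in app/main.py may differ.

```python
# Hypothetical production CORS setup; replace the origin with your actual frontend domain.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="py-demo-api")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.example.com"],  # avoid "*" in production
    allow_credentials=True,
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```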
MIT License - Feel free to use in your projects!
Contributions welcome! Please feel free to submit a Pull Request.
Need help? Check the API docs at /docs or open an issue on GitHub.