The Chatbot API is a standalone service that exposes a REST endpoint for natural-language chat. It runs a quantized Mistral-7B model locally via llama.cpp, which cuts inference cost and makes the model practical to run on a CPU, and it is served through FastAPI.
The API is stateless by default: replies are generated from the chat history you send in each request. You can use it as a standalone service by always including the full message history, or integrate it with a custom backend that provides multi-turn memory and persistence. It currently powers the chatbot feature at danlau.live, but it can also be deployed and used in other projects.
- FastAPI REST API
- Mistral-7B (quantized .gguf) running locally with llama.cpp
- Includes a Dockerfile for containerization with Docker + docker-compose
- Works standalone or behind another backend service
- Deployable behind NGINX/HTTPS (reverse proxy)
git clone https://github.com/your-username/chatbot_API.git
cd chatbot_API
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
This service requires a quantized Mistral-7B .gguf model, which is not included in this repo.
1. Create a models directory: `mkdir -p chatbot_API/models`
2. Download a quantized model at this link
This API was designed and tested with mistral-7b-instruct.Q4_K_M.gguf, but other quantized variants should work as well. If you choose a different quantized Mistral model, update the model filename wherever it appears in these setup instructions. A sketch of one way to download the model programmatically is shown below.
chatbot_API/models/mistral-7b-instruct.Q4_K_M.gguf
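If you prefer to script the download, the snippet below is a rough sketch using the huggingface_hub library. The repo_id and filename are assumptions, not values taken from this project, so point them at whichever quantized Mistral-7B GGUF build you want and rename the file so it matches the path above.

```python
# Hypothetical download helper. The repo_id and filename below are assumptions,
# not values taken from this project -- adjust them to the GGUF build you want,
# and rename the downloaded file to match the path shown above.
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",  # assumed source repo
    filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",   # assumed file name
    local_dir="chatbot_API/models",                    # matches the mkdir step above
)
print(f"Model saved to {model_path}")
```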
Running locally → use a host filesystem path (absolute path recommended) in the .env file
MODEL_PATH=/absolute/host/path/to/chatbot_API/models/mistral-7b-instruct.Q4_K_M.gguf
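For reference, a service like this typically loads MODEL_PATH from the .env file and initializes the model with llama-cpp-python along the lines of the sketch below. The actual code in this repo may differ, and the parameter values are illustrative only.

```python
# Illustrative sketch only -- the real service code may differ.
import os
from dotenv import load_dotenv   # pip install python-dotenv
from llama_cpp import Llama      # pip install llama-cpp-python

load_dotenv()  # reads MODEL_PATH from the .env file into the environment

llm = Llama(
    model_path=os.environ["MODEL_PATH"],  # e.g. the absolute path shown above
    n_ctx=4096,                           # context window; tune to your hardware
    n_threads=os.cpu_count(),             # use all available CPU cores
)
```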
Use the run.py helper script (auto-reload enabled):
python run.py
- Server: http://localhost:8001
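If you're curious what the helper does, run.py is typically just a thin uvicorn wrapper along these lines; the "app.main:app" import string is an assumption, so check the repo for the real one.

```python
# Roughly what run.py does; the import string is an assumption.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",   # assumed location of the FastAPI app
        host="0.0.0.0",
        port=8001,        # matches the server address above
        reload=True,      # auto-reload on code changes
    )
```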
In production, you can run the API as a containerized service using Docker or integrate it into a larger deployment stack (docker-compose, Kubernetes, etc.).
Example with docker-compose:
# This is just an example of the setup in the docker-compose.yml file. Your setup may differ.
services:
  chatbot-api:
    build: ./chatbot_API
    volumes:
      - ./chatbot_API/models:/models
    environment:
      - MODEL_PATH=/models/mistral-7b-instruct.Q4_K_M.gguf
    restart: always
Note: Since production environments vary widely, this project does not include a full deployment configuration. You are encouraged to adapt the container build and runtime environment to your own needs (volume mounts, environment variables, reverse proxy settings, etc.).
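If you'd rather not use docker-compose, the equivalent plain Docker commands look roughly like this. The image name and exposed port are assumptions here (check the Dockerfile for the port the app actually listens on).

```bash
# Build the image and run it with the models directory mounted as a volume.
# Image name, container port, and MODEL_PATH are assumptions -- adjust to your setup.
docker build -t chatbot-api ./chatbot_API
docker run -d \
  -p 8001:8001 \
  -v "$(pwd)/chatbot_API/models:/models" \
  -e MODEL_PATH=/models/mistral-7b-instruct.Q4_K_M.gguf \
  --restart always \
  chatbot-api
```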
Sends a conversation history and returns the assistant's next response.
Request Body
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ]
}
Response
{
  "response": "Hi there! How can I help you today?"
}
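For a quick test from the command line, a request looks like this. The /chat path is an assumption (the exact route isn't shown above), so confirm it against the FastAPI routes or the interactive docs at http://localhost:8001/docs if they are enabled.

```bash
# The /chat path is an assumption -- confirm the actual route in the FastAPI app.
curl -X POST http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}]}'
```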
This project is designed to be modular and easy to adapt. You're encouraged to:
- Modify the system prompt or response formatting logic in `chat_service.py` to better fit your use case
- Integrate the API into a broader stack (e.g., add a database for chat history, connect to a frontend, or containerize it within your own ecosystem); the client sketch below shows one starting point
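As a starting point for integration, here is a minimal client sketch that keeps multi-turn memory on the caller's side, which is how a stateless API like this is typically used. The /chat route is an assumption; the messages/response fields follow the shapes shown above.

```python
# Minimal multi-turn client sketch. The API is stateless, so the caller keeps
# the full history and resends it every turn. The /chat route is an assumption.
import requests  # pip install requests

API_URL = "http://localhost:8001/chat"  # assumed route

history = [{"role": "system", "content": "You are a helpful assistant."}]

for user_text in ["Hello!", "What can you do?"]:
    history.append({"role": "user", "content": user_text})
    resp = requests.post(API_URL, json={"messages": history}, timeout=120)
    reply = resp.json()["response"]
    history.append({"role": "assistant", "content": reply})  # memory lives client-side
    print("assistant:", reply)
```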
If you encounter a bug, unexpected behavior, or have a suggestion:
- Please open an issue describing the problem
- Include any relevant error messages, sample inputs, or details about your setup
I would appreciate any feedback you have!