Chatbot API

The Chatbot API is a standalone service that exposes a REST endpoint for natural-language chat. It runs a quantized Mistral-7B model locally via llama.cpp, which greatly reduces inference cost and lets this large model run efficiently on a CPU, and it is served through FastAPI.

This API is stateless by default: replies are generated from the chat history you send in each request. You can therefore use it as a standalone service by always including the full message history, or integrate it with a custom backend that provides multi-turn memory and persistence. It currently powers the chatbot feature at danlau.live, but it can also be deployed and used in other projects.

Features

  • FastAPI REST API
  • Mistral-7B (quantized .gguf) running locally with llama.cpp
  • Includes a Dockerfile for containerization with Docker + docker-compose
  • Works standalone or behind another backend service
  • Deployable behind NGINX/HTTPS (reverse proxy)

🔧 Setup

1. Clone the repo

git clone https://github.com/your-username/chatbot_API.git
cd chatbot_API

2. Install dependencies (local dev)

python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt

3. Download and set up the model

This service requires a quantized Mistral-7B .gguf model, which is not included in this repo.

1. Create a models/ folder:

mkdir -p chatbot_API/models

2. Download a quantized model at this link

This API was designed and tested with mistral-7b-instruct.Q4_K_M.gguf, but other quantizations should work as well. If you choose a different quantized Mistral model, update the model filename wherever it appears in these setup instructions.

3. Place the model file in chatbot_API/models/:

chatbot_API/models/mistral-7b-instruct.Q4_K_M.gguf

4. Configure MODEL_PATH:

Running locally → use a host filesystem path (absolute path recommended) in the .env file

MODEL_PATH=/absolute/host/path/to/chatbot_API/models/mistral-7b-instruct.Q4_K_M.gguf
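
For reference, the service loads the model from MODEL_PATH at startup using llama.cpp's Python bindings. A minimal sketch of that loading pattern is shown below; the variable names and context size are illustrative, and the actual loading code in this repo may differ:

# Sketch: load the quantized model from MODEL_PATH (illustrative, not the repo's exact code)
import os
from llama_cpp import Llama

model_path = os.environ["MODEL_PATH"]           # set via .env locally or docker-compose in containers
llm = Llama(model_path=model_path, n_ctx=2048)  # n_ctx is an example context size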

Running

Option A - Local development

Use the run.py helper script (auto-reload enabled):

python run.py
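
A helper like run.py typically just starts the FastAPI app with uvicorn's auto-reload. A sketch of that kind of script is below; the module path app.main:app and the port are assumptions and may not match this repo exactly:

# Sketch of a run.py-style helper (module path and port are assumptions)
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)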

Option B — Production Use

In production, you can run the API as a containerized service using Docker or integrate it into a larger deployment stack (docker-compose, Kubernetes, etc.).

Example with docker-compose:

  # This is just an example of the setup in the docker-compose.yml file. Your setup may differ.
  services:
    chatbot-api:
      build: ./chatbot_API
      volumes:
        - ./chatbot_API/models:/models
      environment:
        - MODEL_PATH=/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf
      restart: always

Note: Since production environments vary widely, this project does not include a full deployment configuration. You are encouraged to adapt the container build and runtime environment to your own needs (volume mounts, environment variables, reverse proxy settings, etc.).


API Endpoints

POST /chat

Sends a conversation history and returns the assistant's next response.

Request Body

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ]
}

Response

{
  "response": "Hi there! How can I help you today?"
}
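
For example, you can call the endpoint from Python like this (the host and port are assumptions for a default local run; adjust them to match your deployment):

# Example client call to POST /chat (localhost:8000 is an assumed local address)
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
}
resp = requests.post("http://localhost:8000/chat", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])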

Tweaking the Project for Your Own Use

This project is designed to be modular and easy to adapt. You're encouraged to:

  • Modify the system prompt or response-formatting logic in chat_service.py to better fit your use case (see the sketch after this list)
  • Integrate the API into a broader stack (e.g., add a database for chat history, connect to a frontend, or containerize it within your own ecosystem)
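
As an illustration of the first kind of tweak, here is a sketch of how a default system prompt could be injected before messages are passed to the model. The function and constant names are made up for the example and do not reflect the actual contents of chat_service.py:

# Illustrative only: inject a default system prompt when the client doesn't send one
DEFAULT_SYSTEM_PROMPT = "You are the assistant for my personal site. Keep answers concise."

def with_default_system_prompt(messages: list[dict]) -> list[dict]:
    # Prepend a system message unless the conversation already starts with one
    if not messages or messages[0].get("role") != "system":
        return [{"role": "system", "content": DEFAULT_SYSTEM_PROMPT}, *messages]
    return messages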

Found a Bug or Issue?

If you encounter a bug, unexpected behavior, or have a suggestion:

  • Please open an issue describing the problem
  • Include any relevant error messages, sample inputs, or details about your setup

I would appreciate any feedback you have to give!
