A lightweight Go server that wraps llama-server (from llama.cpp) with dynamic model loading and an OpenAI-compatible API.
The problem: Running multiple LLM models locally requires either loading them all into memory (expensive) or manually restarting servers when switching models.
The solution: smart-llama keeps one model loaded at a time and automatically swaps models when you request a different one. Think of it as a simpler Ollama that preserves full llama-server configuration flexibility.
```
Client -> smart-llama (:8080) -> llama-server subprocess (:8081)
```
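The request flow can be sketched roughly as follows. This is a hypothetical illustration of the swap-on-request idea only, not the actual smart-llama source; the type names, hard-coded model, and paths are made up.

```go
// Illustrative sketch of swap-on-request; NOT the actual smart-llama code.
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os/exec"
	"sync"
)

type manager struct {
	mu      sync.Mutex
	current string    // model currently loaded
	proc    *exec.Cmd // running llama-server subprocess
}

// ensure swaps the llama-server subprocess so the requested model is loaded.
func (m *manager) ensure(model string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.current == model && m.proc != nil {
		return nil // already loaded, nothing to do
	}
	if m.proc != nil {
		_ = m.proc.Process.Kill() // the real server would shut down gracefully
		_ = m.proc.Wait()
	}
	cmd := exec.Command("llama-server",
		"--port", "8081", "--model", "/path/to/"+model+".gguf")
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start llama-server: %w", err)
	}
	m.proc, m.current = cmd, model
	// The real server would wait here until llama-server reports ready.
	return nil
}

func main() {
	backend, _ := url.Parse("http://127.0.0.1:8081")
	proxy := httputil.NewSingleHostReverseProxy(backend)
	m := &manager{}
	http.HandleFunc("/v1/chat/completions", func(w http.ResponseWriter, r *http.Request) {
		// The real server reads the "model" field from the JSON request body.
		if err := m.ensure("llama3-8b"); err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	_ = http.ListenAndServe(":8080", nil)
}
```

In the actual server, the model name comes from the request body and the subprocess arguments come from the per-model YAML files described below.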
Key features:
- Single model in memory at a time (automatic swap on request)
- Full control over `llama-server` arguments per model
- OpenAI-compatible API for easy integration
- YAML configuration (no CLI flags to remember)
Build or install llama-server from llama.cpp.
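If you are building llama-server yourself, the llama.cpp repository documents a CMake build along these lines (check the upstream README for current instructions):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# the server binary ends up at build/bin/llama-server
```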
Server configuration (config.yaml):
```yaml
server:
  listen_addr: ":8080"
  llama_server_port: 8081

paths:
  llama_server: "/usr/local/bin/llama-server"
  models_dir: "./models"
```

Model configuration (models/llama3-8b.yaml):
```yaml
name: llama3-8b
model_path: /path/to/llama-3-8b.gguf
args:
  - --ctx-size
  - "8192"
  - --n-gpu-layers
  - "99"
  - --flash-attn
  - "on"
```

The `args` list accepts any llama-server argument directly.
Start the server:

```bash
./smart-llama
```

You can specify a custom configuration file:
```bash
# Using CLI flag
./smart-llama --config /path/to/config.yaml

# Using environment variable
SL_CONFIG=/path/to/config.yaml ./smart-llama
```

Priority: `--config` flag > `SL_CONFIG` env var > `config.yaml` (default).
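A minimal sketch of that resolution order (illustrative only, not the actual smart-llama code):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	cfgFlag := flag.String("config", "", "path to the configuration file")
	flag.Parse()

	path := *cfgFlag // 1. the --config flag wins
	if path == "" {
		path = os.Getenv("SL_CONFIG") // 2. then the SL_CONFIG environment variable
	}
	if path == "" {
		path = "config.yaml" // 3. finally the default
	}
	fmt.Println("using config:", path)
}
```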
Example API requests:

```bash
# List available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Health check
curl http://localhost:8080/health
```
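Because the endpoints are OpenAI-compatible, any HTTP client works. A minimal Go equivalent of the chat completion call above (assumes the server is running on localhost:8080 with a model named `llama3-8b`):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body := []byte(`{
		"model": "llama3-8b",
		"messages": [{"role": "user", "content": "Hello!"}]
	}`)
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // OpenAI-style chat completion JSON
}
```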
Server configuration reference:

| Field | Description | Default |
|---|---|---|
| `server.listen_addr` | Address to listen on | `:8080` |
| `server.llama_server_port` | Port for the llama-server subprocess | `8081` |
| `paths.llama_server` | Path to the llama-server binary | `llama-server` |
| `paths.models_dir` | Directory containing model configs | `./models` |
| `timeouts.startup` | Model startup timeout (seconds) | `120` |
| `timeouts.shutdown` | Graceful shutdown timeout (seconds) | `10` |
| `timeouts.inactivity` | Auto-stop after inactivity (minutes, `0` to disable) | `5` |
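The timeout fields map onto a `timeouts` block in `config.yaml`. A sketch, assuming the nesting follows the dotted names above (the values shown are the documented defaults):

```yaml
timeouts:
  startup: 120    # model startup timeout, seconds
  shutdown: 10    # graceful shutdown timeout, seconds
  inactivity: 5   # auto-stop after inactivity, minutes (0 disables)
```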
Model configuration reference:

| Field | Description |
|---|---|
| `name` | Model identifier (used in API requests) |
| `model_path` | Absolute path to the `.gguf` file |
| `args` | List of llama-server arguments |
Build and test:

```bash
go build ./cmd/smart-llama/
go test ./...
```