A lightweight Go server that wraps llama-server (from llama.cpp) with dynamic model loading and an OpenAI-compatible API.
The problem: Running multiple LLM models locally requires either loading them all into memory (expensive) or manually restarting servers when switching models.
The solution: smart-llama keeps one model loaded at a time and automatically swaps models when you request a different one. Think of it as a simpler Ollama that preserves full llama-server configuration flexibility.
```
Client -> smart-llama (:8080) -> llama-server subprocess (:8081)
```
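The request flow can be sketched roughly as follows. This is a hypothetical illustration of the swap-on-request idea only, not the actual smart-llama source; the type names, hard-coded model, and paths are made up.

```go
// Illustrative sketch of swap-on-request; NOT the actual smart-llama code.
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os/exec"
	"sync"
)

type manager struct {
	mu      sync.Mutex
	current string    // model currently loaded
	proc    *exec.Cmd // running llama-server subprocess
}

// ensure swaps the llama-server subprocess so the requested model is loaded.
func (m *manager) ensure(model string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.current == model && m.proc != nil {
		return nil // already loaded, nothing to do
	}
	if m.proc != nil {
		_ = m.proc.Process.Kill() // the real server would shut down gracefully
		_ = m.proc.Wait()
	}
	cmd := exec.Command("llama-server",
		"--port", "8081", "--model", "/path/to/"+model+".gguf")
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start llama-server: %w", err)
	}
	m.proc, m.current = cmd, model
	// The real server would wait here until llama-server reports ready.
	return nil
}

func main() {
	backend, _ := url.Parse("http://127.0.0.1:8081")
	proxy := httputil.NewSingleHostReverseProxy(backend)
	m := &manager{}
	http.HandleFunc("/v1/chat/completions", func(w http.ResponseWriter, r *http.Request) {
		// The real server reads the "model" field from the JSON request body.
		if err := m.ensure("llama3-8b"); err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		proxy.ServeHTTP(w, r)
	})
	_ = http.ListenAndServe(":8080", nil)
}
```

In the actual server, the model name comes from the request body and the subprocess arguments come from the per-model YAML files described below.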
Key features:
- Single model in memory at a time (automatic swap on request)
- Full control over `llama-server` arguments per model
- OpenAI-compatible API for easy integration
- YAML configuration (no CLI flags to remember)
Build or install llama-server from llama.cpp.
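If you are building llama-server yourself, the llama.cpp repository documents a CMake build along these lines (check the upstream README for current instructions):

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# the server binary ends up at build/bin/llama-server
```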
Server configuration (config.yaml):
```yaml
server:
  listen_addr: ":8080"
  llama_server_port: 8081

paths:
  llama_server: "/usr/local/bin/llama-server"
  models_dir: "./models"
```

Model configuration (models/llama3-8b.yaml):
```yaml
name: llama3-8b
model_path: /path/to/llama-3-8b.gguf
args:
  - --ctx-size
  - "8192"
  - --n-gpu-layers
  - "99"
  - --flash-attn
  - "on"
```

The `args` list accepts any llama-server argument directly.
Start the server:

```bash
./smart-llama
```

You can specify a custom configuration file:
```bash
# Using CLI flag
./smart-llama --config /path/to/config.yaml

# Using environment variable
SL_CONFIG=/path/to/config.yaml ./smart-llama
```

Priority: `--config` flag > `SL_CONFIG` env var > `config.yaml` (default).
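A minimal sketch of that resolution order (illustrative only, not the actual smart-llama code):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	cfgFlag := flag.String("config", "", "path to the configuration file")
	flag.Parse()

	path := *cfgFlag // 1. the --config flag wins
	if path == "" {
		path = os.Getenv("SL_CONFIG") // 2. then the SL_CONFIG environment variable
	}
	if path == "" {
		path = "config.yaml" // 3. finally the default
	}
	fmt.Println("using config:", path)
}
```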
Example API requests:

```bash
# List available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Health check
curl http://localhost:8080/health
```
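Because the endpoints are OpenAI-compatible, any HTTP client works. A minimal Go equivalent of the chat completion call above (assumes the server is running on localhost:8080 with a model named `llama3-8b`):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body := []byte(`{
		"model": "llama3-8b",
		"messages": [{"role": "user", "content": "Hello!"}]
	}`)
	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // OpenAI-style chat completion JSON
}
```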
Server configuration reference:

| Field | Description | Default |
|---|---|---|
| `server.listen_addr` | Address to listen on | `:8080` |
| `server.llama_server_port` | Port for the llama-server subprocess | `8081` |
| `paths.llama_server` | Path to the llama-server binary | `llama-server` |
| `paths.models_dir` | Directory containing model configs | `./models` |
| `timeouts.startup` | Model startup timeout (seconds) | `120` |
| `timeouts.shutdown` | Graceful shutdown timeout (seconds) | `10` |
| `timeouts.inactivity` | Auto-stop after inactivity (minutes, `0` to disable) | `5` |
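The timeout fields map onto a `timeouts` block in `config.yaml`. A sketch, assuming the nesting follows the dotted names above (the values shown are the documented defaults):

```yaml
timeouts:
  startup: 120    # model startup timeout, seconds
  shutdown: 10    # graceful shutdown timeout, seconds
  inactivity: 5   # auto-stop after inactivity, minutes (0 disables)
```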
Model configuration reference:

| Field | Description |
|---|---|
| `name` | Model identifier (used in API requests) |
| `model_path` | Absolute path to the `.gguf` file |
| `args` | List of llama-server arguments |
Build and test:

```bash
go build ./cmd/smart-llama/
go test ./...
```