smart-llama

A lightweight Go server that wraps llama-server (from llama.cpp) with dynamic model loading and an OpenAI-compatible API.

Why smart-llama?

The problem: Running multiple LLMs locally requires either loading them all into memory (expensive) or manually restarting servers every time you switch models.

The solution: smart-llama keeps one model loaded at a time and automatically swaps models when you request a different one. Think of it as a simpler Ollama that preserves full llama-server configuration flexibility.

Client -> smart-llama (:8080) -> llama-server subprocess (:8081)

Key features:

  • Single model in memory at a time (automatic swap on request)
  • Full control over llama-server arguments per model
  • OpenAI-compatible API for easy integration
  • YAML configuration (no CLI flags to remember)

Quick Start

1. Install llama.cpp

Build or install llama-server from llama.cpp.
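
If you build from source, the standard CMake build of llama.cpp produces the llama-server binary. A minimal sketch (see the upstream llama.cpp docs for platform- and GPU-specific options):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# The binary is typically produced at build/bin/llama-server;
# point paths.llama_server in config.yaml at it.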

2. Configure

Server configuration (config.yaml):

server:
  listen_addr: ":8080"
  llama_server_port: 8081

paths:
  llama_server: "/usr/local/bin/llama-server"
  models_dir: "./models"

Model configuration (models/llama3-8b.yaml):

name: llama3-8b
model_path: /path/to/llama-3-8b.gguf

args:
  - --ctx-size
  - "8192"
  - --n-gpu-layers
  - "99"
  - --flash-attn
  - "on"

The args list accepts any llama-server argument directly.
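
Each model gets its own YAML file, so arguments can be tuned per model. A second, purely illustrative example (models/mistral-7b.yaml, with a hypothetical model name and path):

name: mistral-7b
model_path: /path/to/mistral-7b-instruct.gguf

args:
  - --ctx-size
  - "4096"
  - --n-gpu-layers
  - "0"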

3. Run

./smart-llama

You can specify a custom configuration file:

# Using CLI flag
./smart-llama --config /path/to/config.yaml

# Using environment variable
SL_CONFIG=/path/to/config.yaml ./smart-llama

Priority: --config flag > SL_CONFIG env var > config.yaml (default)

4. Use

# List available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Health check
curl http://localhost:8080/health
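
Requesting a different model name triggers an automatic swap: smart-llama stops the current llama-server subprocess and starts a new one for the requested model. A sketch, assuming a second model config named mistral-7b exists in models_dir:

# Runs against llama3-8b (loading it if necessary)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b", "messages": [{"role": "user", "content": "Hi"}]}'

# Swaps to mistral-7b before answering
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Hi"}]}'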

Configuration Reference

Server (config.yaml)

| Field | Description | Default |
| --- | --- | --- |
| server.listen_addr | Address to listen on | :8080 |
| server.llama_server_port | Port for llama-server subprocess | 8081 |
| paths.llama_server | Path to llama-server binary | llama-server |
| paths.models_dir | Directory containing model configs | ./models |
| timeouts.startup | Model startup timeout (seconds) | 120 |
| timeouts.shutdown | Graceful shutdown timeout (seconds) | 10 |
| timeouts.inactivity | Auto-stop after inactivity (minutes, 0 to disable) | 5 |
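
Putting the timeout fields together, a full config.yaml might look like this (a sketch using the defaults from the table above):

server:
  listen_addr: ":8080"
  llama_server_port: 8081

paths:
  llama_server: "/usr/local/bin/llama-server"
  models_dir: "./models"

timeouts:
  startup: 120
  shutdown: 10
  inactivity: 5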

Model (models/*.yaml)

| Field | Description |
| --- | --- |
| name | Model identifier (used in API requests) |
| model_path | Absolute path to the .gguf file |
| args | List of llama-server arguments |

Build

go build ./cmd/smart-llama/

Test

go test ./...
