VECS is a high-performance Semantic Cache server designed to optimize Large Language Model (LLM) applications.
It acts as middleware between your users and your LLM (e.g., GPT-4, Claude, Llama 3). Instead of performing expensive and slow inference for every query, VECS stores responses and retrieves them using semantic search. If a user asks a question similar to one already answered, VECS returns the cached response instantly.
Scenario:

- User A asks: "How do I reset my password?" -> MISS -> Call LLM -> Store answer.
- User B asks: "I forgot my access key, help" -> HIT -> Vecs detects similarity (0.85) -> Returns the stored answer instantly.
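In application code this is a simple cache-aside pattern: ask Vecs first and only call the model on a miss. Below is a minimal sketch using the vecs-client npm package shown at the end of this README; `callLLM` is a hypothetical placeholder for your own inference call, and it assumes `query` resolves to the cached response (or nothing) as in the client example.

```ts
import { VecsClient } from "vecs-client";

const cache = new VecsClient({ host: "localhost", port: 6380 });
await cache.connect();

// Hypothetical placeholder for your real inference call (OpenAI, Anthropic, a local Llama, ...)
async function callLLM(prompt: string): Promise<string> {
  return `LLM answer for: ${prompt}`;
}

async function answer(prompt: string): Promise<string> {
  // 1. Semantic lookup first: a HIT returns the cached response and skips inference
  const cached = await cache.query(prompt);
  if (cached) return cached;

  // 2. MISS: run the expensive inference, then store the pair for future similar queries
  const response = await callLLM(prompt);
  await cache.set(prompt, {}, response);
  return response;
}

console.log(await answer("How do I reset my password?"));
```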
- 🚀 Native Performance: Written in pure C with zero runtime dependencies (no Python required).
- 🧠 Smart Dual-Layer Caching:
  - L1 (Exact Match): O(1) hash map with text normalization (ignores case and punctuation) for instant retrieval.
  - L2 (Hybrid Semantic): Not just vector search! (See the conceptual sketch after this list.)
    - Mean Pooling: Uses state-of-the-art embedding aggregation (not just the [CLS] token) for higher accuracy.
    - Hybrid Filtering: Performs keyword analysis to detect negations ("I want..." vs. "I do NOT want...") and length mismatches, drastically reducing false positives.
- ⚡ Hardware Acceleration:
  - macOS: Native Metal (GPU) support for Apple Silicon (M1/M2/M3).
  - Linux: Optimized AVX/AVX2 CPU inference with OpenMP.
- ♻️ Smart Deduplication: Prevents cache pollution by detecting and rejecting semantically identical entries.
- 🐳 Docker Ready: Production-ready container with environment configuration.
- ⏱️ TTL Support (New): Automatic data expiration. Set a Time-To-Live globally via environment variables or individually per key.
- 🔌 VSP Protocol: Simple, text-based TCP protocol (Redis-like).
- vecs-cli: Command-line tool for interacting with the VSP protocol.
- 🗄️ Persistence: Local persistence of cache data.
- 🤖 Model Support: Vecs automatically detects the model architecture (encoder-only vs. decoder) to use the correct inference method. It is optimized for modern embedding models such as BGE-M3, E5, and Nomic-Embed.
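To make the "hybrid" part concrete, here is a rough TypeScript sketch of what an L2 hit decision can look like: mean pooling of token embeddings, a cosine-similarity threshold, and keyword/length filters on top. This is a conceptual illustration only; the thresholds, the negation word list, and the helper names are assumptions, not the server's actual C implementation.

```ts
// Conceptual sketch of an L2 hit decision (illustrative, not the actual C code).

function meanPool(tokenEmbeddings: number[][]): number[] {
  // Average every token vector instead of relying on the [CLS] token alone
  const dim = tokenEmbeddings[0].length;
  const pooled = new Array(dim).fill(0);
  for (const vec of tokenEmbeddings) {
    for (let i = 0; i < dim; i++) pooled[i] += vec[i];
  }
  return pooled.map((v) => v / tokenEmbeddings.length);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Illustrative keyword list; the real filter is not specified in this README
const NEGATIONS = ["not", "never", "no", "without"];

function isHit(query: string, cached: string, qVec: number[], cVec: number[], threshold = 0.65): boolean {
  // 1. Vector similarity must clear the threshold (cf. VECS_L2_THRESHOLD)
  if (cosine(qVec, cVec) < threshold) return false;

  // 2. Hybrid filter: a negation present in only one of the two texts is a red flag
  const hasNeg = (t: string) => {
    const words = t.toLowerCase().split(/\W+/);
    return NEGATIONS.some((w) => words.includes(w));
  };
  if (hasNeg(query) !== hasNeg(cached)) return false;

  // 3. Hybrid filter: reject large length mismatches (ratio is an arbitrary example value)
  const ratio = Math.min(query.length, cached.length) / Math.max(query.length, cached.length);
  return ratio > 0.3;
}
```

A query counts as a HIT only if it clears the vector threshold and also passes the keyword and length checks.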
To build from source you need:

- Compiler: GCC or Clang
- Build Tools: Make, CMake
- Git: for cloning submodules
Clone recursively to include the inference engine (llama.cpp).
```bash
git clone --recursive https://github.com/riccardogiuriola/vecs
cd vecs
```
Vecs requires a GGUF embedding model. We recommend BGE-M3 (high accuracy) or MiniLM-L6-v2 (high speed).
```bash
mkdir -p models

# Download BGE-M3 (Quantized 4-bit)
curl -L -o models/default_model.gguf https://huggingface.co/gpustack/bge-m3-GGUF/resolve/main/bge-m3-Q4_K_M.gguf
```
The Makefile automatically detects your OS and links the correct acceleration libraries.
```bash
# 1. Build dependencies (static libs)
make libs

# 2. Build server executable
make

# 3. Start the server
./vecs
# Output: Server listening on port 6380...
```
Docker is the easiest way to run Vecs in production. The image is approximately 100 MB and downloads the default model automatically inside the container.
```bash
docker run -d \
  --name vecs \
  -p 6380:6380 \
  vecs:latest
```
You can tune thresholds and capacity using Environment Variables.
```bash
docker run -d \
  --name vecs \
  -p 6380:6380 \
  -e VECS_L2_THRESHOLD="0.75" \
  -e VECS_L2_DEDUPE_THRESHOLD="0.95" \
  -e VECS_L2_CAPACITY="10000" \
  -e VECS_TTL_DEFAULT="86400" \
  riccardogiuriola/vecs:latest
```
Create a docker-compose.yml:
```yaml
services:
  vecs:
    image: riccardogiuriola/vecs:latest
    build: .
    ports:
      - "6380:6380"
    environment:
      # Minimum similarity threshold to consider a query a HIT (0.0 - 1.0)
      - VECS_L2_THRESHOLD=0.65
      # Maximum number of vectors to keep in RAM
      - VECS_L2_CAPACITY=5000
      # Threshold above which a new entry is considered a duplicate and not saved (Deduplication)
      - VECS_L2_DEDUPE_THRESHOLD=0.95
      # Default TTL for new entries, in seconds (e.g., 2 hours)
      - VECS_TTL_DEFAULT=7200
      # Path to the GGUF model (if not using the default one)
      # - VECS_MODEL_PATH=/app/models/my_model.gguf
    volumes:
      - ./models:/app/models
      - vecs_data:/app/data

volumes:
  vecs_data:
    driver: local
```
| Variable | Default | Description |
|---|---|---|
| VECS_MODEL_PATH | models/bge-m3... | Path to the .gguf embedding model. |
| VECS_L2_THRESHOLD | 0.65 | Minimum cosine similarity (0.0 - 1.0) to consider a request a HIT. Lower = more lenient. |
| VECS_L2_DEDUPE_THRESHOLD | 0.95 | If a new entry is > 95% similar to an existing one, it is NOT saved (Deduplication). |
| VECS_L2_CAPACITY | 5000 | Maximum number of vectors to keep in RAM. |
| VECS_TTL_DEFAULT | 3600 | Default Time-To-Live in seconds (1 hour) for entries without an explicit TTL. |
| PORT | 6380 | Listening port. |
Vecs uses VSP (Vecs Simple Protocol), a text-based protocol similar to RESP. You can interact with it using nc or the provided clients.
Stores a prompt and its response in L1 (Exact) and L2 (Semantic).

```
SET <Prompt> <Metadata_JSON> <Response> [ttl_seconds]
```

Searches L1 first, then computes the embedding and searches L2.

```
QUERY <Prompt> <Metadata_JSON>
```

Removes the exact match from L1 and semantically similar vectors from L2.

```
DELETE <Prompt> <Metadata_JSON>
```

Clears the entire server cache.

```
FLUSH
```

Manually saves the cache to disk.

```
SAVE
```
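For a quick smoke test you can speak VSP over a raw TCP socket (the same thing nc does). The sketch below uses Node's built-in net module; the argument quoting and the reply format are assumptions, since this README only specifies the command syntax, so check the server's actual output or use vecs-cli / the npm client instead.

```ts
import * as net from "net";

// Minimal VSP smoke test over a raw TCP socket.
// Command framing (one line per command) and the reply format are assumptions.
const socket = net.createConnection({ host: "localhost", port: 6380 }, () => {
  // Store a prompt/response pair with a 1-hour TTL
  socket.write('SET "How do I reset my password?" {} "Go to Settings > Security > Reset." 3600\n');
  // Ask a semantically similar question; ideally served from the L2 cache
  socket.write('QUERY "I forgot my password" {}\n');
});

socket.on("data", (chunk) => console.log("server:", chunk.toString().trim()));
socket.on("error", (err) => console.error("connection failed:", err.message));

// Close the connection after a short grace period
setTimeout(() => socket.end(), 1000);
```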
A zero-dependency client is available via npm.
```bash
npm install vecs-client
```
```js
import { VecsClient } from 'vecs-client';

const client = new VecsClient({ host: 'localhost', port: 6380 });
await client.connect();

// Store
await client.set("How do I reset password?", {}, "Go to settings...");

// Query (Semantic)
const answer = await client.query("I forgot my password");
if (answer) console.log("HIT:", answer);
```