
⚡ VECS (Vector Embedding Caching System)

VECS is a high-performance Semantic Cache server designed to optimize Large Language Model (LLM) applications.

It acts as middleware between your users and your LLM (e.g., GPT-4, Claude, Llama 3). Instead of performing expensive and slow inference for every query, VECS stores responses and retrieves them using Semantic Search. If a user asks a question similar to one already answered, VECS returns the cached response instantly.

Scenario:

  1. User A asks: "How do I reset my password?" -> MISS -> Call LLM -> Store Answer.

  2. User B asks: "I forgot my access key, help" -> HIT -> Vecs detects similarity (0.85) -> Returns stored answer instantly.
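In application code this is the classic cache-aside pattern. Below is a minimal TypeScript sketch using the vecs-client package documented later in this README; callLLM is a hypothetical placeholder for your actual model call.

import { VecsClient } from 'vecs-client';

// Hypothetical placeholder for your real LLM call (GPT-4, Claude, a local model, ...)
declare function callLLM(prompt: string): Promise<string>;

const cache = new VecsClient({ host: 'localhost', port: 6380 });
await cache.connect();

async function answer(prompt: string): Promise<string> {
  // 1. Try the cache first (L1 exact match, then L2 semantic search)
  const cached = await cache.query(prompt);
  if (cached) return cached; // HIT: no inference needed

  // 2. MISS: run the expensive inference once...
  const response = await callLLM(prompt);

  // 3. ...and store it so semantically similar questions become HITs
  await cache.set(prompt, {}, response);
  return response;
}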

✨ Key Features

  • 🚀 Native Performance: Written in pure C with zero runtime dependencies (no Python required).

  • 🧠 Smart Dual-Layer Caching:

  • L1 (Exact Match): O(1) Hash Map with text normalization (ignores case & punctuation) for instant retrieval (sketched below).

  • L2 (Hybrid Semantic): Not just vector search!

    • Mean Pooling: Uses state-of-the-art embedding aggregation (not just the [CLS] token) for higher accuracy.
    • Hybrid Filtering: Performs keyword analysis to detect negations ("I want..." vs "I do NOT want...") and length mismatches, drastically reducing false positives (sketched below).
  • ⚡ Hardware Acceleration:

    • macOS: Native Metal (GPU) support for Apple Silicon (M1/M2/M3).

    • Linux: Optimized AVX/AVX2 CPU inference with OpenMP.

  • ♻️ Smart Deduplication: Prevents cache pollution by detecting and rejecting semantically identical entries.

  • 🐳 Docker Ready: Production-ready container with environment configuration.

  • ⏱️ TTL Support (New): Automatic data expiration. Set a Time-To-Live globally via environment variables or individually per specific key.

  • 🔌 VSP Protocol: Simple, text-based TCP protocol (Redis-like).

  • vecs-cli: Command-line tool for interacting with the VSP protocol.

  • 🗄️ Persistence: Local persistence of cache data to disk.
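To make the L1 normalization and L2 hybrid filtering above concrete, here is an illustrative TypeScript sketch. It is not the server's C implementation: the exact rules (which punctuation is stripped, the negation word list, the acceptable length ratio) are assumptions for illustration only.

// L1 key normalization (assumed rules: lowercase, strip punctuation, collapse whitespace)
function normalizeKey(prompt: string): string {
  return prompt
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '') // drop punctuation, keep letters/digits
    .replace(/\s+/g, ' ')
    .trim();
}

// "How do I reset my password?" and "how do i reset my password" share one L1 slot
console.log(normalizeKey('How do I reset my password?') === normalizeKey('how do i reset my password')); // true

// L2 hybrid filter (assumed heuristics): reject a semantic match when the two
// texts disagree on negation or differ wildly in length, even if similarity is high.
const NEGATIONS = ['not', 'no', 'never', 'dont']; // hypothetical word list (post-normalization)

function passesHybridFilter(query: string, cached: string): boolean {
  const hasNeg = (s: string) => normalizeKey(s).split(' ').some(w => NEGATIONS.includes(w));
  if (hasNeg(query) !== hasNeg(cached)) return false; // "I want..." vs "I do NOT want..."
  const ratio = query.length / cached.length;
  return ratio > 0.5 && ratio < 2.0; // reject large length mismatches
}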

🛠️ Architecture

🤖 Model Support: Vecs automatically detects the model architecture (Encoder-only vs Decoder) to use the correct inference method. It is optimized for modern embedding models like BGE-M3, E5, and Nomic-Embed.

[VECS Architecture Diagram]
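The L2 matching itself is easy to picture. The sketch below is illustrative TypeScript, not the server's Metal/AVX code path: per-token embeddings are mean-pooled into one sentence vector, L2-normalized, and compared by dot product (cosine similarity).

// Mean pooling: average the per-token embeddings into a single sentence vector
function meanPool(tokenEmbeddings: number[][]): number[] {
  const dim = tokenEmbeddings[0].length;
  const out = new Array<number>(dim).fill(0);
  for (const tok of tokenEmbeddings)
    for (let i = 0; i < dim; i++) out[i] += tok[i];
  return out.map(v => v / tokenEmbeddings.length);
}

// L2-normalize so cosine similarity reduces to a plain dot product
function normalize(v: number[]): number[] {
  const n = Math.hypot(...v);
  return v.map(x => x / n);
}

function cosine(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0); // assumes both vectors normalized
}

// A query is a HIT when cosine(query, cached) >= VECS_L2_THRESHOLD (default 0.65)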

📦 Installation (Native)

Prerequisites

  • Compiler: GCC or Clang

  • Build Tools: Make, CMake

  • Git: For cloning submodules

1. Clone Repository

Clone recursively to include the inference engine (llama.cpp).

git clone --recursive https://github.com/riccardogiuriola/vecs
cd vecs

2. Download AI Model

Vecs requires a GGUF embedding model. We recommend BGE-M3 (high accuracy) or MiniLM-L6-v2 (high speed).

mkdir -p models

# Download BGE-M3 (Quantized 4-bit)
curl -L -o models/default_model.gguf https://huggingface.co/gpustack/bge-m3-GGUF/resolve/main/bge-m3-Q4_K_M.gguf

3. Build

The Makefile automatically detects your OS and links the correct acceleration libraries.

# 1. Build dependencies (static libs)
make libs

# 2. Build server executable
make

4. Run

./vecs

Server listening on port 6380...

🐳 Installation (Docker)

The easiest way to run Vecs in production. The image is approximately 100MB.

1. Run with Default Settings

This downloads the default model automatically inside the container.

docker run -d \
  --name vecs \
  -p 6380:6380 \
  riccardogiuriola/vecs:latest

2. Run with Custom Configuration

You can tune thresholds and capacity using Environment Variables.

docker run -d \
  --name vecs \
  -p 6380:6380 \
  -e VECS_L2_THRESHOLD="0.75" \
  -e VECS_L2_DEDUPE_THRESHOLD="0.95" \
  -e VECS_L2_CAPACITY="10000" \
  -e VECS_TTL_DEFAULT="86400" \
  riccardogiuriola/vecs:latest

3. Docker Compose

Create a docker-compose.yml:

services:
  vecs:
    image: riccardogiuriola/vecs:latest
    build: .
    ports:
      - "6380:6380"
    environment:
      # Minimum similarity for a query to count as a HIT (0.0 - 1.0)
      - VECS_L2_THRESHOLD=0.65
      # Maximum number of vectors to keep in RAM
      - VECS_L2_CAPACITY=5000
      # Threshold above which a near-duplicate entry is not saved (deduplication)
      - VECS_L2_DEDUPE_THRESHOLD=0.95
      # Default TTL for new entries, in seconds (e.g. 2 hours)
      - VECS_TTL_DEFAULT=7200
      # Path to the GGUF model (if not using the default one)
      # - VECS_MODEL_PATH=/app/models/my_model.gguf
    volumes:
      - ./models:/app/models
      - vecs_data:/app/data
volumes:
  vecs_data:
    driver: local

⚙️ Configuration (Environment Variables)

| Variable | Default | Description |
|---|---|---|
| VECS_MODEL_PATH | models/bge-m3... | Path to the .gguf embedding model. |
| VECS_L2_THRESHOLD | 0.65 | Minimum cosine similarity (0.0 - 1.0) to consider a request a HIT. Lower = more lenient. |
| VECS_L2_DEDUPE_THRESHOLD | 0.95 | If a new entry is > 95% similar to an existing one, it is NOT saved (deduplication). |
| VECS_L2_CAPACITY | 5000 | Maximum number of vectors to keep in RAM. |
| VECS_TTL_DEFAULT | 3600 | Default Time-To-Live in seconds (1 hour) for entries without an explicit TTL. |
| PORT | 6380 | Listening port. |
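The two similarity thresholds play different roles. Here is an illustrative sketch of how they interact, mirroring the defaults above (the real decision logic lives inside the server):

// On QUERY: similarity to the nearest cached vector decides HIT vs MISS
function onQuery(similarity: number, threshold = 0.65): 'HIT' | 'MISS' {
  return similarity >= threshold ? 'HIT' : 'MISS';
}

// On SET: an entry more than 95% similar to an existing one is not saved
function onSet(similarityToNearest: number, dedupe = 0.95): 'STORE' | 'SKIP' {
  return similarityToNearest > dedupe ? 'SKIP' : 'STORE';
}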

📡 API Protocol (VSP)

Vecs uses VSP (Vecs Simple Protocol), a text-based protocol similar to RESP. You can interact with it using nc or the provided clients.

SET (Store Data)

Stores a prompt and its response in L1 (Exact) and L2 (Semantic).

SET <Prompt> <Metadata_JSON> <Response> [ttl_seconds]

QUERY (Retrieve Data)

Searches L1 first, then calculates embedding and searches L2.

QUERY <Prompt> <Metadata_JSON>

DELETE (Remove Data)

Removes the exact match from L1 and semantically similar vectors from L2.

DELETE <Prompt> <Metadata_JSON>

FLUSH (Clear Cache)

Clears the entire server cache.

FLUSH

SAVE (Save to disk)

Manually saves the cache to disk.

SAVE
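Any TCP socket can speak VSP. Below is a minimal Node.js sketch that sends one command and prints the raw reply; the argument quoting and the reply format are not specified in this README, so treat the framing as an assumption and check vecs-cli or the server source before relying on it.

import net from 'node:net';

// Minimal raw VSP round trip (assumption: one newline-terminated command per request)
const sock = net.createConnection({ host: 'localhost', port: 6380 }, () => {
  sock.write('QUERY How do I reset my password? {}\n');
});

sock.on('data', (chunk) => {
  console.log('reply:', chunk.toString()); // print whatever the server answers
  sock.end();
});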

💻 Client Libraries

Node.js / TypeScript

A zero-dependency client is available via npm.

npm install vecs-client

import { VecsClient } from 'vecs-client';

const client = new VecsClient({ host: 'localhost', port: 6380 });
await client.connect();

// Store
await client.set("How do I reset password?", {}, "Go to settings...");

// Query (Semantic)
const answer = await client.query("I forgot my password");
if (answer) console.log("HIT:", answer);
