Tokenomics

Intelligent LLM Token Optimization Platform
Cut LLM costs by 90.7% on our 50-query benchmark while maintaining response quality through smart caching, adaptive routing, and compression.

Python 3.8+ · MIT License · Tested


Proven Results

We ran a comprehensive test with 50 diverse queries across 7 categories. Here are the results:

Metric                       Value
───────────────────────────  ───────────────────────────────────────────────
Cost Savings                 90.7%
Baseline Cost (50 queries)   $0.150788
Optimized Cost (50 queries)  $0.013999
Total Saved                  $0.136788
Cache Hit Rate               20%
Complexity Accuracy          90% (2-class ML model; complex uses heuristics)

Savings Breakdown by Component

Memory Layer (Cache):     $0.033  (22%)
Bandit Optimizer:         $0.103  (75%)
Compression:              $0.004  (3%)
─────────────────────────────────────────
TOTAL SAVINGS:            $0.137  (90.7%)

📄 Full Test Results: COMPREHENSIVE_E2E_TEST_RESULTS.md


How It Works

┌─────────────────────────────────────────────────────────────────┐
│                        USER QUERY                                │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────────┐
│  1. MEMORY LAYER                                                  │
│     ├── Exact Cache: Hash-based O(1) lookup                       │
│     ├── Semantic Cache: Vector similarity search                  │
│     └── Context Injection: Enrich prompts with similar responses  │
│                                                                   │
│     Result: 20% cache hit rate → 0 tokens for cached queries      │
└───────────────────────────┬───────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────────┐
│  2. TOKEN ORCHESTRATOR                                            │
│     ├── Complexity Analysis: Simple/Medium/Complex                │
│     │   (ML: 2-class simple/medium, heuristic fallback for complex)│
│     ├── Knapsack Optimization: Maximize utility per token         │
│     └── Token Budget Allocation: Smart distribution               │
└───────────────────────────┬───────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────────┐
│  3. BANDIT OPTIMIZER                                              │
│     ├── Strategy Selection: cheap/balanced/premium                │
│     ├── UCB Algorithm: Exploration vs exploitation                │
│     └── Cost-Aware Routing: Balance quality and cost              │
│                                                                   │
│     Result: Learns to use cheaper models → 75% of savings         │
└───────────────────────────┬───────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────────┐
│  4. COMPRESSION (LLMLingua)                                       │
│     ├── Long Query Detection: >500 chars or >150 tokens           │
│     └── Intelligent Compression: Preserve meaning, reduce tokens  │
│                                                                   │
│     Result: 376 tokens saved on 5 long queries                    │
└───────────────────────────┬───────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────────┐
│  5. CASCADING INFERENCE                                           │
│     ├── Start with cheap model (gpt-4o-mini)                      │
│     ├── Quality check (threshold: 0.85)                           │
│     └── Escalate to premium (gpt-4o) if needed                    │
│                                                                   │
│     Result: Quality maintained while minimizing cost              │
└───────────────────────────┬───────────────────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────────────────┐
│                      OPTIMIZED RESPONSE                           │
│                  90.7% cheaper, same quality                      │
└───────────────────────────────────────────────────────────────────┘
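Conceptually, the pipeline reduces to a short decision chain. Below is a minimal, self-contained sketch of that flow; all helper names and thresholds are illustrative stand-ins, and the production logic lives in tokenomics/core.py and its submodules.

import hashlib

exact_cache = {}  # illustrative in-memory store; the real memory layer also
                  # performs semantic (vector-similarity) lookups

def call_llm(prompt, model):
    # Stand-in for an actual OpenAI API call.
    return f"[{model}] answer to: {prompt[:40]}"

def classify_complexity(query):
    # Crude length heuristic standing in for the ML classifier.
    if len(query) < 80:
        return "simple"
    return "medium" if len(query) < 300 else "complex"

def quality_score(response):
    return min(len(response) / 100, 1.0)  # stub quality check

def optimized_query(query):
    # 1. Memory layer: exact hash-based O(1) lookup.
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in exact_cache:
        return {"response": exact_cache[key], "cache_hit": True, "tokens_used": 0}

    # 2-3. Orchestrator + bandit: map complexity to a strategy/model.
    complexity = classify_complexity(query)
    model = "gpt-4o" if complexity == "complex" else "gpt-4o-mini"

    # 4. Compression: long prompts would be run through LLMLingua here.

    # 5. Cascading inference: start cheap, escalate if quality is low.
    response = call_llm(query, model)
    if model == "gpt-4o-mini" and quality_score(response) < 0.85:
        response = call_llm(query, "gpt-4o")

    exact_cache[key] = response
    return {"response": response, "cache_hit": False, "model": model}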

Learning & Adaptation

Tokenomics gets smarter over time through multiple ML components:

Pre-trained ML Models

Model                  Purpose                                       Accuracy
─────────────────────  ────────────────────────────────────────────  ────────────────────────────────────────
Complexity Classifier  Categorize queries as simple/medium/complex   90% (2-class ML + heuristic for complex)
Token Predictor        Estimate response length before calling LLM   72%
Escalation Predictor   Predict when quality escalation is needed     85%

Note on Complexity Classifier: The current pre-trained model is a 2-class classifier (simple/medium). Queries classified as "complex" use a heuristic-based fallback that analyzes keyword indicators, query length, and comparison patterns. This hybrid approach ensures all three complexity levels are supported while the model can be retrained with 3-class data in the future.
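For illustration, a fallback of that shape might look like the sketch below; the keyword list and thresholds are assumptions, not the orchestrator's actual values.

COMPLEX_INDICATORS = {"architecture", "trade-off", "prove", "derive",
                      "design", "multi-step"}  # illustrative keyword list

def is_complex_heuristic(query):
    """Heuristic escalation to 'complex' (thresholds are assumptions)."""
    q = query.lower()
    keyword_hits = sum(1 for kw in COMPLEX_INDICATORS if kw in q)
    has_comparison = " vs " in q or "versus" in q or "compare" in q
    return keyword_hits >= 2 or len(q) > 400 or has_comparison

def classify(query, two_class_model):
    # The 2-class ML model only sees queries the heuristic did not escalate.
    if is_complex_heuristic(query):
        return "complex"
    return two_class_model.predict([query])[0]  # "simple" or "medium"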

Continuous Learning

┌─────────────────────────────────────────────────────────────────┐
│  BANDIT OPTIMIZER (UCB Algorithm)                               │
│                                                                 │
│  Learns from every query:                                       │
│  ├── Tracks cost vs quality for each strategy                   │
│  ├── Balances exploration (try new) vs exploitation (use best)  │
│  └── Adapts routing based on actual outcomes                    │
│                                                                 │
│  Result: Routing improves automatically with usage              │
└─────────────────────────────────────────────────────────────────┘
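In code, UCB-style strategy selection can be as small as the sketch below. The exploration constant and the cost weight in the reward are illustrative assumptions; the real optimizer lives in tokenomics/bandit/.

import math

class UCBStrategyRouter:
    """Illustrative UCB1 router over pricing strategies."""

    def __init__(self, strategies=("cheap", "balanced", "premium"), c=1.4):
        self.c = c
        self.counts = {s: 0 for s in strategies}
        self.rewards = {s: 0.0 for s in strategies}

    def select(self):
        # Play every strategy once before applying the UCB formula.
        for s, n in self.counts.items():
            if n == 0:
                return s
        total = sum(self.counts.values())
        return max(self.counts, key=lambda s:
                   self.rewards[s] / self.counts[s]
                   + self.c * math.sqrt(math.log(total) / self.counts[s]))

    def update(self, strategy, quality, cost):
        # Cost-aware reward: quality is good, spend is penalized.
        # The 10x cost weight is an assumption, not the platform's value.
        self.counts[strategy] += 1
        self.rewards[strategy] += quality - 10.0 * cost

router = UCBStrategyRouter()
strategy = router.select()                         # -> "cheap" on first call
router.update(strategy, quality=0.9, cost=0.0003)  # learn from the outcome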

Data Collection Pipeline

The platform collects training data from real queries to retrain models:

# Collect training data
python scripts/collect_training_data.py

# Retrain models with new data
python scripts/train_ml_models.py

# Evaluate model performance
python scripts/evaluate_ml_models.py

Pre-trained models: models/ directory contains ready-to-use .pkl files.
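If the .pkl files are standard pickled estimators (an assumption; see scripts/train_ml_models.py for the authoritative format), loading one is straightforward. The file name below is hypothetical:

import pickle

# Hypothetical file name; list models/ for the actual artifacts.
with open("models/complexity_classifier.pkl", "rb") as f:
    classifier = pickle.load(f)

# Assumes a scikit-learn-style predict() interface.
print(classifier.predict(["What is a hash map?"]))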


Quick Start

1. Clone and Setup

git clone https://github.com/Tokenomics-AI/Tokenomics.git
cd Tokenomics
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

2. Configure

cp env.template .env
# Edit .env and add your OPENAI_API_KEY

3. Run

from dotenv import load_dotenv

from tokenomics.config import TokenomicsConfig
from tokenomics.core import TokenomicsPlatform

load_dotenv()

config = TokenomicsConfig.from_env()
platform = TokenomicsPlatform(config=config)

# Run optimized query
result = platform.query("What is machine learning?")
print(f"Response: {result['response']}")
print(f"Tokens: {result['tokens_used']}")
print(f"Cache hit: {result['cache_hit']}")

4. Try the Playground

python app.py
# Open http://localhost:5000/playground

Components

Component              Purpose                                   Key Metric
─────────────────────  ────────────────────────────────────────  ─────────────────────────────
Memory Layer           Exact + semantic caching                  20% cache hit rate
Token Orchestrator     Complexity analysis & budget allocation   90% (2-class ML + heuristics)
Bandit Optimizer       Cost-aware model routing                  75% of savings
LLMLingua Compression  Reduce long-query tokens                  376 tokens saved
Cascading Inference    Quality-protected model selection         Maintains quality
Token Predictor        ML-based response length prediction       72% accuracy

Test Categories

Our comprehensive test included:

Category                   Count  Purpose                   Result
─────────────────────────  ─────  ────────────────────────  ──────────────────────
Simple queries             10     Test cheap strategy       ✅ Low cost
Medium queries             10     Test balanced strategy    ✅ Appropriate routing
Complex queries            10     Test premium + cascading  ✅ Quality maintained
Exact duplicates           5      Test exact cache          ✅ 100% hit rate
Semantic variations        5      Test semantic cache       ✅ 40% hit rate
Long queries (>500 chars)  5      Test compression          ✅ All compressed
Mixed scenarios            5      Edge cases                ✅ Handled correctly

Project Structure

tokenomics/
├── tokenomics/              # Core platform
│   ├── core.py              # Main entry point
│   ├── memory/              # Caching layer
│   ├── orchestrator/        # Token allocation
│   ├── bandit/              # Strategy selection
│   ├── compression/         # LLMLingua integration
│   └── ml/                  # Token prediction
├── templates/               # Web UI
├── static/                  # CSS/JS
├── app.py                   # Flask server
├── scripts/                 # Utility scripts
├── tests/                   # Test suite
├── docs/                    # Documentation
└── examples/                # Usage examples

Documentation

Detailed guides live in the docs/ directory; runnable usage examples are in examples/.

API Reference

platform.query()

result = platform.query(
    query="Your question here",
    token_budget=4000,          # Optional: max tokens
    use_cache=True,             # Use memory layer
    use_bandit=True,            # Use strategy selection
    use_compression=True,       # Compress long queries
    use_cost_aware_routing=True # Optimize for cost
)

Returns:

{
    "response": "...",           # LLM response
    "tokens_used": 250,          # Total tokens
    "cache_hit": False,          # Was this cached?
    "strategy": "cheap",         # Strategy used
    "model": "gpt-4o-mini",      # Model used
    "latency_ms": 1234,          # Response time
}
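These fields can be aggregated across queries to watch routing and caching behavior over time. A small sketch follows; whether strategy is reported on cache hits is an assumption, handled below with .get():

from collections import Counter

stats = Counter()
for q in ["What is Python?", "What is Python?", "Compare SQL and NoSQL."]:
    result = platform.query(q)
    stats["cache_hit" if result["cache_hit"] else "cache_miss"] += 1
    stats["strategy:" + result.get("strategy", "cached")] += 1

print(dict(stats))
# e.g. {'cache_miss': 2, 'strategy:cheap': 2, 'cache_hit': 1, 'strategy:cached': 1}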

Benchmarks

We provide rigorous cost benchmarks comparing BASELINE (no optimization) vs TOKENOMICS (full pipeline).

Synthetic-50 Benchmark (Decision Accuracy Workload)

Metric                       Value
───────────────────────────  ─────────
Mean Cost Savings            90.7%
Total Baseline Cost          $0.217602
Total Tokenomics Cost        $0.013989
Cache Hit Rate               20%
Exact Duplicate Hit Rate     100%
Semantic Variation Hit Rate  80%

📄 Full Report: BENCHMARK_COST_RESULTS_SYNTHETIC_50.md

10-Prompt Quick Benchmark

Metric                        Value
────────────────────────────  ─────────
Mean Cost Savings             85.5%
Baseline Cost (10 prompts)    $0.037800
Tokenomics Cost (10 prompts)  $0.001836

📄 Full Report: BENCHMARK_COST_RESULTS.md

Methodology

  • BASELINE: Cache completely disabled, no compression, no routing, fixed gpt-4o model
  • TOKENOMICS: Full pipeline enabled, cache starts cold (empty), warms across prompts
  • Quality check: Minimum output length validation
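To make the savings arithmetic concrete, here is a minimal sketch of the savings calculation and a quality gate of the kind described above. The 50-character threshold is an assumption, not the benchmark's exact value:

def cost_savings(baseline_cost, optimized_cost):
    """Fraction of baseline spend eliminated by the pipeline."""
    return 1 - optimized_cost / baseline_cost

def passes_quality_check(response, min_chars=50):
    # Minimum-output-length gate; the exact threshold used by the
    # benchmark scripts is an assumption here.
    return len(response.strip()) >= min_chars

# Reproduces the headline number from the 50-query test:
print(f"{cost_savings(0.150788, 0.013999):.1%}")   # -> 90.7%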

Run Your Own Benchmark

# 50-query synthetic benchmark (with breakdown analysis)
python scripts/run_cost_benchmark.py \
    --workload benchmarks/synthetic_accuracy_50.json \
    --output BENCHMARK_COST_RESULTS_SYNTHETIC_50.md \
    --include_breakdowns

# 10-query quick benchmark
python scripts/run_cost_benchmark.py \
    --workload benchmarks/workloads_v0.json \
    --output BENCHMARK_COST_RESULTS.md

# Two-pass (cold + warm to measure cache impact)
python scripts/run_cost_benchmark.py --two_pass

Future Work & Open Problems

Tokenomics is intentionally early-stage and experimental.
The following are open technical problems (not roadmap promises) where contributions are especially valuable:

  • 3-class complexity classification
    The current ML model predicts two classes (simple/medium), with heuristic escalation to "complex". An open direction is collecting 3-class training data and evaluating whether ML-based "complex" prediction improves routing outcomes without degrading cost savings.

  • Distributed and reproducible routing state
    Adaptive routing currently uses file-backed state for simplicity. Designing a pluggable, concurrency-safe backend (e.g., SQLite or Redis) that preserves reproducibility and debuggability remains an open systems challenge.

  • Evaluation and quality regression bounds
    Tokenomics measures cost and latency reliably, but defining repeatable benchmarks and enforcing quality regression guardrails for routing and compression decisions is an ongoing area of work.

  • Observability and explainability of routing decisions
    Understanding why a particular model, strategy, or escalation path was chosen is critical for trust and debugging. Improving structured logging, traceability, and decision explainability is an open area.

  • Cache semantics and invalidation strategies
    Both exact and semantic caching are powerful but subtle. Defining robust invalidation, TTL, and trust-decay policies for cached LLM outputs remains an open problem (see the sketch after this list).

These are areas of active exploration; thoughtful experiments, design discussions, and incremental contributions are welcome.
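As a conversation starter on the cache-semantics problem, one possible shape for a TTL-plus-trust-decay policy is sketched below. This is exploratory, not an implemented Tokenomics feature; the TTL, half-life, and serving threshold are all assumptions.

import time

class DecayingCacheEntry:
    """Exploratory TTL + trust-decay policy for a cached LLM output."""

    def __init__(self, response, ttl_s=86_400, half_life_s=43_200):
        self.response = response
        self.created = time.time()
        self.ttl_s = ttl_s              # hard expiry (assumed: 24 h)
        self.half_life_s = half_life_s  # trust halves every 12 h (assumed)

    def trust(self):
        age = time.time() - self.created
        if age > self.ttl_s:
            return 0.0                  # expired entries are never served
        return 0.5 ** (age / self.half_life_s)

entry = DecayingCacheEntry("cached answer")
if entry.trust() > 0.6:                 # serve only sufficiently fresh entries
    print(entry.response)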


License

MIT License - see LICENSE


Acknowledgments


Built for the LLM optimization community
