Minimal LLM Inference Engine

A from-scratch implementation of an LLM inference engine in Rust, designed to load and run Llama-family models on consumer hardware.

Status: Work in progress. The core tensor system and GGUF parser are complete; the inference pipeline is in development.

Features

Tensor System

Generic N-dimensional tensor implementation with shape and stride abstractions
Memory-efficient views for slicing and reshaping without copying
Standard operations: element-wise math, matrix multiplication, broadcasting
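
To illustrate how shape and stride abstractions make copy-free views possible, here is a minimal sketch. The `View` type and `contiguous_strides` helper are hypothetical names chosen for this example and are not the crate's actual API; the sketch assumes row-major `f32` storage.

```rust
/// Row-major (C-order) strides, in elements, for a given shape.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Hypothetical non-owning view: borrowed data plus its own shape, strides and offset.
struct View<'a> {
    data: &'a [f32],
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl<'a> View<'a> {
    fn new(data: &'a [f32], shape: &[usize]) -> Self {
        assert_eq!(data.len(), shape.iter().product::<usize>());
        View {
            data,
            shape: shape.to_vec(),
            strides: contiguous_strides(shape),
            offset: 0,
        }
    }

    /// Element lookup: offset + sum(index[i] * strides[i]).
    fn get(&self, index: &[usize]) -> f32 {
        assert_eq!(index.len(), self.shape.len());
        let mut pos = self.offset;
        for (i, (&idx, &dim)) in index.iter().zip(&self.shape).enumerate() {
            assert!(idx < dim, "index out of bounds");
            pos += idx * self.strides[i];
        }
        self.data[pos]
    }

    /// Swap two axes by permuting shape and strides; the buffer is untouched.
    fn transpose(&self, a: usize, b: usize) -> View<'a> {
        let mut shape = self.shape.clone();
        let mut strides = self.strides.clone();
        shape.swap(a, b);
        strides.swap(a, b);
        View { data: self.data, shape, strides, offset: self.offset }
    }

    /// Restrict `axis` to `start..end` by shifting the offset; no data is copied.
    fn slice(&self, axis: usize, start: usize, end: usize) -> View<'a> {
        assert!(start <= end && end <= self.shape[axis]);
        let mut shape = self.shape.clone();
        shape[axis] = end - start;
        View {
            data: self.data,
            shape,
            strides: self.strides.clone(),
            offset: self.offset + start * self.strides[axis],
        }
    }
}
```

Both `transpose` and `slice` only rewrite the small shape, stride, and offset fields, so the cost of creating a view does not depend on the size of the tensor.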

GGUF Parser

Full GGUF v2/v3 format support with memory-mapped I/O for efficient large file handling
Metadata extraction (model architecture, tokenizer config, hyperparameters)
Tensor information parsing with automatic alignment handling
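
As a rough sketch of what header parsing and alignment handling involve, the example below reads the fixed-size GGUF v2/v3 header (magic, version, tensor count, metadata key-value count) and rounds an offset up to an alignment boundary. The names `Header`, `parse_header`, and `align_offset` are hypothetical and not taken from this crate. The sketch assumes a little-endian file and the format's default alignment of 32 bytes, which a file can override through the `general.alignment` metadata key.

```rust
const GGUF_MAGIC: [u8; 4] = *b"GGUF";
/// Default tensor data alignment when `general.alignment` is absent.
const DEFAULT_ALIGNMENT: u64 = 32;

#[derive(Debug, PartialEq)]
struct Header {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read a little-endian u32 at byte offset `at`, or None if out of range.
fn read_u32(buf: &[u8], at: usize) -> Option<u32> {
    let mut b = [0u8; 4];
    b.copy_from_slice(buf.get(at..at + 4)?);
    Some(u32::from_le_bytes(b))
}

/// Read a little-endian u64 at byte offset `at`, or None if out of range.
fn read_u64(buf: &[u8], at: usize) -> Option<u64> {
    let mut b = [0u8; 8];
    b.copy_from_slice(buf.get(at..at + 8)?);
    Some(u64::from_le_bytes(b))
}

/// Parse the 24-byte fixed header found at the start of a GGUF v2/v3 file.
fn parse_header(buf: &[u8]) -> Result<Header, String> {
    if buf.get(0..4) != Some(&GGUF_MAGIC[..]) {
        return Err("not a GGUF file (bad magic)".to_string());
    }
    let version = read_u32(buf, 4).ok_or("truncated header")?;
    if version != 2 && version != 3 {
        return Err(format!("unsupported GGUF version {}", version));
    }
    let tensor_count = read_u64(buf, 8).ok_or("truncated header")?;
    let metadata_kv_count = read_u64(buf, 16).ok_or("truncated header")?;
    Ok(Header { version, tensor_count, metadata_kv_count })
}

/// Round `offset` up to the next multiple of `alignment`.
fn align_offset(offset: u64, alignment: u64) -> u64 {
    assert!(alignment > 0);
    offset + (alignment - offset % alignment) % alignment
}
```

For example, `align_offset(33, DEFAULT_ALIGNMENT)` yields 64, while an offset that is already a multiple of the alignment is returned unchanged.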

Quantization Support

F32/F16: Direct extraction and half-precision conversion
Q8_0: 8-bit block quantization (32 elements/block, ~2x compression relative to F16)
Q4_0: 4-bit block quantization (32 elements/block, ~4x compression relative to F16)
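
The block formats above can be sketched as follows. This is an illustration under stated assumptions, not the code in `dequant.rs`: it assumes the standard GGML block layouts (a little-endian F16 scale followed by 32 signed bytes for Q8_0, or by 16 bytes of packed nibbles for Q4_0), and the function names are hypothetical. The half-precision conversion is written by hand so the example has no dependencies.

```rust
/// Elements per quantization block.
const QK: usize = 32;
/// Q8_0: 2-byte F16 scale + 32 signed 8-bit values = 34 bytes.
const Q8_0_BLOCK_BYTES: usize = 2 + QK;
/// Q4_0: 2-byte F16 scale + 16 bytes of packed 4-bit values = 18 bytes.
const Q4_0_BLOCK_BYTES: usize = 2 + QK / 2;

/// Convert IEEE 754 half-precision bits to f32.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = ((bits >> 15) & 1) as u32;
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x3ff) as u32;
    let out = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal: value = frac * 2^-24.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, 0) => (sign << 31) | 0x7f80_0000, // infinity
        (0x1f, _) => (sign << 31) | 0x7fc0_0000 | (frac << 13), // NaN
        // Normal: rebias the exponent from 15 to 127 and widen the fraction.
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(out)
}

/// Dequantize one Q8_0 block: value[i] = scale * q[i].
fn dequantize_q8_0(block: &[u8]) -> [f32; QK] {
    assert_eq!(block.len(), Q8_0_BLOCK_BYTES);
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut out = [0.0f32; QK];
    for (o, &q) in out.iter_mut().zip(&block[2..]) {
        *o = d * (q as i8) as f32;
    }
    out
}

/// Dequantize one Q4_0 block: value = scale * (nibble - 8).
/// Low nibbles hold elements 0..16, high nibbles hold elements 16..32.
fn dequantize_q4_0(block: &[u8]) -> [f32; QK] {
    assert_eq!(block.len(), Q4_0_BLOCK_BYTES);
    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
    let mut out = [0.0f32; QK];
    for (j, &byte) in block[2..].iter().enumerate() {
        let lo = (byte & 0x0f) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = d * lo as f32;
        out[j + QK / 2] = d * hi as f32;
    }
    out
}
```

Under these layouts, 32 weights occupy 64 bytes in F16, 34 bytes in Q8_0, and 18 bytes in Q4_0, which is where the approximate 2x and 4x figures come from.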

Performance

LRU tensor cache with configurable memory limits
Preloading strategies (eager, lazy, selective pattern matching)
Zero-copy memory mapping for model files
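
A byte-budgeted LRU cache of the kind described above might look like the sketch below. `TensorCache` and its methods are hypothetical names for illustration, not the API in `cache.rs`. The sketch tracks recency with a simple `VecDeque`, which costs O(n) per access; a production cache would likely use a linked structure instead.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical LRU cache bounded by total bytes rather than entry count.
struct TensorCache {
    max_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, Vec<f32>>,
    /// Recency order: front is least recently used, back is most recent.
    order: VecDeque<String>,
}

impl TensorCache {
    fn new(max_bytes: usize) -> Self {
        TensorCache {
            max_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    fn byte_size(data: &[f32]) -> usize {
        data.len() * std::mem::size_of::<f32>()
    }

    /// Look up a tensor and mark it as most recently used.
    fn get(&mut self, name: &str) -> Option<&[f32]> {
        if !self.entries.contains_key(name) {
            return None;
        }
        if let Some(pos) = self.order.iter().position(|n| n.as_str() == name) {
            let key = self.order.remove(pos).unwrap();
            self.order.push_back(key);
        }
        self.entries.get(name).map(|v| v.as_slice())
    }

    /// Drop an entry and release its bytes, if present.
    fn remove(&mut self, name: &str) {
        if let Some(old) = self.entries.remove(name) {
            self.used_bytes -= Self::byte_size(&old);
            self.order.retain(|n| n.as_str() != name);
        }
    }

    /// Insert a tensor, evicting least recently used entries until it fits.
    /// Returns false if the tensor alone exceeds the budget.
    fn insert(&mut self, name: &str, data: Vec<f32>) -> bool {
        let size = Self::byte_size(&data);
        if size > self.max_bytes {
            return false;
        }
        self.remove(name); // replace any existing entry with the same name
        while self.used_bytes + size > self.max_bytes {
            let victim = match self.order.pop_front() {
                Some(v) => v,
                None => break,
            };
            if let Some(old) = self.entries.remove(&victim) {
                self.used_bytes -= Self::byte_size(&old);
            }
        }
        self.used_bytes += size;
        self.entries.insert(name.to_string(), data);
        self.order.push_back(name.to_string());
        true
    }
}
```

With a 32-byte budget, inserting two 16-byte tensors `a` and `b`, reading `a`, and then inserting a third 16-byte tensor evicts `b`, since `a` was touched more recently.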

Usage

use llm_engine::gguf::{GgufLoader, TensorExtractor};

// Assumes the crate's error types implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load GGUF model file
    let loader = GgufLoader::open("model.gguf")?;

    // Inspect model metadata
    println!("Architecture: {}", loader.metadata().get_str("general.architecture").unwrap());
    println!("Tensors: {}", loader.tensors().len());

    // Extract and dequantize tensors
    let extractor = TensorExtractor::new(&loader);
    let weights = extractor.extract("model.layers.0.attn.wq.weight")?;
    println!("Shape: {:?}", weights.dims());

    Ok(())
}

Project Structure

src/
├── tensor/          # N-dimensional tensor implementation
│   ├── shape.rs     # Shape and stride abstractions
│   ├── view.rs      # Non-owning tensor views
│   └── ops.rs       # Tensor operations
└── gguf/            # GGUF file format support
    ├── loader.rs    # File loading and memory mapping
    ├── metadata.rs  # Key-value metadata parsing
    ├── extract.rs   # Tensor extraction API
    ├── dequant.rs   # Quantization/dequantization
    └── cache.rs     # Tensor caching layer

Roadmap

Building

cargo build --release
cargo test

License

MIT

tamirkifle/mini-rust-inference

Folders and files

Latest commit

History

Repository files navigation

Minimal LLM Inference Engine

Features

Tensor System

GGUF Parser

Quantization Support

Performance

Usage

Project Structure

Roadmap

Building

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages