A high-performance OLAP (Online Analytical Processing) engine prototype implemented in Rust. This project demonstrates the core principles of modern cloud data warehouses like Snowflake, focusing on vectorized execution, columnar storage, and query optimization.
The engine is built on a modular pipeline designed for high-throughput analytical queries.
Data is stored by column rather than by row, which is optimal for analytical workloads where queries typically access a subset of columns across many rows.
- Custom File Format: Each column is stored in a dedicated `.dat` file with block-level indexing.
- Block-Based Layout: Data is partitioned into blocks, each containing metadata (min/max values, row counts) to enable predicate pushdown and block skipping.
- Metadata Footer: Files include a footer for O(1) access to block offsets.
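To make the block-skipping idea concrete, here is a minimal sketch of per-block metadata and a skip check. The struct and function names are illustrative assumptions, not the project's actual types:

```rust
// Hypothetical per-block metadata of the kind kept in the file footer
// (names are illustrative, not the project's actual API).
struct BlockMeta {
    min: i64,
    max: i64,
    row_count: usize,
    offset: u64, // byte offset of the block, recorded in the footer
}

// For a predicate `col > threshold`, a block whose max is <= threshold
// cannot contain a matching row and can be skipped without any I/O.
fn can_skip_gt(block: &BlockMeta, threshold: i64) -> bool {
    block.max <= threshold
}

fn main() {
    let blocks = vec![
        BlockMeta { min: 0, max: 99, row_count: 1000, offset: 0 },
        BlockMeta { min: 100, max: 199, row_count: 1000, offset: 4096 },
    ];
    // A filter `WHERE x > 150` only needs the second block.
    let scanned = blocks.iter().filter(|b| !can_skip_gt(b, 150)).count();
    println!("{scanned}");
}
```

Because the footer stores every block's offset, a scan can seek directly to the surviving blocks rather than reading the file sequentially.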
To reduce I/O bottlenecks, the engine implements specialized encodings:
- Dictionary Encoding: Compresses string columns by mapping unique strings to integer IDs.
- Run-Length Encoding (RLE): Compresses low-cardinality integer columns by storing sequences of identical values as (value, count) pairs.
- Plain Encoding: Standard little-endian binary representation for high-cardinality numeric data.
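The two compressed encodings above can be sketched in a few lines each; the function names here are illustrative, not the project's actual encoder API:

```rust
// Run-length encoding: collapse runs of identical values into (value, count) pairs.
fn rle_encode(values: &[i64]) -> Vec<(i64, u32)> {
    let mut runs: Vec<(i64, u32)> = Vec::new();
    for &v in values {
        match runs.last_mut() {
            Some((last, count)) if *last == v => *count += 1,
            _ => runs.push((v, 1)),
        }
    }
    runs
}

// Decode runs back into a flat column (the bulk expansion step that makes
// RLE scans cheap on memory bandwidth).
fn rle_decode(runs: &[(i64, u32)]) -> Vec<i64> {
    runs.iter()
        .flat_map(|&(v, n)| std::iter::repeat(v).take(n as usize))
        .collect()
}

// Dictionary encoding: map each unique string to a small integer ID.
fn dict_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut dict: Vec<String> = Vec::new();
    let mut ids = Vec::new();
    for &v in values {
        let id = match dict.iter().position(|d| d.as_str() == v) {
            Some(i) => i as u32,
            None => {
                dict.push(v.to_string());
                (dict.len() - 1) as u32
            }
        };
        ids.push(id);
    }
    (dict, ids)
}

fn main() {
    let col = [7i64, 7, 7, 3, 3, 9];
    let runs = rle_encode(&col);
    assert_eq!(runs, vec![(7, 3), (3, 2), (9, 1)]);
    assert_eq!(rle_decode(&runs), col);

    let (dict, ids) = dict_encode(&["us", "eu", "us"]);
    assert_eq!(dict, vec!["us", "eu"]);
    assert_eq!(ids, vec![0, 1, 0]);
}
```

A production encoder would use a hash map rather than a linear scan for dictionary lookups, but the on-disk representations are the same.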
Instead of processing one row at a time (the Volcano model), this engine processes batches of rows (RecordBatches) using ColumnVectors.
- CPU Cache Efficiency: Processing batches improves instruction cache locality.
- SIMD Friendly: The layout is designed to leverage SIMD (Single Instruction, Multiple Data) instructions for parallel data processing.
- Type-Safe Vectors: Implements specialized vectors for Int32, Int64, Float64, and String.
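A minimal sketch of what such a batch might look like, assuming a tagged-enum design (the names `ColumnVector` and `RecordBatch` follow the text above, but the exact definitions are assumptions):

```rust
// Illustrative type-safe column vector with the four variants described above.
enum ColumnVector {
    Int32(Vec<i32>),
    Int64(Vec<i64>),
    Float64(Vec<f64>),
    Str(Vec<String>),
}

impl ColumnVector {
    fn len(&self) -> usize {
        match self {
            ColumnVector::Int32(v) => v.len(),
            ColumnVector::Int64(v) => v.len(),
            ColumnVector::Float64(v) => v.len(),
            ColumnVector::Str(v) => v.len(),
        }
    }
}

// A RecordBatch is a set of equally sized column vectors; operators consume
// and produce whole batches instead of single rows.
struct RecordBatch {
    columns: Vec<ColumnVector>,
}

fn main() {
    let batch = RecordBatch {
        columns: vec![
            ColumnVector::Int64(vec![1, 2, 3]),
            ColumnVector::Str(vec!["a".into(), "b".into(), "c".into()]),
        ],
    };
    assert!(batch.columns.iter().all(|c| c.len() == 3));
}
```

Keeping each variant a contiguous `Vec` of a primitive type is what makes the tight per-batch loops cache-friendly and amenable to SIMD auto-vectorization.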
The engine provides a full pipeline from raw SQL to results:
- SQL Parser: Leverages `sqlparser-rs` to convert SQL strings into an AST.
- Logical Planner: Translates the AST into a relational algebra tree (Projection, Filter, Aggregate, Scan).
- Physical Planner: Converts the logical plan into executable operators with optimizations:
- Projection Pruning: Only columns referenced in the query are read from disk.
- Predicate Pushdown (architecture in place): Filters are evaluated as early as possible in the plan.
- Operators: Includes implementations for Scan, Filter, Hash-based Aggregation (Group By), and Projection.
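The core of the hash-based aggregation operator can be sketched as a columnar GROUP BY over key and value vectors (a simplified stand-in for the engine's actual operator, which works on whole batches):

```rust
use std::collections::HashMap;

// Minimal hash-based GROUP BY ... SUM over two parallel column slices.
fn hash_agg(keys: &[i64], vals: &[i64]) -> HashMap<i64, i64> {
    let mut groups = HashMap::new();
    for (&k, &v) in keys.iter().zip(vals) {
        *groups.entry(k).or_insert(0) += v;
    }
    groups
}

fn main() {
    // Equivalent to: SELECT key, SUM(val) FROM t GROUP BY key
    let keys = [1, 2, 1, 2, 1];
    let vals = [10, 20, 30, 40, 50];
    let g = hash_agg(&keys, &vals);
    assert_eq!(g[&1], 90);
    assert_eq!(g[&2], 60);
}
```

Because the operator consumes column slices rather than row structs, the inner loop touches only the two arrays it needs, which pairs naturally with projection pruning.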
Performance is scaled across multiple CPU cores using rayon:
- Parallel Scanning: Multiple column blocks are read and decoded in parallel.
- Parallel Aggregation: Partial aggregation states are computed on separate threads and merged in a final reduce step.
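The partial-aggregation-then-merge pattern looks like this in miniature. The project uses `rayon`; this sketch uses `std::thread` only to stay dependency-free, and the function names are illustrative:

```rust
use std::collections::HashMap;
use std::thread;

// Aggregate one chunk of the data into a partial state.
fn agg(keys: &[i64], vals: &[i64]) -> HashMap<i64, i64> {
    let mut m = HashMap::new();
    for (&k, &v) in keys.iter().zip(vals) {
        *m.entry(k).or_insert(0) += v;
    }
    m
}

fn main() {
    let keys: Vec<i64> = (0..1000).map(|i| i % 4).collect();
    let vals: Vec<i64> = (0..1000).collect();

    // Split the input and compute partial states on separate threads.
    let mid = keys.len() / 2;
    let (k1, v1) = (keys[..mid].to_vec(), vals[..mid].to_vec());
    let (k2, v2) = (keys[mid..].to_vec(), vals[mid..].to_vec());
    let h1 = thread::spawn(move || agg(&k1, &v1));
    let h2 = thread::spawn(move || agg(&k2, &v2));

    // Final reduce step: merge the partial hash maps.
    let mut merged = h1.join().unwrap();
    for (k, v) in h2.join().unwrap() {
        *merged.entry(k).or_insert(0) += v;
    }

    // The merged result matches a single-threaded aggregation.
    assert_eq!(merged, agg(&keys, &vals));
}
```

Summation is associative and commutative, which is what makes the per-thread partial states safe to merge in any order; with `rayon` the same shape is typically expressed as a `map` over chunks followed by a `reduce`.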
The project uses criterion for high-precision performance measurement. Benchmarks were conducted on a 100,000-row dataset to measure the efficiency of the vectorized execution and the impact of compression.
- Scan (Plain): ~78 µs. High-throughput sequential I/O and minimal per-batch overhead.
- Scan (RLE Compressed): ~49 µs. 37% faster than the plain scan; RLE reduces memory bandwidth pressure by decoding repeated values in bulk.
- Aggregation (Serial): ~1.99 ms. Includes scanning and single-threaded hash aggregation.
- SQL Full Pipeline: ~2.11 ms. End-to-end execution time from a raw SQL string to the final results, including parsing, planning, scanning, filtering, and aggregation.
- Vectorized Decoding: Encoders are designed to produce `ColumnVector` batches directly, maximizing instruction throughput.
- Parallel Scaling: The engine uses `rayon` to scale performance across available CPU cores for large datasets.
- Zero-Copy Intent: Where possible, data is moved through the pipeline in batches to minimize allocations and copies.
To run the benchmarks:
`cargo bench`

The engine features a consolidated, comprehensive test suite in `tests/test_cases.rs`, covering:
- End-to-end SQL integration (complex joins and aggregations).
- Dictionary and RLE encoding correctness.
- Parallel execution stress tests.
- Edge cases (empty tables, integer overflows, wide tables).
To run the tests:
`cargo test`

- Language: Rust (2024 Edition)
- Key Dependencies: `rayon` (parallelism), `sqlparser` (SQL frontend), `serde` (metadata serialization), `byteorder` (binary I/O).
This engine was designed to mirror the "push-based" vectorized execution seen in modern systems. By decoupling the storage format from the execution operators, the system allows for independent scaling of the storage and compute layers, a key characteristic of Snowflake's architecture.
This project is an embedded engine prototype designed for high-performance analytical workloads. You can interact with it in three ways:
- Integration Tests: Run the full suite to see the engine in action.
  `cargo test -- --nocapture`
- Direct Execution: Modify `src/main.rs` to build custom tables and run SQL queries using the `SQLParser` and `PhysicalPlanner` modules.
- Library Integration: Include this project as a Rust library dependency to leverage vectorized columnar processing in your own applications.