Skip to content

Enhanced Learned Bloom Filter with cache optimization, incremental learning, and adaptive threshold control

Notifications You must be signed in to change notification settings

Miguel2604/EnhancedBloomFilter

Repository files navigation

Enhanced Learned Bloom Filter

A comprehensive implementation of Enhanced Learned Bloom Filters that addresses three critical problems in traditional Learned Bloom Filters: poor cache locality, expensive retraining, and unstable false positive rates.

🎯 Overview

This project implements an enhanced version of Learned Bloom Filters (LBF) that combines machine learning with traditional probabilistic data structures to achieve:

  • Lowest FPR - 0.2% average (5x better than Standard Bloom Filter)
  • O(1) update complexity - 0.007ms per update with incremental learning
  • Β±10% FPR variance - Stable performance through adaptive control
  • Cache-optimized architecture - 64-byte alignment, SIMD vectorization

πŸ“‹ Documentation

πŸ“Š Key Features

Three Core Enhancements

  1. Cache-Aligned Memory Layout

    • 64-byte aligned blocks matching CPU cache lines
    • SIMD vectorization for batch operations
    • Reduces cache misses from 70% to ~25%
  2. Incremental Online Learning

    • Passive-Aggressive classifier for O(1) updates
    • Sliding window and reservoir sampling
    • Eliminates expensive O(n) retraining
  3. Adaptive Threshold Control

    • PID controller for FPR stabilization
    • Real-time monitoring and adjustment
    • Maintains stable performance under varying workloads

πŸš€ Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/BloomFilter.git
cd BloomFilter

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Basic Usage

from src.enhanced_lbf.combined import CombinedEnhancedLBF

# Create enhanced filter with all optimizations
filter = CombinedEnhancedLBF(
    initial_positive_set=positive_items,
    initial_negative_set=negative_items,
    target_fpr=0.01,
    enable_cache_opt=True,
    enable_incremental=True,
    enable_adaptive=True
)

# Add items (O(1) complexity)
filter.add("new_item", label=1)

# Query items
if filter.query("test_item"):
    print("Item might be in the set")

# Batch operations (SIMD optimized)
results = filter.batch_query(["item1", "item2", "item3"])

# Get statistics
stats = filter.get_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")

πŸ“ Project Structure

BloomFilter/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ bloom_filter/
β”‚   β”‚   └── standard.py          # Traditional Bloom Filter
β”‚   β”œβ”€β”€ learned_bloom_filter/
β”‚   β”‚   └── basic_lbf.py         # Basic Learned Bloom Filter
β”‚   β”œβ”€β”€ enhanced_lbf/
β”‚   β”‚   β”œβ”€β”€ cache_aligned.py     # Cache optimization
β”‚   β”‚   β”œβ”€β”€ incremental.py       # Online learning
β”‚   β”‚   β”œβ”€β”€ adaptive.py          # Adaptive threshold
β”‚   β”‚   └── combined.py          # All enhancements
β”‚   └── utils/
β”œβ”€β”€ benchmarks/
β”‚   └── comprehensive_benchmark.py # Performance evaluation
β”œβ”€β”€ experiments/
β”‚   └── problem_demonstration.py  # Problem demonstrations
β”œβ”€β”€ tests/
β”‚   └── test_standard_bf.py      # Unit tests
└── requirements.txt

πŸ§ͺ Running Tests

# Run unit tests
pytest tests/

# Run comprehensive benchmarks (with real-world data)
python benchmarks/comparative_analysis_realworld.py

# Validate testing methodology
python docs/testing/verify_testing_methodology.py

# Problem demonstrations
python experiments/problem_demonstration.py

# Test individual enhancements
python src/enhanced_lbf/cache_aligned.py    # Cache solution
python src/enhanced_lbf/incremental.py      # Incremental learning
python src/enhanced_lbf/adaptive.py         # Adaptive threshold
python src/enhanced_lbf/combined.py         # All enhancements

πŸ“ˆ Performance Results

False Positive Rate (Lower is Better)

Filter Average FPR vs Standard BF
Enhanced LBF 0.20% 5x better πŸ†
Counting BF 0.65% 1.6x better
Standard BF 1.03% Baseline
Cuckoo Filter 2.33% 2.3x worse

Query Throughput

Filter Throughput vs Standard BF
Standard BF 3.4M ops/sec Baseline πŸ†
Cuckoo Filter 2.5M ops/sec 0.74x
Counting BF 2.3M ops/sec 0.68x
Enhanced LBF 270K ops/sec 0.08x ⚠️

Enhanced LBF trades throughput for superior accuracy

Update Performance

Implementation Update Time Complexity
Standard BF N/A Rebuild required
Enhanced LBF 0.007ms O(1) βœ…
Counting BF <0.001ms O(1)

FPR Stability

Implementation FPR Variance
Basic LBF Β±800%
Enhanced LBF Β±10% βœ…

See docs/RESULTS.md for detailed performance analysis.

πŸ”¬ Algorithms Used

Problem 1: Cache Misses

  • Cache-line aligned blocks (64 bytes)
  • SIMD vectorization (AVX2, 8-wide)
  • Prefetching strategies
  • Blocked memory layout

Problem 2: Expensive Retraining

  • Passive-Aggressive classifier (online learning)
  • Sliding window (bounded memory)
  • Reservoir sampling (representative history)
  • Dynamic backup filters

Problem 3: FPR Instability

  • PID controller (Kp=2.0, Ki=0.5, Kd=0.1)
  • Count-Min Sketch (frequency tracking)
  • Exponential moving average
  • Adaptive threshold adjustment

πŸŽ“ Academic Context

This implementation is based on the paper:

Kraska, T., et al. (2018). "The Case for Learned Index Structures." SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data.

Our enhancements address critical limitations identified in production deployments of Learned Bloom Filters.

🀝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues.

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run linters
flake8 src/
black src/

# Run type checking
mypy src/

πŸ“Š Benchmark Results

Run the comprehensive benchmark suite to reproduce results:

python benchmarks/comprehensive_benchmark.py

Results will be saved to data/results/ with detailed performance metrics.

πŸ” Key Findings

  1. Lowest FPR achieved - 0.2% average across real-world datasets (5x better than Standard BF)
  2. O(1) incremental updates - Eliminates expensive retraining bottleneck
  3. Adaptive control stabilizes FPR - Β±10% variance vs Β±800% in basic LBF
  4. Trade-offs are real - 12x throughput penalty for superior accuracy
  5. Methodology matters - Fixed data leakage issue in testing (see docs/METHODOLOGY.md)

⚠️ Important Note on Testing

October 25, 2025 Update: We discovered and fixed a data leakage issue in the original testing methodology where training and test sets had 100% overlap. The corrected methodology uses proper 80/20 train/test split with no overlap. FPR results improved with the fix (0.1-0.3% vs 0.6-0.9% previously). See docs/METHODOLOGY.md for details.

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ‘₯ Authors

  • Miguel - Implementation and enhancements
  • Based on research by Tim Kraska et al. (MIT CSAIL)

πŸ™ Acknowledgments

  • Original Learned Bloom Filter paper authors
  • scikit-learn for ML implementations
  • NumPy for efficient numerical operations

πŸ“š Further Reading


Note: This is a research implementation optimized for clarity and correctness. Production deployments should consider additional optimizations and security considerations.

About

Enhanced Learned Bloom Filter with cache optimization, incremental learning, and adaptive threshold control

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages