Enhanced Learned Bloom Filter

A comprehensive implementation of Enhanced Learned Bloom Filters that addresses three critical problems in traditional Learned Bloom Filters: poor cache locality, expensive retraining, and unstable false positive rates.

🎯 Overview

This project implements an enhanced version of Learned Bloom Filters (LBF) that combines machine learning with traditional probabilistic data structures to achieve:

Lowest FPR - 0.2% average (5x better than Standard Bloom Filter)
O(1) update complexity - 0.007ms per update with incremental learning
±10% FPR variance - Stable performance through adaptive control
Cache-optimized architecture - 64-byte alignment, SIMD vectorization

📋 Documentation

Results - Performance metrics and analysis
Comparative Analysis - Comparison with 6 other filters
Methodology - Testing approach and validation
Testing Fix Summary - Data leakage fix details

📊 Key Features

Three Core Enhancements

Cache-Aligned Memory Layout
- 64-byte aligned blocks matching CPU cache lines
- SIMD vectorization for batch operations
- Reduces cache misses from 70% to ~25%
Incremental Online Learning
- Passive-Aggressive classifier for O(1) updates
- Sliding window and reservoir sampling
- Eliminates expensive O(n) retraining
Adaptive Threshold Control
- PID controller for FPR stabilization
- Real-time monitoring and adjustment
- Maintains stable performance under varying workloads

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/BloomFilter.git
cd BloomFilter

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Basic Usage

from src.enhanced_lbf.combined import CombinedEnhancedLBF

# Create enhanced filter with all optimizations
filter = CombinedEnhancedLBF(
    initial_positive_set=positive_items,
    initial_negative_set=negative_items,
    target_fpr=0.01,
    enable_cache_opt=True,
    enable_incremental=True,
    enable_adaptive=True
)

# Add items (O(1) complexity)
filter.add("new_item", label=1)

# Query items
if filter.query("test_item"):
    print("Item might be in the set")

# Batch operations (SIMD optimized)
results = filter.batch_query(["item1", "item2", "item3"])

# Get statistics
stats = filter.get_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")

📁 Project Structure

BloomFilter/
├── src/
│   ├── bloom_filter/
│   │   └── standard.py          # Traditional Bloom Filter
│   ├── learned_bloom_filter/
│   │   └── basic_lbf.py         # Basic Learned Bloom Filter
│   ├── enhanced_lbf/
│   │   ├── cache_aligned.py     # Cache optimization
│   │   ├── incremental.py       # Online learning
│   │   ├── adaptive.py          # Adaptive threshold
│   │   └── combined.py          # All enhancements
│   └── utils/
├── benchmarks/
│   └── comprehensive_benchmark.py # Performance evaluation
├── experiments/
│   └── problem_demonstration.py  # Problem demonstrations
├── tests/
│   └── test_standard_bf.py      # Unit tests
└── requirements.txt

🧪 Running Tests

# Run unit tests
pytest tests/

# Run comprehensive benchmarks (with real-world data)
python benchmarks/comparative_analysis_realworld.py

# Validate testing methodology
python docs/testing/verify_testing_methodology.py

# Problem demonstrations
python experiments/problem_demonstration.py

# Test individual enhancements
python src/enhanced_lbf/cache_aligned.py    # Cache solution
python src/enhanced_lbf/incremental.py      # Incremental learning
python src/enhanced_lbf/adaptive.py         # Adaptive threshold
python src/enhanced_lbf/combined.py         # All enhancements

📈 Performance Results

False Positive Rate (Lower is Better)

Filter	Average FPR	vs Standard BF
Enhanced LBF	0.20%	5x better 🏆
Counting BF	0.65%	1.6x better
Standard BF	1.03%	Baseline
Cuckoo Filter	2.33%	2.3x worse

Query Throughput

Filter	Throughput	vs Standard BF
Standard BF	3.4M ops/sec	Baseline 🏆
Cuckoo Filter	2.5M ops/sec	0.74x
Counting BF	2.3M ops/sec	0.68x
Enhanced LBF	270K ops/sec	0.08x ⚠️

Enhanced LBF trades throughput for superior accuracy

Update Performance

Implementation	Update Time	Complexity
Standard BF	N/A	Rebuild required
Enhanced LBF	0.007ms	O(1) ✅
Counting BF	<0.001ms	O(1)

FPR Stability

Implementation	FPR Variance
Basic LBF	±800%
Enhanced LBF	±10% ✅

See docs/RESULTS.md for detailed performance analysis.

🔬 Algorithms Used

Problem 1: Cache Misses

Cache-line aligned blocks (64 bytes)
SIMD vectorization (AVX2, 8-wide)
Prefetching strategies
Blocked memory layout

Problem 2: Expensive Retraining

Passive-Aggressive classifier (online learning)
Sliding window (bounded memory)
Reservoir sampling (representative history)
Dynamic backup filters

Problem 3: FPR Instability

PID controller (Kp=2.0, Ki=0.5, Kd=0.1)
Count-Min Sketch (frequency tracking)
Exponential moving average
Adaptive threshold adjustment

🎓 Academic Context

This implementation is based on the paper:

Kraska, T., et al. (2018). "The Case for Learned Index Structures." SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data.

Our enhancements address critical limitations identified in production deployments of Learned Bloom Filters.

🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues.

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run linters
flake8 src/
black src/

# Run type checking
mypy src/

📊 Benchmark Results

Run the comprehensive benchmark suite to reproduce results:

python benchmarks/comprehensive_benchmark.py

Results will be saved to data/results/ with detailed performance metrics.

🔍 Key Findings

Lowest FPR achieved - 0.2% average across real-world datasets (5x better than Standard BF)
O(1) incremental updates - Eliminates expensive retraining bottleneck
Adaptive control stabilizes FPR - ±10% variance vs ±800% in basic LBF
Trade-offs are real - 12x throughput penalty for superior accuracy
Methodology matters - Fixed data leakage issue in testing (see docs/METHODOLOGY.md)

⚠️ Important Note on Testing

October 25, 2025 Update: We discovered and fixed a data leakage issue in the original testing methodology where training and test sets had 100% overlap. The corrected methodology uses proper 80/20 train/test split with no overlap. FPR results improved with the fix (0.1-0.3% vs 0.6-0.9% previously). See docs/METHODOLOGY.md for details.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

Miguel - Implementation and enhancements
Based on research by Tim Kraska et al. (MIT CSAIL)

🙏 Acknowledgments

Original Learned Bloom Filter paper authors
scikit-learn for ML implementations
NumPy for efficient numerical operations

📚 Further Reading

Note: This is a research implementation optimized for clarity and correctness. Production deployments should consider additional optimizations and security considerations.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
benchmarks		benchmarks
data/datasets		data/datasets
docs		docs
experiments		experiments
scripts		scripts
src		src
tests		tests
validation		validation
.gitignore		.gitignore
AGENTS.md		AGENTS.md
HOUSEKEEPING_SUMMARY.md		HOUSEKEEPING_SUMMARY.md
MALICIOUS_URL_APPLICATION.md		MALICIOUS_URL_APPLICATION.md
README.md		README.md
requirements.txt		requirements.txt

Miguel2604/EnhancedBloomFilter

Folders and files

Latest commit

History

Repository files navigation