A comprehensive implementation of Enhanced Learned Bloom Filters that addresses three critical problems in traditional Learned Bloom Filters: poor cache locality, expensive retraining, and unstable false positive rates.
This project implements an enhanced version of Learned Bloom Filters (LBF) that combines machine learning with traditional probabilistic data structures to achieve:
- Lowest FPR - 0.2% average (5x better than Standard Bloom Filter)
- O(1) update complexity - 0.007ms per update with incremental learning
- ±10% FPR variance - Stable performance through adaptive control
- Cache-optimized architecture - 64-byte alignment, SIMD vectorization
**Documentation**

- Results - Performance metrics and analysis
- Comparative Analysis - Comparison with 6 other filters
- Methodology - Testing approach and validation
- Testing Fix Summary - Data leakage fix details
**Cache-Aligned Memory Layout**

- 64-byte aligned blocks matching CPU cache lines
- SIMD vectorization for batch operations
- Reduces cache misses from 70% to ~25%
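For a flavor of the alignment trick, here is a minimal sketch (not the project's internal allocator, and `aligned_bit_array` is a hypothetical name): over-allocate, then offset the view so the backing buffer starts on a 64-byte boundary.

```python
import numpy as np

CACHE_LINE = 64  # bytes per cache line on typical x86-64 CPUs

def aligned_bit_array(num_bits: int, alignment: int = CACHE_LINE) -> np.ndarray:
    """Zeroed uint8 array whose data pointer is alignment-byte aligned."""
    num_bytes = (num_bits + 7) // 8
    raw = np.zeros(num_bytes + alignment, dtype=np.uint8)  # over-allocate by one line
    offset = (-raw.ctypes.data) % alignment  # bytes to the next 64-byte boundary
    view = raw[offset:offset + num_bytes]    # shares raw's buffer, now aligned
    assert view.ctypes.data % alignment == 0
    return view

bits = aligned_bit_array(1 << 20)  # 1 Mbit backing store, cache-line aligned
```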
**Incremental Online Learning**

- Passive-Aggressive classifier for O(1) updates
- Sliding window and reservoir sampling
- Eliminates expensive O(n) retraining
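The idea in miniature, as a sketch built on scikit-learn's `PassiveAggressiveClassifier` (the byte-histogram featurizer below is illustrative, not the project's actual one): each labeled item costs a single `partial_fit` call, so no full retraining pass is ever needed.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

def featurize(key: str) -> np.ndarray:
    """Toy featurizer: a 256-bin byte histogram of the key."""
    vec = np.zeros(256)
    for b in key.encode():
        vec[b] += 1.0
    return vec.reshape(1, -1)  # partial_fit expects a 2D array

clf = PassiveAggressiveClassifier()
clf.partial_fit(featurize("seed.example.com"), [1], classes=[0, 1])  # declare labels once

# O(1) incremental update: one partial_fit per new item, no O(n) retraining
clf.partial_fit(featurize("new-key.example.com"), [1])
score = clf.decision_function(featurize("new-key.example.com"))
```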
**Adaptive Threshold Control**

- PID controller for FPR stabilization
- Real-time monitoring and adjustment
- Maintains stable performance under varying workloads
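A minimal sketch of the control loop, using the PID gains quoted under the technical details later in this README (`FPRController` is a hypothetical name): the observed FPR is compared to the target, and the model's score threshold is nudged by the PID output.

```python
class FPRController:
    """Textbook PID loop driving the learned model's score threshold."""

    def __init__(self, target_fpr: float, kp: float = 2.0, ki: float = 0.5, kd: float = 0.1):
        self.target = target_fpr
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def adjustment(self, observed_fpr: float) -> float:
        error = observed_fpr - self.target  # positive means too many false positives
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = FPRController(target_fpr=0.01)
threshold = 0.5
threshold += pid.adjustment(observed_fpr=0.015)  # raise the bar when FPR drifts up
```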
**Installation**

```bash
# Clone the repository
git clone https://github.com/yourusername/BloomFilter.git
cd BloomFilter

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

**Usage**

```python
from src.enhanced_lbf.combined import CombinedEnhancedLBF
# Create enhanced filter with all optimizations
filter = CombinedEnhancedLBF(
    initial_positive_set=positive_items,
    initial_negative_set=negative_items,
    target_fpr=0.01,
    enable_cache_opt=True,
    enable_incremental=True,
    enable_adaptive=True,
)
# Add items (O(1) complexity)
filter.add("new_item", label=1)
# Query items
if filter.query("test_item"):
    print("Item might be in the set")
# Batch operations (SIMD optimized)
results = filter.batch_query(["item1", "item2", "item3"])
# Get statistics
stats = filter.get_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
```

**Project Structure**

```
BloomFilter/
├── src/
│   ├── bloom_filter/
│   │   └── standard.py                # Traditional Bloom Filter
│   ├── learned_bloom_filter/
│   │   └── basic_lbf.py               # Basic Learned Bloom Filter
│   ├── enhanced_lbf/
│   │   ├── cache_aligned.py           # Cache optimization
│   │   ├── incremental.py             # Online learning
│   │   ├── adaptive.py                # Adaptive threshold
│   │   └── combined.py                # All enhancements
│   └── utils/
├── benchmarks/
│   └── comprehensive_benchmark.py     # Performance evaluation
├── experiments/
│   └── problem_demonstration.py       # Problem demonstrations
├── tests/
│   └── test_standard_bf.py            # Unit tests
└── requirements.txt
```
**Running Tests and Benchmarks**

```bash
# Run unit tests
pytest tests/

# Run comprehensive benchmarks (with real-world data)
python benchmarks/comparative_analysis_realworld.py

# Validate testing methodology
python docs/testing/verify_testing_methodology.py

# Problem demonstrations
python experiments/problem_demonstration.py

# Test individual enhancements
python src/enhanced_lbf/cache_aligned.py  # Cache solution
python src/enhanced_lbf/incremental.py    # Incremental learning
python src/enhanced_lbf/adaptive.py       # Adaptive threshold
python src/enhanced_lbf/combined.py       # All enhancements
```

**Results**

False positive rate:

| Filter | Average FPR | vs Standard BF |
|---|---|---|
| Enhanced LBF | 0.20% | 5x better |
| Counting BF | 0.65% | 1.6x better |
| Standard BF | 1.03% | Baseline |
| Cuckoo Filter | 2.33% | 2.3x worse |
Throughput:

| Filter | Throughput | vs Standard BF |
|---|---|---|
| Standard BF | 3.4M ops/sec | Baseline |
| Cuckoo Filter | 2.5M ops/sec | 0.74x |
| Counting BF | 2.3M ops/sec | 0.68x |
| Enhanced LBF | 270K ops/sec | 0.08x |
Enhanced LBF trades throughput for superior accuracy.

Update performance:

| Implementation | Update Time | Complexity |
|---|---|---|
| Standard BF | N/A | Rebuild required |
| Enhanced LBF | 0.007ms | O(1) ✓ |
| Counting BF | <0.001ms | O(1) |

FPR stability:

| Implementation | FPR Variance |
|---|---|
| Basic LBF | ±800% |
| Enhanced LBF | ±10% ✓ |
See `docs/RESULTS.md` for detailed performance analysis.
**Cache Optimization**

- Cache-line aligned blocks (64 bytes)
- SIMD vectorization (AVX2, 8-wide)
- Prefetching strategies
- Blocked memory layout
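As a rough illustration of why batching helps (a toy sketch, not the filter's internal layout): probing many bit positions per key becomes one NumPy gather plus one reduction, which NumPy executes with vectorized SIMD kernels (AVX2 where available) instead of a Python-level loop.

```python
import numpy as np

bit_array = np.zeros(1 << 20, dtype=np.uint8)  # one byte per bit, for clarity

def batch_contains(probe_positions: np.ndarray) -> np.ndarray:
    """probe_positions: (n_items, k) bit indices; returns one bool per item.

    The gather (fancy indexing) and the all-reduce are each a single
    vectorized call over the whole batch.
    """
    return bit_array[probe_positions].all(axis=1)

probes = np.random.randint(0, bit_array.size, size=(3, 4))
print(batch_contains(probes))  # [False False False] on an empty filter
```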
**Incremental Learning**

- Passive-Aggressive classifier (online learning)
- Sliding window (bounded memory)
- Reservoir sampling (representative history)
- Dynamic backup filters
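Reservoir sampling is what keeps the stored history a uniform sample of the whole stream while memory stays bounded; a self-contained sketch of the classic Algorithm R:

```python
import random

class Reservoir:
    """Uniform sample of a stream using O(capacity) memory (Algorithm R)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: list = []
        self.seen = 0

    def offer(self, item) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)          # fill phase
        else:
            j = random.randrange(self.seen)  # keep item with prob capacity/seen
            if j < self.capacity:
                self.items[j] = item

res = Reservoir(capacity=1_000)
for key in (f"key-{i}" for i in range(100_000)):
    res.offer(key)  # res.items stays a uniform 1k-sample of everything seen
```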
**Adaptive Threshold**

- PID controller (Kp=2.0, Ki=0.5, Kd=0.1)
- Count-Min Sketch (frequency tracking)
- Exponential moving average
- Adaptive threshold adjustment
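A sketch of the two monitoring primitives named above, assuming a toy salted-hash Count-Min Sketch and the standard EMA recurrence (the project's implementations may differ in hashing and sizing):

```python
import numpy as np

class CountMinSketch:
    """Approximate per-key frequencies in a depth x width counter table."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.salts = list(range(depth))  # one hash salt per row

    def _cols(self, key: str):
        return [hash((salt, key)) % self.width for salt in self.salts]

    def add(self, key: str) -> None:
        for row, col in enumerate(self._cols(key)):
            self.table[row, col] += 1

    def estimate(self, key: str) -> int:
        # Collisions only inflate counters, so the row minimum is an upper bound
        return min(self.table[row, col] for row, col in enumerate(self._cols(key)))

def ema(prev: float, sample: float, alpha: float = 0.1) -> float:
    """Exponential moving average: smooths the noisy per-window FPR signal."""
    return alpha * sample + (1 - alpha) * prev

cms = CountMinSketch()
cms.add("hot-key")
fpr_smooth = ema(prev=0.010, sample=0.014)  # feeds the adaptive threshold
```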
This implementation is based on the paper:
Kraska, T., et al. (2018). "The Case for Learned Index Structures." SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data.
Our enhancements address critical limitations identified in production deployments of Learned Bloom Filters.
**Contributing**

Contributions are welcome! Please feel free to submit pull requests or open issues.
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run linters
flake8 src/
black src/

# Run type checking
mypy src/
```

**Reproducing Results**

Run the comprehensive benchmark suite to reproduce results:

```bash
python benchmarks/comprehensive_benchmark.py
```

Results will be saved to `data/results/` with detailed performance metrics.
**Key Findings**

- Lowest FPR achieved - 0.2% average across real-world datasets (5x better than Standard BF)
- O(1) incremental updates - Eliminates expensive retraining bottleneck
- Adaptive control stabilizes FPR - ±10% variance vs ±800% in basic LBF
- Trade-offs are real - 12x throughput penalty for superior accuracy
- Methodology matters - Fixed data leakage issue in testing (see docs/METHODOLOGY.md)
October 25, 2025 Update: We discovered and fixed a data leakage issue in the original testing methodology where training and test sets had 100% overlap. The corrected methodology uses proper 80/20 train/test split with no overlap. FPR results improved with the fix (0.1-0.3% vs 0.6-0.9% previously). See docs/METHODOLOGY.md for details.
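In code, the core of the corrected protocol looks like this (a minimal sketch; `positive_items` is illustrative data and `train_test_split` is scikit-learn's):

```python
from sklearn.model_selection import train_test_split

positive_items = [f"url-{i}" for i in range(10_000)]  # illustrative keys

# 80/20 split; the model never sees the held-out 20% during training
train_pos, test_pos = train_test_split(positive_items, test_size=0.2, random_state=42)
assert not set(train_pos) & set(test_pos)  # zero train/test overlap
```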
This project is licensed under the MIT License - see the LICENSE file for details.
**Acknowledgments**

- Miguel - Implementation and enhancements
- Based on research by Tim Kraska et al. (MIT CSAIL)
- Original Learned Bloom Filter paper authors
- scikit-learn for ML implementations
- NumPy for efficient numerical operations
**References**

- The Case for Learned Index Structures
- Bloom Filter Wikipedia
- Cache-Oblivious Algorithms
- Online Learning Algorithms
Note: This is a research implementation optimized for clarity and correctness. Production deployments should consider additional optimizations and security considerations.