Sideways Information Passing for Joins using Bloom Filters #27
Hi, dear DuckDB team. In this PR, I would like to add a Bloom Filter Pushdown for the left side of the join.
The main idea is that if we detect that a join is selective, we build a bloom filter during join hash table population and then push it down to the probe side during scan. It reuses much of the existing infrastructure for min/max join table filters.
A Bloom filter is a space-efficient probabilistic data structure that can quickly test whether an element is definitely not in a set or possibly in a set, allowing false positives but never false negatives. Based on the hash of the key, we get an offset to a slot and then set 4 bits in this slot, determined by the hash.
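To illustrate that scheme, here is a minimal, hypothetical sketch of such a blocked Bloom filter in C++; the class name, slot layout, and mask derivation are assumptions for illustration, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical blocked Bloom filter: the upper hash bits select one
// 64-bit slot, and 4 bits within that slot are derived from the lower
// hash bits. Names and layout are illustrative, not the PR's code.
class BlockedBloomFilter {
public:
	explicit BlockedBloomFilter(size_t slot_count) : slots(slot_count, 0) {}

	// Derive the 4-bit mask for a slot: take 6 hash bits at a time to
	// pick a bit position in [0, 64).
	static uint64_t MaskFromHash(uint64_t hash) {
		uint64_t mask = 0;
		for (int k = 0; k < 4; k++) {
			mask |= 1ULL << ((hash >> (6 * k)) & 63);
		}
		return mask;
	}

	void Insert(uint64_t hash) {
		slots[(hash >> 32) % slots.size()] |= MaskFromHash(hash);
	}

	// False positives are possible; false negatives are not, because the
	// probe derives exactly the same slot and mask as the insert.
	bool MightContain(uint64_t hash) const {
		uint64_t mask = MaskFromHash(hash);
		return (slots[(hash >> 32) % slots.size()] & mask) == mask;
	}

private:
	std::vector<uint64_t> slots;
};
```

Because insert and probe derive the same slot and mask from a given hash, an inserted key always passes the probe; only other keys that happen to set the same bits cause false positives.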
There are still some things unfinished that I am not sure how to approach, like serialization and planning. Looking forward to the feedback on these points!
Advantages
1. Faster Probing
Probing the Bloom filter is 2-4x faster than probing the hash table, for several reasons: (1) the Bloom filter requires 12 bits per inserted key, while the hash table allocates a 64-bit slot per key, so the Bloom filter is roughly 5x smaller than the hash table; (2) probing the Bloom filter takes fewer than 20 instructions and is completely branchless, unlike the linear-probing code, which has to follow the probing chain.
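To illustrate the branchless claim, a probe loop over a batch of precomputed hashes might look like the following sketch; the function name, slot layout, and mask derivation are illustrative assumptions, not the PR's code:

```cpp
#include <cstddef>
#include <cstdint>

// Branchless probe sketch: build a selection vector of surviving rows
// without a per-row branch by advancing the output cursor by 0 or 1.
size_t ProbeBatch(const uint64_t *slots, size_t slot_count,
                  const uint64_t *hashes, size_t count, uint32_t *sel_out) {
	size_t found = 0;
	for (size_t i = 0; i < count; i++) {
		uint64_t h = hashes[i];
		uint64_t mask = 0;
		for (int k = 0; k < 4; k++) {
			mask |= 1ULL << ((h >> (6 * k)) & 63);
		}
		uint64_t slot = slots[(h >> 32) % slot_count];
		// write unconditionally, then advance by the match result (0 or 1)
		sel_out[found] = static_cast<uint32_t>(i);
		found += static_cast<size_t>((slot & mask) == mask);
	}
	return found;
}
```

The unconditional write plus a 0/1 increment is the usual trick for building selection vectors without data-dependent branches, in contrast to chasing a hash-table probe chain.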
2. Low False-Positive Rate
The false-positive rate of the Bloom filter is ~2%, which is much lower than that of the range-based filter pushdown currently in DuckDB. This means that many tuples are filtered out during the table scan, so fewer tuples need to be decompressed, which is especially helpful for IMDB on the M4, where FSST decompression is a big part of the total execution time.
3. More Optimal Join Plans
By pushing down Bloom filters, we can get more optimal join plans at runtime, which gives us some of the benefits of Robust Predicate Transfer. This is done by decomposing joins into Lookup and Expand, where the Bloom filter is the pushed-down lookup and the actual join is the expand phase. In a pipeline with two joins, where the first join explodes the data size and the second join filters it back down, pushing the Bloom filter to the table scan ensures that only tuples that survive both joins are fed into the expensive first join, avoiding unnecessary intermediate explosion.
Limitations & TODOs
1. Serialization Issues and TableFilter::ToExpression
TableFilters in DuckDB support being serialized and transformed into an expression. With simple predicates, this is trivial, as one only needs to serialize the expression type and the constant. Serializing the Bloom filter table filter would imply that the Bloom filter itself needs to be serialized, and it can get quite big. I am not sure how to approach this. Also, the Bloom filter cannot be turned into an expression. The current hack is to turn it into a constant TRUE. This, of course, needs fixing.
2. Bloom Filter Build & Probe Overhead
Building a Bloom filter adds approximately 20% overhead to the join's build phase. Since Bloom filters only provide benefits for selective joins, we must decide when this overhead is justified. Probing a Bloom filter that is not selective is pure overhead as well.
The ideal case would be the following: during the probe pipeline, we measure the selectivity of the join, and if it is below a certain threshold, we build the Bloom filter and push it down. For this to happen, we need to pause the pipeline to create a task to build the Bloom filter. Pausing pipelines is, to my knowledge, only possible for SINK and SOURCE, and the probe is a streaming operator. Also, the table scan does not allow updates to the table filters while the pipeline is running. Therefore, I settled for a less perfect but easier-to-implement approach: during planning, I check whether the build side is selective by looking for a filter on it. If there is one, I build the Bloom filter during the join's build phase and push it down. When the probe pipeline starts, I check the selectivity of the Bloom filter for the first 20 vectors; if the selectivity is higher than 25%, I disable the Bloom filter.
This means that in the worst case, we only pay the overhead of building the Bloom filter without using it. The picture below illustrates this. It shows the speedup between DuckDB main and this PR for a single-join query. The join is between the same table A, which contains primary keys, but the build side is filtered by the selectivity shown on the X-axis. This means that for a selectivity of 1%, the build side is 1% the size of the probe side, while for a selectivity of 100% (i.e., no filter), the build and probe sides have the same size. We can also see that we get higher speedups for larger probe sides, as the Bloom filter can reside in faster cache levels.
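The sampling logic described above (measure the keep rate over the first 20 vectors and disable the filter above 25%) could be sketched roughly like this; the struct, member names, and exact accounting are hypothetical, not the PR's code:

```cpp
#include <cstddef>

// Hypothetical adaptive monitor: sample the Bloom filter's selectivity
// over the first probe vectors and disable it if it keeps too many rows.
struct BloomFilterMonitor {
	static constexpr size_t SAMPLE_VECTORS = 20;
	static constexpr double DISABLE_THRESHOLD = 0.25; // keep rate above this -> disable

	size_t vectors_seen = 0;
	size_t tuples_in = 0;
	size_t tuples_out = 0;
	bool enabled = true;

	// Call once per probed vector with the input and surviving tuple counts.
	void Update(size_t in, size_t out) {
		if (!enabled || vectors_seen >= SAMPLE_VECTORS) {
			return; // sampling phase is over, keep the current decision
		}
		vectors_seen++;
		tuples_in += in;
		tuples_out += out;
		if (vectors_seen == SAMPLE_VECTORS && tuples_in > 0) {
			double keep_rate = double(tuples_out) / double(tuples_in);
			if (keep_rate > DISABLE_THRESHOLD) {
				enabled = false; // the filter is not selective enough
			}
		}
	}
};
```

With this shape, the worst case is exactly what the text describes: the build overhead is paid, the first 20 vectors are probed, and then the filter is switched off for the rest of the pipeline.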
3. Compatibility with Compressed Materialization of Join Keys
Compressed Materialization sandwiches join operators and temporarily compresses columns to materialize less data. While this is, of course, an awesome optimization, it hinders Bloom filter pushdown, as the compression would also need to be pushed down into the table filters. Currently, DuckDB does not support filter pushdown with compressed materialization, which leaves room for further optimization.
4. Parallelism
The Bloom filter pushdown reduces the need to probe the hash table, which scales well with multiple cores (it is read-only), but requires building the Bloom filter, which does not scale as well as probing because we need atomics to write thread-safely. This means there are higher speedups for lower core counts (see Benchmarking).
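A minimal sketch of that thread-safe build path, assuming a 64-bit slot layout and an atomic OR per insert (class and function names are illustrative, not the PR's code):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Concurrent Bloom filter build: threads OR their bit mask into a slot
// with fetch_or. The atomic read-modify-write is what limits build
// scalability, while the probe side stays read-only and scales well.
class ConcurrentBloomBuild {
public:
	explicit ConcurrentBloomBuild(size_t slot_count) : slots(slot_count) {
		for (auto &s : slots) {
			s.store(0, std::memory_order_relaxed);
		}
	}

	static uint64_t MaskFromHash(uint64_t hash) {
		uint64_t mask = 0;
		for (int k = 0; k < 4; k++) {
			mask |= 1ULL << ((hash >> (6 * k)) & 63);
		}
		return mask;
	}

	void Insert(uint64_t hash) {
		// relaxed ordering suffices here: we only need the bits to end up
		// set before the build phase finishes, not any cross-thread ordering
		slots[(hash >> 32) % slots.size()].fetch_or(MaskFromHash(hash),
		                                            std::memory_order_relaxed);
	}

	bool MightContain(uint64_t hash) const {
		uint64_t mask = MaskFromHash(hash);
		return (slots[(hash >> 32) % slots.size()].load(std::memory_order_relaxed) &
		        mask) == mask;
	}

private:
	std::vector<std::atomic<uint64_t>> slots;
};
```

Each insert is a contended read-modify-write on shared cache lines, which is why the build scales worse than the purely read-only probe.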
5. Not all the benefits of RPT
While this PR supports unidirectional, sideways information passing, Robust Predicate Transfer can also push filter information from the probe side of a join to the build side. Benchmarking this PR against the RPT PR showed that RPT's speedups are >2 times higher.
6. Single Key Bloom Filter
Right now, the Bloom filter only supports single-key joins, as the TableFilter API only operates on a single Vector, not on a DataChunk.
7. Interoperability with Min/Max Filters
Right now, I disabled the Min/Max filters, making them optional, if there is a Bloom filter. This is because, if there is a Bloom filter, the Min/Max filters will remain active even if they return IS_ALWAYS_TRUE, which adds about 3% overhead for both TPC-H and IMDB. Please let me know what your preferred behavior is.
Benchmarking
The following benchmark shows the speedup of this PR over the current main. All benchmarks were run using eight threads.