Make push_batch_with_filter up to 3x faster for primitive types
#8951
base: main
Conversation
Title history:
- Make filtered coalescing faster for primitive
- push_batch_with_filter faster for primitive types
- push_batch_with_filter faster for primitive types: up to 10x faster
- push_batch_with_filter up to 10x faster for primitive types
@alamb you are probably interested in this

YAAAAASSS -- this is exactly the type of thing I was hoping for with BatchCoalescer. I will check this out shortly
| let filtered_batch = filter_record_batch(&batch, filter)?; | ||
| self.push_batch(filtered_batch) | ||
| // We only support primitive now, fallback to filter_record_batch for other types | ||
| // Also, skip optimization when filter is not very selective |
Not sure if it's always better; we might want to take biggest_coalesce_batch_size into account.
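The "skip optimization when the filter is not very selective" gate quoted above could look roughly like the sketch below. This is an illustration only: `use_specialized_path` and the 0.8 cutoff are made-up names and numbers, not the PR's actual API or threshold.

```rust
// Illustrative selectivity gate for the specialized filtered-copy path.
// The function name and the 0.8 cutoff are assumptions for this sketch.
fn use_specialized_path(true_count: usize, len: usize) -> bool {
    if len == 0 {
        return false;
    }
    // Selectivity here = fraction of rows the filter keeps. Copying the
    // selected rows one by one only pays off when most rows are filtered
    // out; otherwise a full filter_record_batch tends to win.
    (true_count as f64 / len as f64) < 0.8
}

fn main() {
    assert!(use_specialized_path(100, 8192)); // keeps ~1%: copy selected rows
    assert!(!use_specialized_path(8000, 8192)); // keeps ~98%: fall back to the generic kernel
}
```

In a real implementation the cutoff would be chosen by benchmarking, and (per the comment above) could also factor in the configured maximum batch size.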
run benchmark filter_kernels

show benchmark queue
🤖 Hi @alamb, you asked to view the benchmark queue (#8951 (comment)).
Hm, it seems it contains a bug; probably makes the benchmark results off as well (will take a look tomorrow).
🤖: Benchmark completed Details

run benchmark coalesce_kernels
🤖 Hi @Dandandan, thanks for the request (#8951 (comment)).
Please choose one or more of these with
- push_batch_with_filter up to 2x faster for primitive types
- push_batch_with_filter up to 3x faster for primitive types
@alamb I think it's ok now - I called AI (Opus 4.5) for some help on the ... Mainly needs some polish and seeing if we can improve the ...

🤖: Benchmark completed Details

🤖: Benchmark completed Details
Hi @Dandandan -- I am working through the arrow-rs review backlog. The coalesce benchmarks look better; the filter kernels look potentially slower. I'll rerun and try to see if we can reproduce the results.

run benchmark filter_kernels

🤖: Benchmark completed Details
run benchmark filter_kernels

I don't think the filter kernels should see any impact other than from the threshold, but that isn't covered by a benchmark.

🤖: Benchmark completed Details

Ok, thank you -- I will plan to review this one carefully in the morning.
alamb left a comment
Thanks @Dandandan -- this looks really exciting; I had a few comments.
I suggest we break this PR up into several smaller ones (now that you have proof the benchmarks are working well):
- Add BooleanBufferBuilder::extend (and tests)
- Add BooleanBuffer::find_nth_set_bit_position (and tests)
- Add the changes to coalesce
arrow-buffer/src/builder/null.rs
| /// assert_eq!(builder.len(), 4); | ||
| /// ``` | ||
| pub fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) { | ||
| let (lower, upper) = iter.size_hint(); |
I think this method would be more generally useful when appending to any BooleanBuffer rather than just NullBufferBuilder.
As part of the goal to consolidate mutable boolean operations in BooleanBufferBuilder so it is easier to find (and optimize) them, would you be willing to move this code to BooleanBufferBuilder so that the code in NullBufferBuilder looks like something like this (which is what most other methods in NullBufferBuilder look like)?
    pub fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) {
        // Materialize since we're about to append bits
        self.materialize_if_needed();
        self.bitmap_builder.as_mut().unwrap().extend(iter)
    }
Sounds good, I'll do that
arrow-buffer/src/builder/null.rs
| let mut iter = iter.peekable(); | ||
| | ||
| // Process full u64 chunks (64 bits at a time) | ||
| while bit_idx + 64 <= end_bit && iter.peek().is_some() { | ||
As a follow on PR, it might be worth aligning first on 64 bit boundaries (so the underlying code doesn't have to handle aligning) -- aka handle bits 0..63 (until 64 bit alignment) specially and then use the u64 path
👍
arrow-buffer/src/builder/null.rs
| } | ||
| let byte_idx = (bit_idx - start_len) / 8 + start_byte; | ||
| // Write the u64 chunk as 8 bytes | ||
| slice[byte_idx..byte_idx + 8].copy_from_slice(&chunk.to_le_bytes()); |
could try unsafe here too as you ensured the right length above
| // Test extend with non-aligned start (tests bit-by-bit path) | ||
| let mut builder = NullBufferBuilder::new(0); | ||
| builder.append_non_null(); // Start at bit 1 (non-aligned) | ||
| builder.extend([false, true, false, true].iter().copied()); |
I think we should probably test non aligned writes with more than 64 bits as well (this only copies 4 bits)
| batch: RecordBatch, | ||
| filter: &BooleanArray, | ||
| ) -> Result<(), ArrowError> { | ||
| // TODO: optimize this to avoid materializing (copying the results |
🎉
| } | ||
| | ||
| /// Find the position after the n-th set bit in a boolean array starting from `start`. | ||
| /// Returns the position after the n-th set bit, or the end of the array if fewer than n bits are set. | ||
I recommend we move this code into BooleanBuffer as well so it is easier to find / reuse
Good idea
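A std-only sketch of what `find_nth_set_bit_position` could do once moved onto `BooleanBuffer`. The `&[u64]` bitmap representation and this exact signature are assumptions for illustration; the point is the middle loop, which skips 64 bits at a time via popcount instead of scanning bit by bit.

```rust
/// Return the position just after the n-th set bit (n >= 1), scanning from
/// `start`; returns `len` if fewer than n bits are set. Sketch over a plain
/// little-endian `&[u64]` bitmap, not the real BooleanBuffer API.
fn find_nth_set_bit_position(words: &[u64], len: usize, start: usize, mut n: usize) -> usize {
    let mut i = start;
    // Align: scan bit-by-bit until the next 64-bit word boundary
    while i < len && i % 64 != 0 {
        if (words[i / 64] >> (i % 64)) & 1 == 1 {
            n -= 1;
            if n == 0 {
                return i + 1;
            }
        }
        i += 1;
    }
    // Skip whole words with popcount while they hold fewer than n set bits
    while i + 64 <= len && (words[i / 64].count_ones() as usize) < n {
        n -= words[i / 64].count_ones() as usize;
        i += 64;
    }
    // Finish bit-by-bit inside the word containing the n-th set bit
    while i < len {
        if (words[i / 64] >> (i % 64)) & 1 == 1 {
            n -= 1;
            if n == 0 {
                return i + 1;
            }
        }
        i += 1;
    }
    len
}

fn main() {
    // bits set at positions 1, 3, 5, 7
    let words = [0b1010_1010u64];
    assert_eq!(find_nth_set_bit_position(&words, 8, 0, 2), 4);
    assert_eq!(find_nth_set_bit_position(&words, 8, 2, 1), 4);
    // fewer than 10 bits set: returns the end of the array
    assert_eq!(find_nth_set_bit_position(&words, 8, 0, 10), 8);
    assert_eq!(find_nth_set_bit_position(&[u64::MAX; 2], 128, 0, 70), 70);
}
```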
| | ||
| /// Copy rows at the given indices from the current source array into the in-progress array | ||
| fn copy_rows_by_filter(&mut self, filter: &FilterPredicate) -> Result<(), ArrowError> { | ||
| // Default implementation: iterate over indices from the filter | ||
It seems like as a follow on we should implement something similar for the byte array filter types? If that is true I'll file a ticket
Correct, for views & byte arrays
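For byte arrays, a follow-on `copy_rows_by_filter` might look like this sketch, which models the offsets/values buffers as plain slices and `Vec`s rather than the real arrow-rs types; the function name and shapes are hypothetical.

```rust
// Hypothetical byte-array (string/binary) variant of copy-by-filter:
// for each kept row, copy its value slice and record the new end offset.
fn copy_string_rows_by_filter(
    offsets: &[i32],
    values: &[u8],
    keep: &[bool],
    out_offsets: &mut Vec<i32>,
    out_values: &mut Vec<u8>,
) {
    for (i, &k) in keep.iter().enumerate() {
        if k {
            let (s, e) = (offsets[i] as usize, offsets[i + 1] as usize);
            out_values.extend_from_slice(&values[s..e]);
            out_offsets.push(out_values.len() as i32);
        }
    }
}

fn main() {
    // Three rows: "foo", "" (empty), "hello"
    let offsets = [0i32, 3, 3, 8];
    let keep = [true, false, true];
    let (mut out_offsets, mut out_values) = (vec![0i32], Vec::new());
    copy_string_rows_by_filter(&offsets, b"foohello", &keep, &mut out_offsets, &mut out_values);
    assert_eq!(out_values, b"foohello".to_vec());
    assert_eq!(out_offsets, vec![0, 3, 8]);
}
```

A view-array version would copy the fixed-size view structs instead and share the underlying data buffers, which is why it is listed as a separate follow-on.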
| self.nulls.append_n_non_nulls(count); | ||
| } | ||
| } | ||
| IterationStrategy::Slices(slices) => { |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sounds like a good plan, I'll follow that!
MVP for apache#8957, awaits apache#8951. Very first version for behaviour review; optimizations TBD.
Signed-off-by: 蔡略 <cailue@apache.org>
arrow-buffer/src/builder/null.rs
| let mut bit_idx = start_len; | ||
| let end_bit = start_len + len; | ||
| | ||
| // Process in chunks of 64 bits when byte-aligned for better performance | ||
I'm a bit curious why this doesn't have separate parts for unaligned and aligned handling, i.e.:
handle_unaligned() // handle the start_len % 8 header
handle_aligned()   // handle inner payloads
handle_unaligned() // handle the trailer
I think this is the same comment as @alamb has?
Yeah perhaps this can improve performance (will see guided by benchmarks).
I just checked - this seems an additional ~30% improvement for null handling:
filter: primitive, 8192, nulls: 0.1, selectivity: 0.1
time: [2.4060 ms 2.4096 ms 2.4133 ms]
change: [−33.920% −32.902% −32.274%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8
time: [2.1610 ms 2.1666 ms 2.1728 ms]
change: [−29.488% −28.499% −27.767%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
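The header/payload/trailer split discussed above can be sketched std-only as below, on a stand-in `BitBuilder` over a plain `Vec<u8>` bitmap (the real code lives in arrow-buffer's NullBufferBuilder/BooleanBufferBuilder). Unaligned leading bits are written one at a time until the write position hits a 64-bit boundary, full 64-bit chunks are then stored as 8 little-endian bytes, and leftover trailing bits are again written individually.

```rust
// Stand-in bit builder illustrating the header/payload/trailer structure.
struct BitBuilder {
    bytes: Vec<u8>,
    len: usize, // length in bits
}

impl BitBuilder {
    fn new() -> Self {
        Self { bytes: Vec::new(), len: 0 }
    }

    fn push(&mut self, bit: bool) {
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if bit {
            self.bytes[self.len / 8] |= 1 << (self.len % 8);
        }
        self.len += 1;
    }

    fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) {
        let mut iter = iter.peekable();
        // Header: write bit-by-bit until the length is 64-bit aligned
        while self.len % 64 != 0 && iter.peek().is_some() {
            let bit = iter.next().unwrap();
            self.push(bit);
        }
        // Payload: pack full 64-bit chunks and append as 8 LE bytes
        loop {
            let mut chunk = 0u64;
            let mut n = 0;
            while n < 64 {
                match iter.next() {
                    Some(true) => chunk |= 1 << n,
                    Some(false) => {}
                    None => break,
                }
                n += 1;
            }
            if n == 64 {
                self.bytes.extend_from_slice(&chunk.to_le_bytes());
                self.len += 64;
            } else {
                // Trailer: fewer than 64 bits remained; write them one by one
                for i in 0..n {
                    self.push((chunk >> i) & 1 == 1);
                }
                break;
            }
        }
    }

    fn get(&self, i: usize) -> bool {
        (self.bytes[i / 8] >> (i % 8)) & 1 == 1
    }
}

fn main() {
    let mut b = BitBuilder::new();
    b.push(true); // start at bit 1, i.e. a non-aligned write position
    b.extend((0..130).map(|i| i % 3 == 0)); // >64 bits, exercising all three phases
    assert_eq!(b.len, 131);
    assert!(b.get(0));
    for i in 0..130 {
        assert_eq!(b.get(i + 1), i % 3 == 0);
    }
}
```

This also shows the kind of test asked for earlier in the review: a non-aligned start followed by more than 64 bits, so the header, payload, and trailer paths all run.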
run benchmark coalesce_kernels

run benchmark coalesce_kernels

🤖: Benchmark completed Details

🤖: Benchmark completed Details
nice

@Dandandan would you like help getting this PR into shape / creating the smaller PRs?

Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?