@Dandandan
Contributor

@Dandandan Dandandan commented Dec 4, 2025

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

filter: primitive, 8192, nulls: 0, selectivity: 0.001
                        time:   [20.430 ms 20.678 ms 21.105 ms]
                        change: [−65.000% −64.516% −63.806%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

filter: primitive, 8192, nulls: 0, selectivity: 0.01
                        time:   [3.3275 ms 3.3451 ms 3.3665 ms]
                        change: [−49.062% −48.663% −48.260%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high severe

Benchmarking filter: primitive, 8192, nulls: 0, selectivity: 0.1: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
filter: primitive, 8192, nulls: 0, selectivity: 0.1
                        time:   [1.4759 ms 1.4887 ms 1.5105 ms]
                        change: [−26.613% −23.553% −15.842%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

Benchmarking filter: primitive, 8192, nulls: 0, selectivity: 0.8: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.9s, enable flat sampling, or reduce sample count to 60.
filter: primitive, 8192, nulls: 0, selectivity: 0.8
                        time:   [1.3569 ms 1.3626 ms 1.3702 ms]
                        change: [−47.225% −46.850% −46.451%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.001
                        time:   [23.231 ms 23.295 ms 23.376 ms]
                        change: [−69.694% −69.516% −69.351%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.01
                        time:   [5.4033 ms 5.4201 ms 5.4424 ms]
                        change: [−49.860% −49.590% −49.325%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.1
                        time:   [3.6111 ms 3.6270 ms 3.6475 ms]
                        change: [−27.778% −26.284% −25.286%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

filter: primitive, 8192, nulls: 0.1, selectivity: 0.8
                        time:   [3.6298 ms 3.7206 ms 3.8600 ms]
                        change: [−26.637% −24.714% −21.997%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
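
A note on reading the `change` lines above: Criterion reports the relative change in time, so a change of −69.5% means the new time is about 30.5% of the old one, which is roughly a 3.3x speedup. A minimal sketch of that conversion follows; the helper name `speedup_from_change` is made up for illustration and is not part of Criterion or arrow-rs.

```rust
/// Convert a relative time change reported by a benchmark
/// (e.g. -0.695 for "-69.5%") into a speedup factor old_time / new_time.
/// Hypothetical helper, for illustration only.
fn speedup_from_change(change: f64) -> f64 {
    // new_time = old_time * (1 + change), so old_time / new_time = 1 / (1 + change)
    1.0 / (1.0 + change)
}
```

With the median estimates above, `speedup_from_change(-0.69516)` comes out at about 3.28 and `speedup_from_change(-0.64516)` at about 2.82.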

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Make filtered coalescing faster for primitive
@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 4, 2025
@Dandandan Dandandan changed the title Make filtered coalescing faster for primitive types Make push_batch_with_filter faster for primitive types Dec 4, 2025
@Dandandan Dandandan changed the title Make push_batch_with_filter faster for primitive types Make push_batch_with_filter faster for primitive types: up to 10x faster Dec 4, 2025
@Dandandan Dandandan changed the title Make push_batch_with_filter faster for primitive types: up to 10x faster Make push_batch_with_filter up to 10x faster for primitive types Dec 4, 2025
@Dandandan
Contributor Author

@alamb you are probably interested in this

@alamb
Contributor

alamb commented Dec 4, 2025

YAAAAASSS -- this is exactly the type of thing I was hoping for with BatchCoalescer. I will check this out shortly

let filtered_batch = filter_record_batch(&batch, filter)?;
self.push_batch(filtered_batch)
// We only support primitive types for now; fall back to filter_record_batch for other types
// Also, skip the optimization when the filter is not very selective
Contributor Author

Not sure if it is always better to take biggest_coalesce_batch_size into account.
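
For context on the snippet under review: the fallback path first materializes a filtered batch and then pushes it, while the idea for primitive columns is to append the selected values straight into the in-progress output. The sketch below is a simplified std-only model of that difference, assuming a plain `Vec<i32>` for the output buffer and a `&[bool]` for the filter. The names `push_two_step` and `push_fused` are hypothetical and this is not the arrow-rs implementation.

```rust
/// Two-step approach (models filter_record_batch followed by push_batch):
/// build an intermediate filtered buffer, then copy it into the output.
fn push_two_step(out: &mut Vec<i32>, values: &[i32], mask: &[bool]) {
    let mut filtered = Vec::new();
    for (v, keep) in values.iter().zip(mask) {
        if *keep {
            filtered.push(*v);
        }
    }
    out.extend_from_slice(&filtered);
}

/// Fused approach (models the primitive fast path): append the selected
/// values directly to the output, with no intermediate allocation or copy.
fn push_fused(out: &mut Vec<i32>, values: &[i32], mask: &[bool]) {
    for (v, keep) in values.iter().zip(mask) {
        if *keep {
            out.push(*v);
        }
    }
}
```

Both functions leave `out` with the same contents; the fused version just skips the temporary buffer, which is where a saving would be expected when many small filtered batches are coalesced.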

@alamb
Contributor

alamb commented Dec 4, 2025

run benchmark filter_kernels

@alamb
Contributor

alamb commented Dec 4, 2025

show benchmark queue

@alamb-ghbot

🤖 Hi @alamb, you asked to view the benchmark queue (#8951 (comment)).

Job User Benchmarks Comment
arrow-8933-3613162300.sh alamb default https://github.com/apache/arrow-rs/pull/8933#issuecomment-3613162300
arrow-8933-3613131981.sh alamb filter_kernels https://github.com/apache/arrow-rs/pull/8933#issuecomment-3613131981
arrow-8951-3613212415.sh alamb filter_kernels https://github.com/apache/arrow-rs/pull/8951#issuecomment-3613212415

@Dandandan
Contributor Author

Dandandan commented Dec 4, 2025

Hm, it seems this contains a bug, which probably throws off the benchmark results as well (will take a look tomorrow).

@Dandandan Dandandan marked this pull request as draft December 4, 2025 17:08
@alamb-ghbot

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (0872a9b) to ed9efe7 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                         coalesce_batches_filter                main
-----                                                                         -----------------------                ----
filter context decimal128 (kept 1/2)                                          1.36     57.5±5.45µs        ? ?/sec    1.00     42.1±1.93µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.00     55.7±4.51µs        ? ?/sec    1.09     60.5±0.29µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    242.4±0.35ns        ? ?/sec    1.06    256.0±1.60ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     77.7±1.20µs        ? ?/sec    1.00     78.0±2.52µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.00      9.9±0.32µs        ? ?/sec    1.01     10.1±0.30µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00    444.2±7.59ns        ? ?/sec    1.06   469.4±13.36ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     60.7±1.16µs        ? ?/sec    1.00     60.7±0.37µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     60.7±0.36µs        ? ?/sec    1.00     60.7±0.56µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     60.6±0.26µs        ? ?/sec    1.00     60.8±1.05µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     60.8±1.45µs        ? ?/sec    1.00     60.7±1.02µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     60.7±0.71µs        ? ?/sec    1.00     60.8±1.22µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.01     61.2±3.05µs        ? ?/sec    1.00     60.8±0.90µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     60.7±0.46µs        ? ?/sec    1.00     60.8±0.46µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     61.0±2.06µs        ? ?/sec    1.00     60.7±0.55µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     60.8±1.25µs        ? ?/sec    1.00     60.8±1.00µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.01     16.6±0.28µs        ? ?/sec    1.00     16.5±0.30µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.04      6.5±0.20µs        ? ?/sec    1.00      6.2±0.17µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.00    236.0±5.78ns        ? ?/sec    1.05    246.9±1.45ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.00     77.8±2.17µs        ? ?/sec    1.00     77.9±0.80µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.00     10.1±0.52µs        ? ?/sec    1.04     10.5±0.18µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.00    446.9±4.94ns        ? ?/sec    1.06    471.6±6.49ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.00    109.0±3.21µs        ? ?/sec    1.11    120.7±3.20µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     53.9±2.45µs        ? ?/sec    1.03     55.3±2.41µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.00   654.9±19.57ns        ? ?/sec    1.04   677.9±18.99ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    104.2±1.47µs        ? ?/sec    1.08    112.2±3.44µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.02     55.5±1.25µs        ? ?/sec    1.00     54.5±0.23µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    464.2±2.70ns        ? ?/sec    1.06    491.4±7.75ns        ? ?/sec
filter context string (kept 1/2)                                              1.03   599.4±17.30µs        ? ?/sec    1.00    582.1±5.14µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00     17.0±0.13µs        ? ?/sec    1.02     17.3±0.27µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00      7.0±0.34µs        ? ?/sec    1.02      7.2±0.27µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.02    847.3±9.58ns        ? ?/sec    1.00    829.8±3.84ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00     78.8±1.05µs        ? ?/sec    1.00     78.9±2.34µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00     10.7±0.41µs        ? ?/sec    1.01     10.8±0.35µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.01  1076.9±14.42ns        ? ?/sec    1.00  1067.4±30.14ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.00   703.0±13.80µs        ? ?/sec    1.00   703.8±19.93µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.00  1016.7±52.17ns        ? ?/sec    1.02  1036.2±34.58ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     14.9±0.05µs        ? ?/sec    1.00     15.0±0.14µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.00  1829.3±23.69ns        ? ?/sec    1.11      2.0±0.01µs        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.00    231.0±5.30ns        ? ?/sec    1.03    238.8±0.83ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.00     75.9±0.20µs        ? ?/sec    1.00     76.1±0.78µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.00      5.1±0.08µs        ? ?/sec    1.05      5.4±0.06µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00   441.3±12.39ns        ? ?/sec    1.06    467.4±2.32ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.00     49.5±0.83µs        ? ?/sec    1.18     58.6±2.81µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.17     61.3±2.70µs        ? ?/sec    1.00     52.6±1.25µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      2.9±0.09µs        ? ?/sec    1.13      3.2±0.08µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.07    166.6±7.99µs        ? ?/sec    1.00    156.4±2.84µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.12    141.6±1.19µs        ? ?/sec    1.00    126.0±3.73µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.11     76.6±1.07µs        ? ?/sec    1.00     68.7±1.04µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      2.7±0.09µs        ? ?/sec    1.29      3.5±0.10µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.17    141.8±2.30µs        ? ?/sec    1.00    121.1±0.87µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.00     10.8±0.16µs        ? ?/sec    1.05     11.3±0.33µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      2.6±0.08µs        ? ?/sec    1.28      3.3±0.02µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.05    189.3±7.05µs        ? ?/sec    1.00    181.1±9.22µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    255.5±8.77µs        ? ?/sec    1.03    264.3±6.26µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      2.6±0.03µs        ? ?/sec    1.27      3.3±0.10µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.25     53.8±0.68µs        ? ?/sec    1.00     43.2±0.31µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.05      8.9±0.48µs        ? ?/sec    1.00      8.4±0.32µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.24      2.9±0.06µs        ? ?/sec    1.00      2.4±0.03µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.01     54.8±2.99µs        ? ?/sec    1.00     54.5±1.51µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.18      3.1±0.14µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.7±0.00µs        ? ?/sec    1.00      2.7±0.02µs        ? ?/sec
filter run array (kept 1/2)                                                   1.03   436.4±17.42µs        ? ?/sec    1.00    422.5±4.27µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.01    452.6±7.45µs        ? ?/sec    1.00   449.3±12.94µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.01   336.4±10.57µs        ? ?/sec    1.00    334.5±2.82µs        ? ?/sec
filter single record batch                                                    1.23     54.3±2.92µs        ? ?/sec    1.00     44.2±0.07µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.00     45.5±0.99µs        ? ?/sec    1.00     45.7±0.44µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.05      4.0±0.11µs        ? ?/sec    1.00      3.8±0.04µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      3.0±0.05µs        ? ?/sec    1.12      3.3±0.11µs        ? ?/sec

@Dandandan
Contributor Author

run benchmark coalesce_kernels

@alamb-ghbot

🤖 Hi @Dandandan, thanks for the request (#8951 (comment)).

scrape_comments.py only supports whitelisted benchmarks.

  • Standard: (none)
  • Criterion: arrow_reader, concatenate_kernels, filter_kernels

Please choose one or more of these with run benchmark <name> or run benchmark <name1> <name2>...
Unsupported benchmarks: coalesce_kernels.

@Dandandan Dandandan changed the title Make push_batch_with_filter up to 10x faster for primitive types Make push_batch_with_filter up to 2x faster for primitive types Dec 4, 2025
@Dandandan Dandandan changed the title Make push_batch_with_filter up to 2x faster for primitive types Make push_batch_with_filter up to 3x faster for primitive types Dec 4, 2025
@Dandandan
Contributor Author

@alamb I think it's ok now - I called AI (Opus 4.5) for some help on the find_nth_set_bit_position function.

It mainly needs some polish, and I want to see if we can improve the filter: primitive, 8192, nulls: 0.1, selectivity: 0.8 case.
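
To illustrate the kind of helper mentioned above: finding the position of the n-th set bit in a filter bitmap is presumably useful for splitting a filter at the point where the in-progress batch becomes full. The following is a minimal std-only sketch, assuming the bitmap is a slice of `u64` words with bit 0 of word 0 as position 0 and a zero-based `n`. The name `nth_set_bit_position` and its signature are assumptions for illustration, not the function from this PR.

```rust
/// Return the bit position of the n-th (zero-based) set bit in `words`,
/// or None if fewer than n + 1 bits are set.
/// Hypothetical stand-in; the bitmap layout is an assumption.
fn nth_set_bit_position(words: &[u64], mut n: usize) -> Option<usize> {
    for (i, &word) in words.iter().enumerate() {
        let ones = word.count_ones() as usize;
        if n < ones {
            // The target bit is in this word: clear the n lowest set bits,
            // then the lowest remaining set bit is the one we want.
            let mut w = word;
            for _ in 0..n {
                w &= w - 1; // clears the lowest set bit
            }
            return Some(i * 64 + w.trailing_zeros() as usize);
        }
        // Skip the whole word using its popcount.
        n -= ones;
    }
    None
}
```

Skipping whole words via `count_ones` keeps the scan cheap when the target bit is far into the bitmap; only the final word needs bit-by-bit work.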

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                coalesce_batches_filter                main
-----                                                                                -----------------------                ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.01    277.1±1.94ms        ? ?/sec    1.00    273.9±2.22ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.04      9.6±0.29ms        ? ?/sec    1.00      9.2±0.32ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.02      4.4±0.13ms        ? ?/sec    1.00      4.3±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.5±0.05ms        ? ?/sec    1.06      3.7±0.05ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.00    261.6±1.88ms        ? ?/sec    1.28    333.6±2.60ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.00     10.0±0.36ms        ? ?/sec    1.00     10.1±0.43ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.05      4.9±0.05ms        ? ?/sec    1.00      4.6±0.09ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.00      3.8±0.04ms        ? ?/sec    1.25      4.8±0.06ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.04     66.0±1.43ms        ? ?/sec    1.00     63.2±1.47ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     11.9±0.17ms        ? ?/sec    1.01     12.1±0.21ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.01     10.6±0.47ms        ? ?/sec    1.00     10.5±0.26ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.00      9.9±0.21ms        ? ?/sec    1.34     13.2±0.38ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.01     73.2±1.24ms        ? ?/sec    1.00     72.3±0.90ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.03     13.5±0.19ms        ? ?/sec    1.00     13.1±0.16ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.06     11.2±0.40ms        ? ?/sec    1.00     10.6±0.32ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.00     10.2±0.27ms        ? ?/sec    1.14     11.6±0.46ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.01     49.7±0.34ms        ? ?/sec    1.00     49.0±1.09ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.01      6.2±0.06ms        ? ?/sec    1.00      6.1±0.20ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.04      5.0±0.20ms        ? ?/sec    1.00      4.8±0.13ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.00      3.1±0.11ms        ? ?/sec    1.15      3.6±0.13ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.03     60.2±1.30ms        ? ?/sec    1.00     58.6±0.81ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.00      8.2±0.11ms        ? ?/sec    1.00      8.2±0.09ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.10      6.2±0.16ms        ? ?/sec    1.00      5.6±0.13ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      2.3±0.04ms        ? ?/sec    1.74      4.0±0.06ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.02     43.6±0.62ms        ? ?/sec    1.00     42.9±0.57ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.01      4.8±0.05ms        ? ?/sec    1.00      4.8±0.10ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.09      2.6±0.20ms        ? ?/sec    1.00      2.4±0.09ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.00  1181.2±18.79µs        ? ?/sec    1.32  1563.3±17.05µs        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.02     53.2±0.69ms        ? ?/sec    1.00     52.1±0.63ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.01      7.2±0.31ms        ? ?/sec    1.00      7.1±0.05ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.08      3.9±0.10ms        ? ?/sec    1.00      3.7±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.00      2.4±0.02ms        ? ?/sec    1.71      4.0±0.04ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.001                                1.00     54.1±0.43ms        ? ?/sec    1.80     97.2±0.29ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.01                                 1.00      5.9±0.06ms        ? ?/sec    1.59      9.3±0.19ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.1                                  1.00      3.9±0.42ms        ? ?/sec    1.00      3.9±0.12ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.8                                  1.00  1730.1±35.92µs        ? ?/sec    1.81      3.1±0.03ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.001                              1.00     59.1±0.67ms        ? ?/sec    2.13    126.1±1.67ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.01                               1.00      8.2±0.06ms        ? ?/sec    1.84     15.0±0.23ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.1                                1.00      6.5±0.30ms        ? ?/sec    1.08      7.0±0.11ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8                                1.00      4.9±0.07ms        ? ?/sec    1.85      9.1±0.08ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.03     68.0±0.33ms        ? ?/sec    1.00     66.1±0.38ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.04      7.7±0.05ms        ? ?/sec    1.00      7.4±0.24ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.03      4.1±0.32ms        ? ?/sec    1.00      3.9±0.10ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.00  1366.2±15.03µs        ? ?/sec    1.06  1442.0±19.64µs        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.06     89.9±0.59ms        ? ?/sec    1.00     84.7±1.17ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.04     11.6±0.09ms        ? ?/sec    1.00     11.2±0.07ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.00      5.2±0.31ms        ? ?/sec    1.03      5.4±0.22ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.00      2.7±0.07ms        ? ?/sec    1.45      4.0±0.04ms        ? ?/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (bb025cf) to ed9efe7 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                         coalesce_batches_filter                main
-----                                                                         -----------------------                ----
filter context decimal128 (kept 1/2)                                          1.00     44.1±1.01µs        ? ?/sec    1.02     45.1±1.75µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.00     50.7±0.96µs        ? ?/sec    1.00     50.9±1.35µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    240.2±4.97ns        ? ?/sec    1.04   248.8±27.57ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     77.8±0.69µs        ? ?/sec    1.00     77.7±0.42µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.00     10.5±0.26µs        ? ?/sec    1.00     10.5±0.32µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00    442.5±1.33ns        ? ?/sec    1.02    453.5±4.38ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     60.7±0.53µs        ? ?/sec    1.00     60.8±0.68µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     61.0±2.62µs        ? ?/sec    1.00     60.8±0.82µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     60.7±0.52µs        ? ?/sec    1.00     60.8±0.44µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     60.7±0.58µs        ? ?/sec    1.00     60.8±0.43µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     60.8±0.62µs        ? ?/sec    1.00     60.7±0.34µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.00     60.7±0.86µs        ? ?/sec    1.00     60.9±2.06µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     61.0±2.13µs        ? ?/sec    1.00     61.2±1.53µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     60.7±0.48µs        ? ?/sec    1.00     60.9±1.44µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     60.6±0.19µs        ? ?/sec    1.00     60.7±0.21µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.01     16.7±0.08µs        ? ?/sec    1.00     16.4±0.25µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.00      6.8±0.20µs        ? ?/sec    1.00      6.8±0.21µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.00    233.0±2.05ns        ? ?/sec    1.00    233.4±2.00ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.00     78.0±1.49µs        ? ?/sec    1.00     77.7±0.48µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.07     11.0±0.24µs        ? ?/sec    1.00     10.3±0.34µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.00    444.5±4.31ns        ? ?/sec    1.03    457.1±6.10ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.00    103.8±1.23µs        ? ?/sec    1.02    105.9±4.60µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     53.7±1.68µs        ? ?/sec    1.07     57.7±1.29µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.00    627.1±2.50ns        ? ?/sec    1.05    658.2±2.46ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    106.2±3.93µs        ? ?/sec    1.00    105.9±5.09µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00     54.8±1.60µs        ? ?/sec    1.01     55.3±0.75µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    464.9±5.81ns        ? ?/sec    1.03   479.0±25.84ns        ? ?/sec
filter context string (kept 1/2)                                              1.00   644.2±18.04µs        ? ?/sec    1.00   641.2±18.57µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00     17.0±0.14µs        ? ?/sec    1.03     17.5±0.29µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.07      8.1±0.28µs        ? ?/sec    1.00      7.5±0.19µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    810.9±3.02ns        ? ?/sec    1.06    855.6±3.37ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00     78.6±0.35µs        ? ?/sec    1.01     79.2±1.19µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.03     11.7±0.26µs        ? ?/sec    1.00     11.4±0.44µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00  1051.7±28.46ns        ? ?/sec    1.03   1084.0±6.95ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.00   919.0±59.05µs        ? ?/sec    1.00   918.0±58.64µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.00  1003.8±44.61ns        ? ?/sec    1.13  1135.8±13.05ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     15.1±0.29µs        ? ?/sec    1.00     15.0±0.19µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.01  1857.9±20.28ns        ? ?/sec    1.00  1840.3±27.94ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.00    228.6±0.61ns        ? ?/sec    1.00    228.7±2.06ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.00     76.0±0.43µs        ? ?/sec    1.00     76.1±0.88µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.01      5.4±0.02µs        ? ?/sec    1.00      5.4±0.02µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00    440.0±4.62ns        ? ?/sec    1.04    455.9±3.00ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.00     57.2±2.82µs        ? ?/sec    1.01     58.0±3.28µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.04     56.4±0.63µs        ? ?/sec    1.00     54.0±1.63µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      3.2±0.05µs        ? ?/sec    1.03      3.2±0.01µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.01    157.4±1.69µs        ? ?/sec    1.00    156.0±1.07µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.06    133.0±2.53µs        ? ?/sec    1.00    125.3±0.91µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.02     74.3±2.39µs        ? ?/sec    1.00     72.7±2.16µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      3.1±0.01µs        ? ?/sec    1.12      3.5±0.03µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.14    137.7±1.51µs        ? ?/sec    1.00    121.1±0.67µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.00     11.6±0.22µs        ? ?/sec    1.00     11.5±0.44µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      3.1±0.04µs        ? ?/sec    1.11      3.4±0.03µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.01   168.4±11.42µs        ? ?/sec    1.00    167.1±3.00µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    216.6±5.18µs        ? ?/sec    1.00    217.1±4.37µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      3.1±0.04µs        ? ?/sec    1.08      3.4±0.01µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.07     46.4±0.18µs        ? ?/sec    1.00     43.3±0.21µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.00      8.9±0.19µs        ? ?/sec    1.01      9.0±0.16µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.38      3.3±0.05µs        ? ?/sec    1.00      2.4±0.03µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.01     54.5±0.31µs        ? ?/sec    1.00     54.2±0.26µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.00      2.6±0.02µs        ? ?/sec    1.01      2.6±0.05µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.11      3.1±0.08µs        ? ?/sec    1.00      2.8±0.06µs        ? ?/sec
filter run array (kept 1/2)                                                   1.01   427.0±10.06µs        ? ?/sec    1.00   423.4±12.15µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.00    453.3±9.13µs        ? ?/sec    1.00    452.3±3.27µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.01    338.6±9.42µs        ? ?/sec    1.00    334.9±2.17µs        ? ?/sec
filter single record batch                                                    1.03     45.7±0.70µs        ? ?/sec    1.00     44.3±0.23µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.00     45.1±0.11µs        ? ?/sec    1.01     45.6±0.11µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.00      3.8±0.02µs        ? ?/sec    1.00      3.8±0.03µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      3.3±0.02µs        ? ?/sec    1.00      3.3±0.02µs        ? ?/sec

@alamb
Contributor

alamb commented Dec 11, 2025

Hi @Dandandan -- I am working through the arrow-rs review backlog. The coalesce benchmarks look better, but the filter kernels look potentially slower. I'll rerun to see if we can reproduce the results.

@alamb
Contributor

alamb commented Dec 11, 2025

run benchmark filter_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (bb025cf) to ed9efe7 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                         coalesce_batches_filter                main
-----                                                                         -----------------------                ----
filter context decimal128 (kept 1/2)                                          1.00     45.9±6.33µs        ? ?/sec    1.01     46.4±6.05µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.03     51.1±2.22µs        ? ?/sec    1.00     49.4±1.18µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    241.1±2.07ns        ? ?/sec    1.01    244.2±1.03ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     78.0±1.94µs        ? ?/sec    1.00     77.9±0.71µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.00     10.4±0.49µs        ? ?/sec    1.00     10.4±0.31µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00    451.1±1.68ns        ? ?/sec    1.02    458.4±7.82ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     60.6±0.28µs        ? ?/sec    1.00     60.6±0.24µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     60.8±0.96µs        ? ?/sec    1.00     60.7±0.90µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     60.5±0.24µs        ? ?/sec    1.00     60.8±0.81µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     61.1±1.47µs        ? ?/sec    1.00     60.8±0.80µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     60.8±0.80µs        ? ?/sec    1.00     60.7±0.76µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.00     60.7±0.66µs        ? ?/sec    1.00     61.0±1.31µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     60.6±0.93µs        ? ?/sec    1.00     60.8±0.33µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     60.9±0.62µs        ? ?/sec    1.00     60.7±0.25µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     60.6±0.46µs        ? ?/sec    1.00     60.7±0.60µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.01     16.6±0.09µs        ? ?/sec    1.00     16.5±0.17µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.00      6.4±0.40µs        ? ?/sec    1.02      6.6±0.40µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.01    238.3±1.05ns        ? ?/sec    1.00    235.9±3.08ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.00     77.7±0.54µs        ? ?/sec    1.00     77.7±1.59µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.00     10.2±0.45µs        ? ?/sec    1.05     10.8±0.41µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.00    454.0±5.90ns        ? ?/sec    1.02    462.1±7.68ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.00    104.5±4.74µs        ? ?/sec    1.02    106.5±4.35µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     53.9±1.70µs        ? ?/sec    1.04     56.2±1.61µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.00    675.7±7.20ns        ? ?/sec    1.01    684.0±2.67ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    104.8±5.30µs        ? ?/sec    1.02    107.0±3.28µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.01     55.5±1.57µs        ? ?/sec    1.00     54.9±1.38µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.01    482.5±4.17ns        ? ?/sec    1.00    477.0±1.44ns        ? ?/sec
filter context string (kept 1/2)                                              1.00   585.3±12.49µs        ? ?/sec    1.03   603.2±17.73µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.01     17.4±0.13µs        ? ?/sec    1.00     17.2±0.16µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00      7.2±0.50µs        ? ?/sec    1.09      7.9±0.19µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    832.1±8.58ns        ? ?/sec    1.01   842.1±14.13ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.01     79.1±1.46µs        ? ?/sec    1.00     78.6±0.86µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00     11.1±0.46µs        ? ?/sec    1.01     11.2±0.26µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00  1064.7±16.00ns        ? ?/sec    1.02   1086.4±7.18ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.00   670.3±23.59µs        ? ?/sec    1.37  917.2±161.37µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.02   1037.6±5.15ns        ? ?/sec    1.00  1015.4±17.73ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     15.0±0.23µs        ? ?/sec    1.00     15.0±0.08µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.00  1820.1±12.84ns        ? ?/sec    1.00  1826.1±31.22ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.00    230.7±1.42ns        ? ?/sec    1.00    231.1±2.77ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.00     76.0±0.66µs        ? ?/sec    1.00     76.2±0.61µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.00      5.4±0.12µs        ? ?/sec    1.00      5.4±0.03µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00    446.4±8.90ns        ? ?/sec    1.02    457.0±1.87ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.00     51.9±4.04µs        ? ?/sec    1.10     57.0±3.08µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.00     52.6±1.80µs        ? ?/sec    1.00     52.5±1.20µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      3.2±0.10µs        ? ?/sec    1.02      3.2±0.04µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.01    157.2±1.78µs        ? ?/sec    1.00    156.4±1.54µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.06    132.2±1.80µs        ? ?/sec    1.00    124.8±0.92µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.00     71.8±2.78µs        ? ?/sec    1.00     71.6±2.42µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      3.2±0.02µs        ? ?/sec    1.07      3.4±0.03µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.13    137.3±0.79µs        ? ?/sec    1.00    121.2±0.60µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.00     11.3±0.65µs        ? ?/sec    1.01     11.4±0.63µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      3.1±0.01µs        ? ?/sec    1.09      3.4±0.06µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.00   164.1±13.09µs        ? ?/sec    1.04   170.8±17.51µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00    204.1±6.86µs        ? ?/sec    1.01    206.4±8.06µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      3.2±0.03µs        ? ?/sec    1.06      3.4±0.02µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.07     46.4±0.12µs        ? ?/sec    1.00     43.3±0.16µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.00      8.5±0.36µs        ? ?/sec    1.05      9.0±0.39µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.39      3.3±0.06µs        ? ?/sec    1.00      2.4±0.01µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.01     54.5±0.54µs        ? ?/sec    1.00     54.1±0.37µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.00      2.6±0.01µs        ? ?/sec    1.01      2.6±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.11      3.0±0.02µs        ? ?/sec    1.00      2.8±0.02µs        ? ?/sec
filter run array (kept 1/2)                                                   1.00    426.0±5.10µs        ? ?/sec    1.00   424.0±11.28µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.00    450.1±4.75µs        ? ?/sec    1.00    450.9±5.01µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.01    337.5±5.50µs        ? ?/sec    1.00    335.4±3.55µs        ? ?/sec
filter single record batch                                                    1.03     45.6±0.24µs        ? ?/sec    1.00     44.3±0.42µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.00     45.1±0.11µs        ? ?/sec    1.01     45.6±0.20µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.02      3.8±0.03µs        ? ?/sec    1.00      3.7±0.03µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      3.3±0.05µs        ? ?/sec    1.00      3.3±0.02µs        ? ?/sec

@alamb
Contributor

alamb commented Dec 11, 2025

run benchmark filter_kernels

@Dandandan
Contributor Author

> Hi @Dandandan -- I am working through the arrow-rs review backlog. The coalesce benchmarks look better, but the filter kernels look potentially slower. I'll rerun to see if we can reproduce the results.

I don't think the filter kernels should be affected by anything other than the threshold change, but that case isn't covered by a benchmark.

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (bb025cf) to ed9efe7 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                         coalesce_batches_filter                main
-----                                                                         -----------------------                ----
filter context decimal128 (kept 1/2)                                          1.06     47.0±7.08µs        ? ?/sec    1.00     44.3±1.94µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.02     51.2±1.84µs        ? ?/sec    1.00     50.0±2.98µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    241.4±2.88ns        ? ?/sec    1.00    240.8±4.26ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     77.9±0.90µs        ? ?/sec    1.00     77.7±0.95µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.02     10.5±0.43µs        ? ?/sec    1.00     10.3±0.44µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00    442.0±7.36ns        ? ?/sec    1.03    454.6±1.42ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     60.6±0.67µs        ? ?/sec    1.00     60.9±0.62µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     60.7±0.74µs        ? ?/sec    1.00     60.8±0.52µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     60.6±0.30µs        ? ?/sec    1.00     60.8±0.61µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     60.7±0.52µs        ? ?/sec    1.01     61.1±0.80µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     60.9±1.82µs        ? ?/sec    1.00     60.7±0.56µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.00     60.7±0.43µs        ? ?/sec    1.00     60.6±0.16µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     60.8±0.32µs        ? ?/sec    1.02     61.9±5.83µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     60.9±1.96µs        ? ?/sec    1.00     60.7±0.69µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     60.6±0.42µs        ? ?/sec    1.00     60.7±0.34µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.02     17.0±0.11µs        ? ?/sec    1.00     16.6±0.09µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.00      6.5±0.45µs        ? ?/sec    1.00      6.6±0.38µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.02    239.3±1.79ns        ? ?/sec    1.00    234.5±2.04ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.01     78.0±0.70µs        ? ?/sec    1.00     77.5±0.39µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.05     10.6±0.52µs        ? ?/sec    1.00     10.1±0.53µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.00   450.4±20.45ns        ? ?/sec    1.03    463.5±3.66ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.01    105.2±5.62µs        ? ?/sec    1.00    104.2±4.47µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     54.9±1.34µs        ? ?/sec    1.01     55.5±1.30µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.00    663.5±1.51ns        ? ?/sec    1.02    678.2±2.96ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    103.8±4.66µs        ? ?/sec    1.01    104.5±5.17µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00     53.3±1.67µs        ? ?/sec    1.04     55.5±1.91µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    471.2±8.20ns        ? ?/sec    1.01    476.8±5.12ns        ? ?/sec
filter context string (kept 1/2)                                              1.00   576.6±11.44µs        ? ?/sec    1.01   584.2±12.53µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00     17.0±0.13µs        ? ?/sec    1.01     17.1±0.11µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.02      7.6±0.29µs        ? ?/sec    1.00      7.4±0.58µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.03    829.0±9.39ns        ? ?/sec    1.00    806.5±4.65ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00     78.8±0.44µs        ? ?/sec    1.00     78.4±1.77µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.06     11.5±0.49µs        ? ?/sec    1.00     10.9±0.44µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00  1053.4±19.67ns        ? ?/sec    1.00  1054.2±15.33ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.04   683.6±37.71µs        ? ?/sec    1.00   659.9±22.70µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.00   1045.0±4.96ns        ? ?/sec    1.01   1059.2±5.93ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     15.0±0.22µs        ? ?/sec    1.00     15.0±0.09µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.00  1817.8±13.65ns        ? ?/sec    1.00  1809.0±22.61ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.02    233.1±8.29ns        ? ?/sec    1.00    227.5±3.72ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.00     76.2±0.83µs        ? ?/sec    1.00     75.9±0.33µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.00      5.3±0.09µs        ? ?/sec    1.00      5.3±0.09µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00    436.9±3.27ns        ? ?/sec    1.04    455.9±5.89ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.00     50.7±3.62µs        ? ?/sec    1.14     57.8±3.52µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.01     53.4±1.60µs        ? ?/sec    1.00     52.6±1.58µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      3.2±0.01µs        ? ?/sec    1.02      3.2±0.03µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.01    157.2±0.54µs        ? ?/sec    1.00    155.8±1.49µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.05    131.9±2.38µs        ? ?/sec    1.00    125.3±1.78µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.01     74.1±2.24µs        ? ?/sec    1.00     73.6±2.14µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      3.1±0.03µs        ? ?/sec    1.10      3.4±0.02µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.13    137.3±0.78µs        ? ?/sec    1.00    121.4±0.88µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.06     11.7±0.67µs        ? ?/sec    1.00     11.1±0.61µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      3.1±0.03µs        ? ?/sec    1.08      3.4±0.01µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.02   165.8±14.63µs        ? ?/sec    1.00   163.2±12.91µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.02    214.8±7.61µs        ? ?/sec    1.00    210.2±6.72µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      3.1±0.03µs        ? ?/sec    1.07      3.3±0.05µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.07     46.6±0.57µs        ? ?/sec    1.00     43.4±0.77µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.00      8.7±0.38µs        ? ?/sec    1.00      8.8±0.41µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.39      3.3±0.02µs        ? ?/sec    1.00      2.4±0.01µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.00     54.5±0.44µs        ? ?/sec    1.00     54.4±0.93µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.00      2.6±0.04µs        ? ?/sec    1.00      2.6±0.03µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.10      3.1±0.09µs        ? ?/sec    1.00      2.8±0.14µs        ? ?/sec
filter run array (kept 1/2)                                                   1.00    425.0±1.68µs        ? ?/sec    1.00    423.0±7.01µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.00    449.7±6.02µs        ? ?/sec    1.00    449.4±4.66µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.00    336.6±2.12µs        ? ?/sec    1.00    335.4±2.43µs        ? ?/sec
filter single record batch                                                    1.03     45.6±0.15µs        ? ?/sec    1.00     44.2±0.14µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.00     45.2±0.87µs        ? ?/sec    1.01     45.7±0.74µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.02      3.8±0.02µs        ? ?/sec    1.00      3.8±0.05µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      3.3±0.07µs        ? ?/sec    1.00      3.3±0.07µs        ? ?/sec

@alamb
Contributor

alamb commented Dec 11, 2025

Ok, thank you -- I will plan to review this one carefully in the morning

Contributor

@alamb alamb left a comment


Thanks @Dandandan -- this looks really exciting; I had a few comments.

I suggest we break this PR up into several smaller ones (now that you have proof the benchmarks are working well):

  1. Add BooleanBufferBuilder::extend (and tests)
  2. Add BooleanBuffer::find_nth_set_bit_position (and tests)
  3. Add the changes to coalesce

/// assert_eq!(builder.len(), 4);
/// ```
pub fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) {
let (lower, upper) = iter.size_hint();
Contributor

I think this method would be more generally useful when appending to any BooleanBuffer, rather than just in NullBufferBuilder.

As part of the goal to consolidate mutable boolean operations in BooleanBufferBuilder so it is easier to find (and optimize) them, would you be willing to move this code to BooleanBufferBuilder so that the code in NullBufferBuilder looks something like this (which is what most other methods in NullBufferBuilder look like)?

    pub fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) {
        // Materialize since we're about to append bits
        self.materialize_if_needed();
        self.bitmap_builder.as_mut().unwrap().extend(iter)
    }
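
The delegation pattern suggested above can be illustrated with a small standalone sketch. `LazyNullBuilder` and its `Vec<bool>` bitmap are hypothetical stand-ins, not the arrow-rs types; the sketch only assumes that the builder tracks a length while every value is valid and allocates a bitmap the first time explicit bits have to be written.

```rust
/// Hypothetical stand-in for a lazily materialized null builder.
/// While `bitmap` is `None`, all `len` slots are implicitly valid.
struct LazyNullBuilder {
    len: usize,
    bitmap: Option<Vec<bool>>,
}

impl LazyNullBuilder {
    fn new() -> Self {
        Self { len: 0, bitmap: None }
    }

    /// Appending a valid slot does not need a bitmap unless one already exists.
    fn append_non_null(&mut self) {
        match &mut self.bitmap {
            Some(bits) => bits.push(true),
            None => self.len += 1,
        }
    }

    /// Allocate the bitmap and backfill the implicitly valid slots.
    fn materialize_if_needed(&mut self) {
        if self.bitmap.is_none() {
            self.bitmap = Some(vec![true; self.len]);
        }
    }

    /// Thin wrapper: materialize, then delegate to the inner builder's `extend`.
    fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) {
        self.materialize_if_needed();
        self.bitmap.as_mut().unwrap().extend(iter);
    }

    fn len(&self) -> usize {
        self.bitmap.as_ref().map_or(self.len, |bits| bits.len())
    }
}
```

With this shape, any optimization of the inner `extend` lives in one place, and the wrapper stays a two-line method.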

Contributor Author

Sounds good, I'll do that

let mut iter = iter.peekable();

// Process full u64 chunks (64 bits at a time)
while bit_idx + 64 <= end_bit && iter.peek().is_some() {
Contributor

As a follow-on PR, it might be worth aligning on a 64-bit boundary first (so the underlying code doesn't have to handle alignment): handle the leading bits up to the next 64-bit boundary specially, and then use the u64 path.
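
One way to read this suggestion is sketched below. It is not the PR's code: `extend_bits`, `push_bit`, and `get_bit` are hypothetical helpers over a plain `Vec<u8>` bitmap, assumed to be LSB-first within each byte. The sketch appends bit by bit until the length is a multiple of 64, then packs 64 bits at a time, then finishes any tail bit by bit.

```rust
/// Appends one bit to an LSB-first bitmap holding `*len` valid bits.
fn push_bit(buf: &mut Vec<u8>, len: &mut usize, bit: bool) {
    if *len % 8 == 0 {
        buf.push(0); // start a fresh byte
    }
    if bit {
        buf[*len / 8] |= 1 << (*len % 8);
    }
    *len += 1;
}

/// Reads bit `i` of an LSB-first bitmap.
fn get_bit(buf: &[u8], i: usize) -> bool {
    ((buf[i / 8] >> (i % 8)) & 1) == 1
}

/// Appends all of `bits`, using a u64 fast path once `*len` is 64-bit aligned.
fn extend_bits<I: Iterator<Item = bool>>(buf: &mut Vec<u8>, len: &mut usize, mut bits: I) {
    // Phase 1: bit by bit until the length reaches a 64-bit boundary.
    while *len % 64 != 0 {
        match bits.next() {
            Some(bit) => push_bit(buf, len, bit),
            None => return,
        }
    }
    // Phase 2: the length is 64-bit aligned, so whole words can be appended.
    loop {
        let mut chunk = 0u64;
        let mut n = 0u32;
        while n < 64 {
            match bits.next() {
                Some(bit) => {
                    chunk |= (bit as u64) << n;
                    n += 1;
                }
                None => break,
            }
        }
        if n == 64 {
            // Little-endian bytes keep bit k of the word at bitmap offset k.
            buf.extend_from_slice(&chunk.to_le_bytes());
            *len += 64;
        } else {
            // Phase 3: fewer than 64 bits were left; append them one at a time.
            for i in 0..n {
                push_bit(buf, len, (chunk >> i) & 1 == 1);
            }
            return;
        }
    }
}
```

Because the word path only runs from an aligned length, it never needs to shift a chunk across a byte boundary.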

Contributor Author

👍

}
let byte_idx = (bit_idx - start_len) / 8 + start_byte;
// Write the u64 chunk as 8 bytes
slice[byte_idx..byte_idx + 8].copy_from_slice(&chunk.to_le_bytes());
Contributor

You could try unsafe here too, since you ensured the right length above.
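
For illustration, the checked write and a possible unchecked variant are sketched side by side below. Both function names are hypothetical, and whether the unchecked form is measurably faster would need a benchmark; the sketch only shows that the two produce the same bytes when the caller upholds the length guarantee.

```rust
/// Bounds-checked write, equivalent to the line under review.
fn write_chunk_checked(slice: &mut [u8], byte_idx: usize, chunk: u64) {
    slice[byte_idx..byte_idx + 8].copy_from_slice(&chunk.to_le_bytes());
}

/// Unchecked write of the same 8 little-endian bytes.
///
/// # Safety
/// The caller must guarantee `byte_idx + 8 <= slice.len()`.
unsafe fn write_chunk_unchecked(slice: &mut [u8], byte_idx: usize, chunk: u64) {
    debug_assert!(byte_idx + 8 <= slice.len());
    unsafe {
        // The destination is not necessarily 8-byte aligned, hence `write_unaligned`.
        let dst = slice.as_mut_ptr().add(byte_idx) as *mut u64;
        dst.write_unaligned(chunk.to_le());
    }
}
```

The `debug_assert!` keeps the length check in debug and test builds while removing it from release builds.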

// Test extend with non-aligned start (tests bit-by-bit path)
let mut builder = NullBufferBuilder::new(0);
builder.append_non_null(); // Start at bit 1 (non-aligned)
builder.extend([false, true, false, true].iter().copied());
Contributor

I think we should probably also test non-aligned writes with more than 64 bits (this only copies 4 bits).

batch: RecordBatch,
filter: &BooleanArray,
) -> Result<(), ArrowError> {
// TODO: optimize this to avoid materializing (copying the results
Contributor

🎉

}

/// Find the position after the n-th set bit in a boolean array starting from `start`.
/// Returns the position after the n-th set bit, or the end of the array if fewer than n bits are set.
Contributor

I recommend we move this code into BooleanBuffer as well, so it is easier to find and reuse.

Contributor Author

Good idea
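
For readers unfamiliar with the helper being discussed, here is a rough sketch of the technique described in its doc comment: scan 64-bit words with popcount, then locate the n-th set bit inside the word that contains it. This is not the PR's code. The signature (a `&[u64]` of LSB-first words plus a bit length `len`) and the behaviour for `n == 0` (return `start`) are assumptions made for illustration.

```rust
/// Return the position just after the `n`-th set bit at or after `start`,
/// or `len` if fewer than `n` bits are set in `start..len`.
/// Assumes `words` holds at least `len` bits, LSB-first within each word.
fn find_nth_set_bit_position(words: &[u64], len: usize, start: usize, n: usize) -> usize {
    if n == 0 || start >= len {
        return start.min(len);
    }
    let mut remaining = n;
    let first_word = start / 64;
    let last_word = (len - 1) / 64;
    for word_idx in first_word..=last_word {
        let mut word = words[word_idx];
        // Ignore bits before `start` in the first word.
        if word_idx == first_word {
            word &= u64::MAX << (start % 64);
        }
        // Ignore bits at or past `len` in the last word.
        if word_idx == last_word && len % 64 != 0 {
            word &= (1u64 << (len % 64)) - 1;
        }
        let ones = word.count_ones() as usize;
        if ones < remaining {
            remaining -= ones;
            continue;
        }
        // The n-th set bit is in this word: clear the lower set bits first.
        for _ in 0..remaining - 1 {
            word &= word - 1;
        }
        return word_idx * 64 + word.trailing_zeros() as usize + 1;
    }
    len
}
```

With `words = [0b1011_0010, 1]` and `len = 70`, the set bits are at positions 1, 4, 5, 7 and 64, so asking for the third set bit from position 0 yields 6, and asking for a sixth yields 70.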


/// Copy rows at the given indices from the current source array into the in-progress array
fn copy_rows_by_filter(&mut self, filter: &FilterPredicate) -> Result<(), ArrowError> {
// Default implementation: iterate over indices from the filter
Contributor

As a follow-on, it seems we should implement something similar for the byte array filter types? If so, I'll file a ticket.

Contributor Author

Correct, for view and byte arrays.

self.nulls.append_n_non_nulls(count);
}
}
IterationStrategy::Slices(slices) => {
Contributor

I think this function needs some tests

I ran code coverage like this:

cargo llvm-cov test --html -p arrow-buffer -p arrow-select

and there appears to be no coverage.

Dandandan and others added 2 commits December 13, 2025 05:02
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Dandandan
Contributor Author

Thanks @Dandandan -- this looks really exciting; I had a few comments.

I suggest we break this PR up into several smaller ones (now that you have proof the benchmarks are working well):

  1. Add BooleanBufferBuilder::extend (and tests)
  2. Add BooleanBuffer::find_nth_set_bit_position (and tests)
  3. Add the changes to coalesce

Sounds like a good plan, I'll follow that!

ClSlaid added a commit to ClSlaid/arrow-rs that referenced this pull request Dec 14, 2025
MVP for apache#8957
awaits for apache#8951

very first version for behaviour review, optimizations TBD

Signed-off-by: 蔡略 <cailue@apache.org>
let mut bit_idx = start_len;
let end_bit = start_len + len;

// Process in chunks of 64 bits when byte-aligned for better performance
Member

I'm a bit curious why this doesn't split the work into unaligned and aligned handling, for example:

handle_unaligned() // handle the start_len % 8 header
handle_aligned() // handle the inner payload
handle_unaligned() // handle the trailer
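
The three phases above could be sketched roughly as follows. This is an illustration only, not the PR's implementation: `extend_bits_three_phase` is a hypothetical name, the input is a plain `&[bool]` rather than an iterator, and it assumes any unused bits in the buffer's last byte are already zero.

```rust
/// Append `bits` to an LSB-first bit-packed buffer that already holds `len`
/// bits. Returns the new bit length.
fn extend_bits_three_phase(buf: &mut Vec<u8>, len: usize, bits: &[bool]) -> usize {
    let new_len = len + bits.len();
    buf.resize((new_len + 7) / 8, 0);
    let mut i = 0; // index into `bits`
    let mut pos = len; // current bit position in `buf`

    // Header: bit by bit until `pos` reaches a byte boundary.
    while pos % 8 != 0 && i < bits.len() {
        if bits[i] {
            buf[pos / 8] |= 1u8 << (pos % 8);
        }
        pos += 1;
        i += 1;
    }

    // Aligned payload: pack 64 bits into a u64 and store it as 8 bytes.
    while bits.len() - i >= 64 {
        let mut chunk = 0u64;
        for (k, &b) in bits[i..i + 64].iter().enumerate() {
            chunk |= (b as u64) << k;
        }
        let byte_idx = pos / 8;
        buf[byte_idx..byte_idx + 8].copy_from_slice(&chunk.to_le_bytes());
        pos += 64;
        i += 64;
    }

    // Trailer: the remaining (fewer than 64) bits, bit by bit.
    while i < bits.len() {
        if bits[i] {
            buf[pos / 8] |= 1u8 << (pos % 8);
        }
        pos += 1;
        i += 1;
    }
    new_len
}
```

Because the payload loop only ever starts on a byte boundary, it can overwrite whole bytes without masking, which is the simplification the comment is asking about.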

Contributor Author

I think this is the same comment @alamb made?

Yeah, perhaps this can improve performance (we'll see, guided by benchmarks).

Contributor Author

I just checked: this seems to give an additional ~30% improvement for null handling:

filter: primitive, 8192, nulls: 0.1, selectivity: 0.1
                        time:   [2.4060 ms 2.4096 ms 2.4133 ms]
                        change: [−33.920% −32.902% −32.274%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

filter: primitive, 8192, nulls: 0.1, selectivity: 0.8
                        time:   [2.1610 ms 2.1666 ms 2.1728 ms]
                        change: [−29.488% −28.499% −27.767%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

@Dandandan
Contributor Author

run benchmark coalesce_kernels

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (f718f2e) to ed9efe7 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

@Dandandan
Contributor Author

run benchmark coalesce_kernels

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                coalesce_batches_filter                main
-----                                                                                -----------------------                ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.00    257.2±3.16ms        ? ?/sec    1.02    263.0±3.88ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.00      8.6±0.14ms        ? ?/sec    1.00      8.6±0.11ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.00      4.0±0.08ms        ? ?/sec    1.02      4.1±0.08ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.3±0.10ms        ? ?/sec    1.08      3.5±0.03ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.00    242.3±3.47ms        ? ?/sec    1.31    317.6±4.11ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.00      9.3±0.13ms        ? ?/sec    1.00      9.3±0.34ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.01      4.6±0.06ms        ? ?/sec    1.00      4.5±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.00      3.7±0.04ms        ? ?/sec    1.24      4.6±0.05ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.00     59.6±0.52ms        ? ?/sec    1.01     59.9±1.07ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     11.2±0.11ms        ? ?/sec    1.03     11.6±0.15ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.00      9.1±0.27ms        ? ?/sec    1.03      9.3±0.29ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.00      7.8±0.20ms        ? ?/sec    1.41     11.0±0.27ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.00     69.7±0.42ms        ? ?/sec    1.00     69.4±0.68ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.00     12.8±0.10ms        ? ?/sec    1.00     12.7±0.14ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.02     10.0±0.35ms        ? ?/sec    1.00      9.8±0.28ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.00      8.6±0.20ms        ? ?/sec    1.14      9.8±0.22ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.00     48.8±0.30ms        ? ?/sec    1.00     48.6±0.63ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.00      5.9±0.04ms        ? ?/sec    1.01      6.0±0.17ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.00      4.3±0.11ms        ? ?/sec    1.05      4.6±0.20ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.00      2.6±0.03ms        ? ?/sec    1.22      3.1±0.08ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.00     58.0±0.49ms        ? ?/sec    1.01     58.4±0.73ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.00      7.9±0.06ms        ? ?/sec    1.00      7.9±0.06ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.00      5.4±0.20ms        ? ?/sec    1.00      5.5±0.15ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      2.2±0.02ms        ? ?/sec    1.75      3.9±0.06ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.01     42.7±0.22ms        ? ?/sec    1.00     42.5±0.84ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.00      4.7±0.03ms        ? ?/sec    1.00      4.7±0.03ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.05      2.5±0.21ms        ? ?/sec    1.00      2.4±0.19ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.00  1100.1±30.40µs        ? ?/sec    1.40  1535.0±16.88µs        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.00     52.1±0.82ms        ? ?/sec    1.00     52.0±0.33ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.00      7.0±0.04ms        ? ?/sec    1.01      7.1±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.01      3.7±0.16ms        ? ?/sec    1.00      3.7±0.19ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.00      2.3±0.02ms        ? ?/sec    1.70      3.9±0.02ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.001                                1.00     53.7±0.17ms        ? ?/sec    1.82     97.8±0.57ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.01                                 1.00      5.9±0.03ms        ? ?/sec    1.59      9.3±0.12ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.1                                  1.00      3.5±0.37ms        ? ?/sec    1.13      3.9±0.31ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.8                                  1.00   1673.0±9.89µs        ? ?/sec    1.86      3.1±0.02ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.001                              1.00     59.2±1.20ms        ? ?/sec    2.14    126.3±0.72ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.01                               1.00      8.1±0.07ms        ? ?/sec    1.85     15.0±0.45ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.1                                1.00      6.7±0.49ms        ? ?/sec    1.05      7.0±0.28ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8                                1.00      4.9±0.13ms        ? ?/sec    1.87      9.1±0.09ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.01     66.9±0.24ms        ? ?/sec    1.00     66.2±0.76ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.03      7.6±0.04ms        ? ?/sec    1.00      7.3±0.19ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.00      3.9±0.10ms        ? ?/sec    1.04      4.0±0.27ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.00   1285.2±8.42µs        ? ?/sec    1.11  1421.9±17.75µs        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.07     89.6±0.36ms        ? ?/sec    1.00     84.1±0.60ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.02     11.5±0.21ms        ? ?/sec    1.00     11.3±0.08ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.00      5.6±0.32ms        ? ?/sec    1.00      5.6±0.34ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.00      2.7±0.02ms        ? ?/sec    1.49      3.9±0.05ms        ? ?/sec

@alamb-ghbot

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing coalesce_batches_filter (b235243) to ed9efe7 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=coalesce_batches_filter
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                                                                coalesce_batches_filter                main
-----                                                                                -----------------------                ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.00    257.2±2.50ms        ? ?/sec    1.00    256.6±3.74ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.04      8.8±0.18ms        ? ?/sec    1.00      8.5±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.03      4.2±0.13ms        ? ?/sec    1.00      4.1±0.12ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.3±0.04ms        ? ?/sec    1.08      3.5±0.03ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.00    244.5±3.19ms        ? ?/sec    1.27    310.3±4.52ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.01      9.4±0.17ms        ? ?/sec    1.00      9.3±0.24ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.03      4.6±0.11ms        ? ?/sec    1.00      4.5±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.00      3.7±0.07ms        ? ?/sec    1.23      4.6±0.09ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.00     59.7±0.49ms        ? ?/sec    1.00     59.5±0.60ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     11.5±0.19ms        ? ?/sec    1.01     11.5±0.15ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.04      9.5±0.33ms        ? ?/sec    1.00      9.2±0.40ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.00      7.7±0.15ms        ? ?/sec    1.33     10.3±0.21ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.01     70.3±0.37ms        ? ?/sec    1.00     69.6±2.61ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.00     12.8±0.15ms        ? ?/sec    1.00     12.8±0.23ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.01      9.6±0.33ms        ? ?/sec    1.00      9.5±0.21ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.00      8.3±0.22ms        ? ?/sec    1.17      9.8±0.21ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.02     49.5±0.41ms        ? ?/sec    1.00     48.4±0.46ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.00      5.9±0.06ms        ? ?/sec    1.00      5.9±0.10ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.03      4.6±0.25ms        ? ?/sec    1.00      4.5±0.23ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.00      2.6±0.05ms        ? ?/sec    1.14      3.0±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.01     59.2±0.47ms        ? ?/sec    1.00     58.5±0.79ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.00      8.0±0.25ms        ? ?/sec    1.01      8.1±0.23ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.00      5.5±0.25ms        ? ?/sec    1.03      5.7±0.22ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      2.3±0.02ms        ? ?/sec    1.73      3.9±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.02     43.5±0.42ms        ? ?/sec    1.00     42.6±0.28ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.00      4.7±0.03ms        ? ?/sec    1.00      4.7±0.10ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.10      2.5±0.19ms        ? ?/sec    1.00      2.3±0.18ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.00  1176.8±34.76µs        ? ?/sec    1.28   1506.1±8.68µs        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.02     53.5±0.65ms        ? ?/sec    1.00     52.4±0.43ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.00      7.0±0.04ms        ? ?/sec    1.01      7.1±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.03      3.8±0.19ms        ? ?/sec    1.00      3.7±0.12ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.00      2.4±0.06ms        ? ?/sec    1.64      3.9±0.07ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.001                                1.00     55.4±0.82ms        ? ?/sec    1.76     97.3±1.34ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.01                                 1.00      5.8±0.02ms        ? ?/sec    1.60      9.3±0.07ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.1                                  1.00      3.6±0.41ms        ? ?/sec    1.10      4.0±0.37ms        ? ?/sec
filter: primitive, 8192, nulls: 0, selectivity: 0.8                                  1.00  1681.5±15.51µs        ? ?/sec    1.86      3.1±0.06ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.001                              1.00     62.1±0.50ms        ? ?/sec    2.02    125.4±0.57ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.01                               1.00     10.1±0.15ms        ? ?/sec    1.50     15.0±0.07ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.1                                1.00      6.4±0.25ms        ? ?/sec    1.16      7.4±0.36ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8                                1.00      4.5±0.03ms        ? ?/sec    2.02      9.1±0.05ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.02     67.3±1.39ms        ? ?/sec    1.00     66.2±0.99ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.04      7.6±0.04ms        ? ?/sec    1.00      7.3±0.12ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.00      4.1±0.31ms        ? ?/sec    1.02      4.2±0.36ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.00   1335.7±6.87µs        ? ?/sec    1.06   1412.2±9.47µs        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.07     89.5±0.97ms        ? ?/sec    1.00     83.9±0.28ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.04     11.6±0.28ms        ? ?/sec    1.00     11.2±0.12ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.01      5.5±0.36ms        ? ?/sec    1.00      5.5±0.31ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.00      2.7±0.01ms        ? ?/sec    1.46      3.9±0.04ms        ? ?/sec

@Dandandan
Contributor Author

filter: primitive, 8192, nulls: 0.1, selectivity: 0.1                                1.00      6.4±0.25ms        ? ?/sec    1.16      7.4±0.36ms        ? ?/sec
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8                                1.00      4.5±0.03ms        ? ?/sec    2.02      9.1±0.05ms        ? ?/sec

nice

alamb pushed a commit that referenced this pull request Dec 15, 2025
MVP for #8957
awaits for #8951

very first version for reviewers to confirm behaviour, optimizations TBD

Signed-off-by: 蔡略 <cailue@apache.org>
@alamb
Contributor

alamb commented Dec 17, 2025

I suggest we break this PR up into several smaller ones (now that you have proof the benchmarks are working well):

@Dandandan would you like help getting this PR into shape / creating the smaller PRs?

Labels: arrow, performance