Make push_batch_with_filter up to 3x faster for primitive types
#8951
base: main
Conversation
Title history:
- Make filtered coalescing faster for primitive
- push_batch_with_filter faster for primitive types
- push_batch_with_filter faster for primitive types: up to 10x faster
- push_batch_with_filter up to 10x faster for primitive types
@alamb you are probably interested in this

YAAAAASSS -- this is exactly the type of thing I was hoping for with BatchCoalescer. I will check this out shortly
| let filtered_batch = filter_record_batch(&batch, filter)?; | ||
| self.push_batch(filtered_batch) | ||
| // We only support primitive now, fallback to filter_record_batch for other types | ||
| // Also, skip optimization when filter is not very selective |
Not sure if it's always better; we might want to take biggest_coalesce_batch_size into account.
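The "skip optimization when the filter is not very selective" gate quoted above could look roughly like the sketch below. This is an illustration only: `use_specialized_path` and the 0.8 cutoff are made-up names and numbers, not the PR's actual API or threshold.

```rust
// Illustrative selectivity gate for the specialized filtered-copy path.
// The function name and the 0.8 cutoff are assumptions for this sketch.
fn use_specialized_path(true_count: usize, len: usize) -> bool {
    if len == 0 {
        return false;
    }
    // Selectivity here = fraction of rows the filter keeps. Copying the
    // selected rows one by one only pays off when most rows are filtered
    // out; otherwise a full filter_record_batch tends to win.
    (true_count as f64 / len as f64) < 0.8
}

fn main() {
    assert!(use_specialized_path(100, 8192)); // keeps ~1%: copy selected rows
    assert!(!use_specialized_path(8000, 8192)); // keeps ~98%: fall back to the generic kernel
}
```

In a real implementation the cutoff would be chosen by benchmarking, and (per the comment above) could also factor in the configured maximum batch size.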
run benchmark filter_kernels

show benchmark queue
🤖 Hi @alamb, you asked to view the benchmark queue (#8951 (comment)).
Hm, it seems it contains a bug; probably makes the benchmark results off as well (will take a look tomorrow).
🤖: Benchmark completed Details

run benchmark coalesce_kernels
🤖 Hi @Dandandan, thanks for the request (#8951 (comment)).
Please choose one or more of these with
- push_batch_with_filter up to 2x faster for primitive types
- push_batch_with_filter up to 3x faster for primitive types
@alamb I think it's ok now - I called AI (Opus 4.5) for some help on the ... Mainly needs some polish and seeing if we can improve the ...

🤖: Benchmark completed Details

🤖: Benchmark completed Details
Hi @Dandandan -- I am working through the arrow-rs review backlog. The coalesce benchmarks look better; the filter kernels look potentially slower. I'll rerun and try to see if we can reproduce the results.

run benchmark filter_kernels

🤖: Benchmark completed Details
run benchmark filter_kernels

I don't think the filter kernels should see any impact other than from the threshold, but that isn't covered by a benchmark.

🤖: Benchmark completed Details

Ok, thank you -- I will plan to review this one carefully in the morning.
alamb left a comment
Thanks @Dandandan -- this looks really exciting; I had a few comments.
I suggest we break this PR up into several smaller ones (now that you have proof the benchmarks are working well):
- Add BooleanBufferBuilder::extend (and tests)
- Add BooleanBuffer::find_nth_set_bit_position (and tests)
- Add the changes to coalesce
arrow-buffer/src/builder/null.rs
| /// assert_eq!(builder.len(), 4); | ||
| /// ``` | ||
| pub fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) { | ||
| let (lower, upper) = iter.size_hint(); |
I think this method would be more generally useful when appending to any BooleanBuffer rather than just NullBufferBuilder.
As part of the goal to consolidate mutable boolean operations in BooleanBufferBuilder so it is easier to find (and optimize) them, would you be willing to move this code to BooleanBufferBuilder so that the code in NullBufferBuilder looks like something like this (which is what most other methods in NullBufferBuilder look like)?
    pub fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) {
        // Materialize since we're about to append bits
        self.materialize_if_needed();
        self.bitmap_builder.as_mut().unwrap().extend(iter)
    }
Sounds good, I'll do that
arrow-buffer/src/builder/null.rs
| let mut iter = iter.peekable(); | ||
| | ||
| // Process full u64 chunks (64 bits at a time) | ||
| while bit_idx + 64 <= end_bit && iter.peek().is_some() { | ||
As a follow on PR, it might be worth aligning first on 64 bit boundaries (so the underlying code doesn't have to handle aligning) -- aka handle bits 0..63 (until 64 bit alignment) specially and then use the u64 path
👍
arrow-buffer/src/builder/null.rs
| } | ||
| let byte_idx = (bit_idx - start_len) / 8 + start_byte; | ||
| // Write the u64 chunk as 8 bytes | ||
| slice[byte_idx..byte_idx + 8].copy_from_slice(&chunk.to_le_bytes()); |
could try unsafe here too as you ensured the right length above
| // Test extend with non-aligned start (tests bit-by-bit path) | ||
| let mut builder = NullBufferBuilder::new(0); | ||
| builder.append_non_null(); // Start at bit 1 (non-aligned) | ||
| builder.extend([false, true, false, true].iter().copied()); |
I think we should probably test non aligned writes with more than 64 bits as well (this only copies 4 bits)
| batch: RecordBatch, | ||
| filter: &BooleanArray, | ||
| ) -> Result<(), ArrowError> { | ||
| // TODO: optimize this to avoid materializing (copying the results |
🎉
| } | ||
| | ||
| /// Find the position after the n-th set bit in a boolean array starting from `start`. | ||
| /// Returns the position after the n-th set bit, or the end of the array if fewer than n bits are set. | ||
I recommend we move this code into BooleanBuffer as well so it is easier to find / reuse
Good idea
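A std-only sketch of what `find_nth_set_bit_position` could do once moved onto `BooleanBuffer`. The `&[u64]` bitmap representation and this exact signature are assumptions for illustration; the point is the middle loop, which skips 64 bits at a time via popcount instead of scanning bit by bit.

```rust
/// Return the position just after the n-th set bit (n >= 1), scanning from
/// `start`; returns `len` if fewer than n bits are set. Sketch over a plain
/// little-endian `&[u64]` bitmap, not the real BooleanBuffer API.
fn find_nth_set_bit_position(words: &[u64], len: usize, start: usize, mut n: usize) -> usize {
    let mut i = start;
    // Align: scan bit-by-bit until the next 64-bit word boundary
    while i < len && i % 64 != 0 {
        if (words[i / 64] >> (i % 64)) & 1 == 1 {
            n -= 1;
            if n == 0 {
                return i + 1;
            }
        }
        i += 1;
    }
    // Skip whole words with popcount while they hold fewer than n set bits
    while i + 64 <= len && (words[i / 64].count_ones() as usize) < n {
        n -= words[i / 64].count_ones() as usize;
        i += 64;
    }
    // Finish bit-by-bit inside the word containing the n-th set bit
    while i < len {
        if (words[i / 64] >> (i % 64)) & 1 == 1 {
            n -= 1;
            if n == 0 {
                return i + 1;
            }
        }
        i += 1;
    }
    len
}

fn main() {
    // bits set at positions 1, 3, 5, 7
    let words = [0b1010_1010u64];
    assert_eq!(find_nth_set_bit_position(&words, 8, 0, 2), 4);
    assert_eq!(find_nth_set_bit_position(&words, 8, 2, 1), 4);
    // fewer than 10 bits set: returns the end of the array
    assert_eq!(find_nth_set_bit_position(&words, 8, 0, 10), 8);
    assert_eq!(find_nth_set_bit_position(&[u64::MAX; 2], 128, 0, 70), 70);
}
```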
| | ||
| /// Copy rows at the given indices from the current source array into the in-progress array | ||
| fn copy_rows_by_filter(&mut self, filter: &FilterPredicate) -> Result<(), ArrowError> { | ||
| // Default implementation: iterate over indices from the filter | ||
It seems like as a follow on we should implement something similar for the byte array filter types? If that is true I'll file a ticket
Correct, for views & byte arrays
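For byte arrays, a follow-on `copy_rows_by_filter` might look like this sketch, which models the offsets/values buffers as plain slices and `Vec`s rather than the real arrow-rs types; the function name and shapes are hypothetical.

```rust
// Hypothetical byte-array (string/binary) variant of copy-by-filter:
// for each kept row, copy its value slice and record the new end offset.
fn copy_string_rows_by_filter(
    offsets: &[i32],
    values: &[u8],
    keep: &[bool],
    out_offsets: &mut Vec<i32>,
    out_values: &mut Vec<u8>,
) {
    for (i, &k) in keep.iter().enumerate() {
        if k {
            let (s, e) = (offsets[i] as usize, offsets[i + 1] as usize);
            out_values.extend_from_slice(&values[s..e]);
            out_offsets.push(out_values.len() as i32);
        }
    }
}

fn main() {
    // Three rows: "foo", "" (empty), "hello"
    let offsets = [0i32, 3, 3, 8];
    let keep = [true, false, true];
    let (mut out_offsets, mut out_values) = (vec![0i32], Vec::new());
    copy_string_rows_by_filter(&offsets, b"foohello", &keep, &mut out_offsets, &mut out_values);
    assert_eq!(out_values, b"foohello".to_vec());
    assert_eq!(out_offsets, vec![0, 3, 8]);
}
```

A view-array version would copy the fixed-size view structs instead and share the underlying data buffers, which is why it is listed as a separate follow-on.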
| self.nulls.append_n_non_nulls(count); | ||
| } | ||
| } | ||
| IterationStrategy::Slices(slices) => { |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sounds like a good plan, I'll follow that!
MVP for apache#8957, awaits apache#8951. Very first version for behaviour review; optimizations TBD.
Signed-off-by: 蔡略 <cailue@apache.org>
arrow-buffer/src/builder/null.rs
| let mut bit_idx = start_len; | ||
| let end_bit = start_len + len; | ||
| | ||
| // Process in chunks of 64 bits when byte-aligned for better performance | ||
I'm a bit curious why this doesn't have separate parts for unaligned and aligned handling, i.e.:
handle_unaligned() // handle the start_len % 8 header
handle_aligned()   // handle inner payloads
handle_unaligned() // handle the trailer
I think this is the same comment as @alamb has?
Yeah perhaps this can improve performance (will see guided by benchmarks).
I just checked - this seems an additional ~30% improvement for null handling:
filter: primitive, 8192, nulls: 0.1, selectivity: 0.1
time: [2.4060 ms 2.4096 ms 2.4133 ms]
change: [−33.920% −32.902% −32.274%] (p = 0.00 < 0.05)
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
filter: primitive, 8192, nulls: 0.1, selectivity: 0.8
time: [2.1610 ms 2.1666 ms 2.1728 ms]
change: [−29.488% −28.499% −27.767%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
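The header/payload/trailer split discussed above can be sketched std-only as below, on a stand-in `BitBuilder` over a plain `Vec<u8>` bitmap (the real code lives in arrow-buffer's NullBufferBuilder/BooleanBufferBuilder). Unaligned leading bits are written one at a time until the write position hits a 64-bit boundary, full 64-bit chunks are then stored as 8 little-endian bytes, and leftover trailing bits are again written individually.

```rust
// Stand-in bit builder illustrating the header/payload/trailer structure.
struct BitBuilder {
    bytes: Vec<u8>,
    len: usize, // length in bits
}

impl BitBuilder {
    fn new() -> Self {
        Self { bytes: Vec::new(), len: 0 }
    }

    fn push(&mut self, bit: bool) {
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if bit {
            self.bytes[self.len / 8] |= 1 << (self.len % 8);
        }
        self.len += 1;
    }

    fn extend<I: Iterator<Item = bool>>(&mut self, iter: I) {
        let mut iter = iter.peekable();
        // Header: write bit-by-bit until the length is 64-bit aligned
        while self.len % 64 != 0 && iter.peek().is_some() {
            let bit = iter.next().unwrap();
            self.push(bit);
        }
        // Payload: pack full 64-bit chunks and append as 8 LE bytes
        loop {
            let mut chunk = 0u64;
            let mut n = 0;
            while n < 64 {
                match iter.next() {
                    Some(true) => chunk |= 1 << n,
                    Some(false) => {}
                    None => break,
                }
                n += 1;
            }
            if n == 64 {
                self.bytes.extend_from_slice(&chunk.to_le_bytes());
                self.len += 64;
            } else {
                // Trailer: fewer than 64 bits remained; write them one by one
                for i in 0..n {
                    self.push((chunk >> i) & 1 == 1);
                }
                break;
            }
        }
    }

    fn get(&self, i: usize) -> bool {
        (self.bytes[i / 8] >> (i % 8)) & 1 == 1
    }
}

fn main() {
    let mut b = BitBuilder::new();
    b.push(true); // start at bit 1, i.e. a non-aligned write position
    b.extend((0..130).map(|i| i % 3 == 0)); // >64 bits, exercising all three phases
    assert_eq!(b.len, 131);
    assert!(b.get(0));
    for i in 0..130 {
        assert_eq!(b.get(i + 1), i % 3 == 0);
    }
}
```

This also shows the kind of test asked for earlier in the review: a non-aligned start followed by more than 64 bits, so the header, payload, and trailer paths all run.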
run benchmark coalesce_kernels

run benchmark coalesce_kernels

🤖: Benchmark completed Details

🤖: Benchmark completed Details
nice

@Dandandan would you like help getting this PR into shape / creating the smaller PRs?

Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?