Conversation
Benchmark Results
Benchmark Plots: A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
UPD: #218 (comment)
src/reduce.jl
Outdated
- `algo` specifies which reduction algorithm to use:
  - `Reduction.thread`:
    Perform thread group reduction (requires `groupsize * sizeof(T)` bytes of shared memory).
    Available across all backends.
  - `Reduction.warp`:
    Perform warp group reduction (requires `32 * sizeof(T)` bytes of shared memory).
    Potentially faster, since it requires fewer writes to shared memory.
    To query whether a backend supports warp reduction, use `supports_warp_reduction(backend)`.
Why is that needed? Shouldn't the backend go and use warp reductions if it can?
I'm now doing an auto-selection of the algorithm based on the device function `__supports_warp_reduction()`.
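For reference, a minimal sketch of what such device-side auto-selection could look like; only `__supports_warp_reduction()` is taken from this thread, while `__groupreduce`, `__thread_groupreduce`, and `__warp_groupreduce` are illustrative stand-ins, not the PR's actual internals:

```julia
# Hypothetical sketch of device-side algorithm auto-selection. Backends that
# have warp shuffles would overload `__supports_warp_reduction`; the two
# reduction helpers below are stand-ins for the real implementations.
__supports_warp_reduction() = false           # generic fallback: no warp shuffles

__thread_groupreduce(op, val, neutral) = val  # stand-in: shared-memory tree reduction
__warp_groupreduce(op, val, neutral) = val    # stand-in: shfl_down-based reduction

function __groupreduce(op, val, neutral)
    # With a constant query per backend, this branch folds away at compile time,
    # so users never have to pick the algorithm themselves.
    if __supports_warp_reduction()
        return __warp_groupreduce(op, val, neutral)
    else
        return __thread_groupreduce(op, val, neutral)
    end
end
```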
while s > 0x00
    if (local_idx - 0x01) < s
        other_idx = local_idx + s
        if other_idx ≤ groupsize
            @inbounds storage[local_idx] = op(storage[local_idx], storage[other_idx])
        end
    end
    @synchronize()
(I assume this code is GPU only anyways)
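For readers skimming the thread, the quoted loop is the standard shared-memory tree reduction: the stride `s` is presumably initialized to half the group size and halved after each `@synchronize()`'d step (neither is shown in the excerpt). A plain-Julia sketch of the same access pattern, with the function name `tree_reduce!` and the power-of-two length being illustrative assumptions:

```julia
# Plain-Julia sketch of the tree-reduction pattern from the quoted kernel code:
# lane `local_idx` combines with lane `local_idx + s`, then the stride is halved.
# In the real kernel each step ends with `@synchronize()`; here the inner loop is
# sequential, so no barrier is needed.
function tree_reduce!(op, storage::AbstractVector)
    n = length(storage)   # assumed to be the (power-of-two) group size
    s = n >> 1
    while s > 0
        for local_idx in 1:s
            other_idx = local_idx + s
            if other_idx <= n
                @inbounds storage[local_idx] = op(storage[local_idx], storage[other_idx])
            end
        end
        s >>= 1           # the next synchronized step works on half as many lanes
    end
    return storage[1]
end

tree_reduce!(+, Float32[1, 2, 3, 4, 5, 6, 7, 8])  # == 36.0f0
```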
src/reduce.jl
Outdated
macro shfl_down(val, offset)
    return quote
        $__shfl_down($(esc(val)), $(esc(offset)))
    end
end
If it isn't user-facing and doesn't need special CPU handling, you don't need to introduce a new macro
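For context, the alternative being suggested is to call the backend intrinsic as an ordinary function instead of wrapping it in a macro. A minimal sketch, where the generic fallback and the CUDA overload mentioned in the comment are assumptions for illustration, not code from this PR:

```julia
# Sketch of the non-macro alternative: backends overload `__shfl_down` directly
# and call sites invoke it as a plain function, so no extra macro is required.
__shfl_down(val, offset) = val  # generic fallback: no shuffle available

# A CUDA backend would provide something along the lines of:
#   __shfl_down(val, offset) = CUDA.shfl_down_sync(CUDA.FULL_MASK, val, offset)

partial = 3.0f0
partial = __shfl_down(partial, 1)  # plain function call at the use site
```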
test/groupreduce.jl
Outdated
@kernel function groupreduce_1!(y, x, op, neutral, algo)
    i = @index(Global)
    val = i > length(x) ? neutral : x[i]
    res = @groupreduce(op, val, neutral, algo)
    i == 1 && (y[1] = res)
end

@kernel function groupreduce_2!(y, x, op, neutral, algo, ::Val{groupsize}) where {groupsize}
    i = @index(Global)
    val = i > length(x) ? neutral : x[i]
    res = @groupreduce(op, val, neutral, algo, groupsize)
    i == 1 && (y[1] = res)
end
These need to be `cpu=false` since you are using a non-top-level `@synchronize`
@kernel cpu=false function groupreduce_1!(y, x, op, neutral)
    i = @index(Global)

Suggested change:
@kernel cpu = false function groupreduce_1!(y, x, op, neutral)
@kernel cpu = false function groupreduce_2!(y, x, op, neutral, ::Val{groupsize}) where {groupsize}
groupsizes = "$backend" == "oneAPIBackend" ?
    (256,) :
    (256, 512, 1024)
@testset "@groupreduce" begin

Suggested change:
return @testset "@groupreduce" begin
I just saw that JuliaGPU/AMDGPU.jl#729 has been closed. What are the prospects for this PR, or for the general idea of a reduction abstraction being added to KA.jl?
That is still a goal of mine, but I am prioritizing getting the move of the CPU backend to POCL done.
JuliaGPU/AMDGPU.jl#729 was closed because, from my testing, I didn't see a major performance improvement from the warp reduce, and in some cases (like fused softmax) it was actually slower. As for merging this PR, my assumption was that #562 should go first; @vchuravy, correct me if I'm wrong.
Apologies if my assessment is not entirely accurate, as I am not intimately familiar with all the internal intricacies of … Imagine, for example, …
In this case …
That's a perfectly valid solution. The approach I took was to dictate that any reduction call must be made with …
@pxl-th Does it make sense to have a way to fetch the warpsize and the maximum number of warps from the kernel?
Also, would it make sense to export …
Implement reduction API. Supports two types of algorithms:
- Thread group reduction: uses shmem of length `groupsize`, no bank conflicts, no divergence.
- `shfl_down` within warps: uses shmem of length `32`; reduction within warps storing results in shmem, followed by a final warp reduction using the values stored in shmem. Backends are only required to implement the `shfl_down` intrinsic, which AMDGPU/CUDA/Metal have (not sure about other backends). Whether a backend supports this is queried via `KA.__supports_warp_reduction()`.

Usage: `res = @groupreduce op val neutral`
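A minimal end-to-end usage sketch of the API proposed in this PR, modeled on the `groupreduce_1!` test kernel quoted earlier. It assumes the branch from this PR; the kernel name `groupsum!` is illustrative, and the commented-out launch lines assume a GPU backend such as CUDA or AMDGPU:

```julia
using KernelAbstractions

# One workgroup reduces `x` with `+` and writes the result to `y[1]`
# (same shape as the `groupreduce_1!` test kernel quoted earlier).
@kernel cpu = false function groupsum!(y, x, neutral)
    i = @index(Global)
    val = i > length(x) ? neutral : x[i]
    res = @groupreduce(+, val, neutral)
    i == 1 && (y[1] = res)
end

# Illustrative launch; substitute a real GPU backend, e.g. CUDABackend() or ROCBackend().
# backend = CUDABackend()
# x = KernelAbstractions.ones(backend, Float32, 256)
# y = KernelAbstractions.zeros(backend, Float32, 1)
# groupsum!(backend, 256)(y, x, 0.0f0; ndrange = 256)
# KernelAbstractions.synchronize(backend)
```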