Conversation
Benchmark Results
Benchmark Plots: A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
UPD: #218 (comment)
src/reduce.jl
Outdated
- `algo` specifies which reduction algorithm to use:
  - `Reduction.thread`:
    Perform thread group reduction (requires `groupsize * sizeof(T)` bytes of shared memory).
    Available across all backends.
  - `Reduction.warp`:
    Perform warp group reduction (requires `32 * sizeof(T)` bytes of shared memory).
    Potentially faster, since it requires fewer writes to shared memory.
    To query whether a backend supports warp reduction, use `supports_warp_reduction(backend)`.
Why is that needed? Shouldn't the backend go and use warp reductions if it can?
I'm now doing an auto-selection of the algorithm based on the device function `__supports_warp_reduction()`.
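For reference, a minimal sketch of what such device-side auto-selection could look like; only `__supports_warp_reduction()` is taken from this thread, while `__groupreduce`, `__thread_groupreduce`, and `__warp_groupreduce` are illustrative stand-ins, not the PR's actual internals:

```julia
# Hypothetical sketch of device-side algorithm auto-selection. Backends that
# have warp shuffles would overload `__supports_warp_reduction`; the two
# reduction helpers below are stand-ins for the real implementations.
__supports_warp_reduction() = false           # generic fallback: no warp shuffles

__thread_groupreduce(op, val, neutral) = val  # stand-in: shared-memory tree reduction
__warp_groupreduce(op, val, neutral) = val    # stand-in: shfl_down-based reduction

function __groupreduce(op, val, neutral)
    # With a constant query per backend, this branch folds away at compile time,
    # so users never have to pick the algorithm themselves.
    if __supports_warp_reduction()
        return __warp_groupreduce(op, val, neutral)
    else
        return __thread_groupreduce(op, val, neutral)
    end
end
```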
while s > 0x00
    if (local_idx - 0x01) < s
        other_idx = local_idx + s
        if other_idx ≤ groupsize
            @inbounds storage[local_idx] = op(storage[local_idx], storage[other_idx])
        end
    end
    @synchronize()
(I assume this code is GPU only anyways)
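For readers skimming the thread, the quoted loop is the standard shared-memory tree reduction: the stride `s` is presumably initialized to half the group size and halved after each `@synchronize()`'d step (neither is shown in the excerpt). A plain-Julia sketch of the same access pattern, with the function name `tree_reduce!` and the power-of-two length being illustrative assumptions:

```julia
# Plain-Julia sketch of the tree-reduction pattern from the quoted kernel code:
# lane `local_idx` combines with lane `local_idx + s`, then the stride is halved.
# In the real kernel each step ends with `@synchronize()`; here the inner loop is
# sequential, so no barrier is needed.
function tree_reduce!(op, storage::AbstractVector)
    n = length(storage)   # assumed to be the (power-of-two) group size
    s = n >> 1
    while s > 0
        for local_idx in 1:s
            other_idx = local_idx + s
            if other_idx <= n
                @inbounds storage[local_idx] = op(storage[local_idx], storage[other_idx])
            end
        end
        s >>= 1           # the next synchronized step works on half as many lanes
    end
    return storage[1]
end

tree_reduce!(+, Float32[1, 2, 3, 4, 5, 6, 7, 8])  # == 36.0f0
```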
src/reduce.jl
Outdated
macro shfl_down(val, offset)
    return quote
        $__shfl_down($(esc(val)), $(esc(offset)))
    end
end
If it isn't user-facing and doesn't need special CPU handling, you don't need to introduce a new macro
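For context, the alternative being suggested is to call the backend intrinsic as an ordinary function instead of wrapping it in a macro. A minimal sketch, where the generic fallback and the CUDA overload mentioned in the comment are assumptions for illustration, not code from this PR:

```julia
# Sketch of the non-macro alternative: backends overload `__shfl_down` directly
# and call sites invoke it as a plain function, so no extra macro is required.
__shfl_down(val, offset) = val  # generic fallback: no shuffle available

# A CUDA backend would provide something along the lines of:
#   __shfl_down(val, offset) = CUDA.shfl_down_sync(CUDA.FULL_MASK, val, offset)

partial = 3.0f0
partial = __shfl_down(partial, 1)  # plain function call at the use site
```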
test/groupreduce.jl
Outdated
@kernel function groupreduce_1!(y, x, op, neutral, algo)
    i = @index(Global)
    val = i > length(x) ? neutral : x[i]
    res = @groupreduce(op, val, neutral, algo)
    i == 1 && (y[1] = res)
end

@kernel function groupreduce_2!(y, x, op, neutral, algo, ::Val{groupsize}) where {groupsize}
    i = @index(Global)
    val = i > length(x) ? neutral : x[i]
    res = @groupreduce(op, val, neutral, algo, groupsize)
    i == 1 && (y[1] = res)
end
These need to be `cpu=false` since you are using a non-top-level `@synchronize`
@kernel cpu=false function groupreduce_1!(y, x, op, neutral)
    i = @index(Global)

Suggested change:
@kernel cpu = false function groupreduce_1!(y, x, op, neutral)
@kernel cpu = false function groupreduce_2!(y, x, op, neutral, ::Val{groupsize}) where {groupsize}
groupsizes = "$backend" == "oneAPIBackend" ?
    (256,) :
    (256, 512, 1024)
@testset "@groupreduce" begin

Suggested change:
return @testset "@groupreduce" begin
I just saw that JuliaGPU/AMDGPU.jl#729 has been closed. What are the prospects for this PR, or for the general idea of a reduction abstraction being added to KA.jl?
That is still a goal of mine, but I am prioritizing getting the move of the CPU backend to POCL done.
JuliaGPU/AMDGPU.jl#729 was closed because, from my testing, I didn't see a major performance improvement from the warp reduce, and in some cases (like fused softmax) it was actually slower. As for merging this PR, my assumption was that #562 should go first; @vchuravy, correct me if I'm wrong.
Apologies if my assessment is not entirely accurate, as I am not intimately familiar with all the internal intricacies of … Imagine, for example, …
In this case …
That's a perfectly valid solution. The approach I took was to dictate that any reduction call must be made with …
@pxl-th Does it make sense to have a way to fetch the warpsize and the maximum number of warps from the kernel?
Also, would it make sense to export …
Implement reduction API. Supports two types of algorithms:
- Thread group reduction: uses shmem of length `groupsize`, no bank conflicts, no divergence.
- `shfl_down` within warps: uses shmem of length `32`; reduction within warps storing results in shmem, followed by a final warp reduction using the values stored in shmem. Backends are only required to implement the `shfl_down` intrinsic, which AMDGPU/CUDA/Metal have (not sure about other backends). Whether a backend supports this is queried via `KA.__supports_warp_reduction()`.

Usage: `res = @groupreduce op val neutral`
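A minimal end-to-end usage sketch of the API proposed in this PR, modeled on the `groupreduce_1!` test kernel quoted earlier. It assumes the branch from this PR; the kernel name `groupsum!` is illustrative, and the commented-out launch lines assume a GPU backend such as CUDA or AMDGPU:

```julia
using KernelAbstractions

# One workgroup reduces `x` with `+` and writes the result to `y[1]`
# (same shape as the `groupreduce_1!` test kernel quoted earlier).
@kernel cpu = false function groupsum!(y, x, neutral)
    i = @index(Global)
    val = i > length(x) ? neutral : x[i]
    res = @groupreduce(+, val, neutral)
    i == 1 && (y[1] = res)
end

# Illustrative launch; substitute a real GPU backend, e.g. CUDABackend() or ROCBackend().
# backend = CUDABackend()
# x = KernelAbstractions.ones(backend, Float32, 256)
# y = KernelAbstractions.zeros(backend, Float32, 1)
# groupsum!(backend, 256)(y, x, 0.0f0; ndrange = 256)
# KernelAbstractions.synchronize(backend)
```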