`shfl_down` intrinsic by christiangnrd · Pull Request #661 · JuliaGPU/KernelAbstractions.jl

christiangnrd · 2025-11-22T02:52:19Z

Requires #668 ~~This may be all that's needed?~~

~~Could maybe add simdgroup (warps, subgroups) indexing intrinsics but I'd have to check if every backend supports this (I assume they would?)~~

github-actions · 2025-11-22T02:52:44Z

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Click here to view the suggested changes.

diff --git a/test/intrinsics.jl b/test/intrinsics.jl
index 68fa9e48..d27de5e9 100644
--- a/test/intrinsics.jl
+++ b/test/intrinsics.jl
@@ -36,10 +36,10 @@ function shfl_down_test_kernel(a, b)
         value = temp[idx]
 
         value = value + KI.shfl_down(value, 16)
-        value = value + KI.shfl_down(value,  8)
-        value = value + KI.shfl_down(value,  4)
-        value = value + KI.shfl_down(value,  2)
-        value = value + KI.shfl_down(value,  1)
+        value = value + KI.shfl_down(value, 8)
+        value = value + KI.shfl_down(value, 4)
+        value = value + KI.shfl_down(value, 2)
+        value = value + KI.shfl_down(value, 1)
 
         b[idx] = value
     end
@@ -152,7 +152,7 @@ function intrinsics_testsuite(backend, AT)
             dev_a = AT(a)
             dev_b = AT(zeros(T, 32))
 
-            KI.@kernel backend() workgroupsize=32 shfl_down_test_kernel(dev_a, dev_b)
+            KI.@kernel backend() workgroupsize = 32 shfl_down_test_kernel(dev_a, dev_b)
 
             b = Array(dev_b)
             @test sum(a) ≈ b[1]

vchuravy · 2025-11-22T03:10:44Z

So the backends that I am worried about are Metal and, to a lesser extent, Intel.

vchuravy · 2025-11-22T03:11:57Z

test/intrinsics.jl

+    # This is not valid
+    idx = KI.get_local_id().x
+
+    temp = KI.localmemory(eltype(b), 32)


So we need a query function to find the subgroup size? Then pass that to a Val?

Currently this is like #559 where it assumes that subgroup size is always 32.

The "This is not valid" is because it's using the local_id, but we could do as in #559 and take it modulo 32 to find the subgroup position and so on.

So I think AMD has some chips where the subgroup size is 64, so we should have some way for the user to query this (even if it is just on the host).
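To make the modulo idea concrete, here is a minimal sketch in plain Julia of how a lane position could be derived from a local index once the subgroup size is known. The helper names `subgroup_index` and `lane_index` are hypothetical and not part of KernelAbstractions, and the sketch assumes that consecutive 1-based local ids are packed linearly into subgroups, which a given backend may not guarantee.

```julia
# Hypothetical helpers (not part of KernelAbstractions): map a 1-based
# local thread index onto a subgroup index and a lane within that subgroup.
# Assumption: consecutive local ids are packed linearly into subgroups.
subgroup_index(local_id::Integer, sg_size::Integer) = (local_id - 1) ÷ sg_size + 1
lane_index(local_id::Integer, sg_size::Integer) = (local_id - 1) % sg_size + 1

# The same workgroup of 128 threads splits differently depending on the
# subgroup size reported by the device (32 or 64 here).
for sg_size in (32, 64)
    n_subgroups = cld(128, sg_size)
    println("subgroup size $sg_size: $n_subgroups subgroups; thread 70 is lane ",
            lane_index(70, sg_size), " of subgroup ", subgroup_index(70, sg_size))
end
```

Hard-coding 32 in place of `sg_size` would give the wrong subgroup index for thread 70 on a device whose subgroups hold 64 threads, which is the motivation for a query function.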

| GPU Backend | Host-Side Method | Device-Side Method (Intrinsic) |
| --- | --- | --- |
| Metal | `thread_execution_width` property of `MTLComputePipelineState` (need compiled kernel) | `[[threads_per_simdgroup]]` |
| AMDGPU | `wavefrontsize(dev::HIPDevice)` | `wavefrontsize()` |
| CUDA | `warpsize(dev::CuDevice)` | `warpsize()` |
| OpenCL | | `get_sub_group_size()` |
| oneAPI | | `get_sub_group_size()`? |

Is Metal the only backend that currently lacks dynamic local memory?

On OpenCL and oneAPI, the host-side methods are probably `CL_DEVICE_SUB_GROUP_SIZES_INTEL` with `clGetDeviceInfo` and `subGroupSizes` with `zeDeviceGetComputeProperties`, respectively.

refs:
OpenCL extension doc
Intel levelZero docs
pocl cuda driver

christiangnrd · 2025-11-22T20:10:18Z

So the backends that I am worried about are Metal and, to a lesser extent, Intel.

What are your worries?

Hamiltonian-Action · 2025-12-20T00:03:07Z

For the sake of maintaining sanity, lest undocumented behaviour run amok, should this PR eventually be merged then would it be possible to explicitly specify whether the behaviour is synchronising or not?

I understand that this is still very much the early stages of some work-in-progress but this code segment from the test suite indicates assumed synchronicity -- otherwise, it is perfectly legal behaviour for this reduction to induce a race condition -- but the juxtaposition with vendor nomenclature suggests the contrary.

value = value + KI.shfl_down(value, 16)
value = value + KI.shfl_down(value,  8)
value = value + KI.shfl_down(value,  4)
value = value + KI.shfl_down(value,  2)
value = value + KI.shfl_down(value,  1)
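The dependence on lockstep execution can be illustrated with a small CPU model of the reduction above. This is only a sketch in plain Julia: `simulated_shfl_down` and `simulated_subgroup_sum` are hypothetical names, not the `KI.shfl_down` intrinsic, and the rule that out-of-range lanes keep their own value is an assumption borrowed from CUDA's shuffle behaviour. The model also assumes the subgroup size is a power of two.

```julia
# CPU model of a shuffle-down across one subgroup (hypothetical helper, not
# the KI.shfl_down intrinsic). Lane i receives the value held by lane
# i + delta; lanes whose source would fall outside the subgroup keep their
# own value (an assumption modelled on CUDA's behaviour).
function simulated_shfl_down(vals::AbstractVector, delta::Integer)
    width = length(vals)
    return [i + delta <= width ? vals[i + delta] : vals[i] for i in 1:width]
end

# Tree reduction over one subgroup: the offset halves each step
# (16, 8, 4, 2, 1 for 32 lanes), so lane 1 ends up holding the full sum.
function simulated_subgroup_sum(a::AbstractVector)
    vals = copy(a)
    delta = length(vals) ÷ 2
    while delta >= 1
        # Every lane's old value is read before any lane is updated. This is
        # the lockstep behaviour the test kernel implicitly relies on.
        vals = vals .+ simulated_shfl_down(vals, delta)
        delta ÷= 2
    end
    return vals
end

a = collect(1:32)
result = simulated_subgroup_sum(a)
println(result[1] == sum(a))  # only lane 1 is guaranteed to hold the total
```

If some lanes could run a step ahead of others, a lane might read a neighbour's value after it had already been updated, and lane 1 would no longer hold the sum. That is why it matters whether the API promises synchronising behaviour.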

christiangnrd · 2025-12-20T13:13:05Z

@Hamiltonian-Action Thanks for the comment, I’ll make sure to fix the tests when I return to this.

Hamiltonian-Action · 2025-12-22T00:19:48Z

The KernelAbstractions tests are not the principal concern, given that they are internal to the package itself. Rather, my comment was more towards explicitly specifying the assumed synchronicity or lack thereof at the level of the API rather than leaving it up to the individual backends. In essence, one of the following additions to the docstring and specification would be encouraged depending on whether _sync variants are also to be introduced:
1- MUST synchronise
2- MUST NOT synchronise
3- MAY synchronise

christiangnrd · 2025-12-22T00:56:47Z

Noted. The idea is that it should follow the backend conventions, so I’ll make sure to go through the documentation and ensure that the behaviour is consistent and documented.

christiangnrd · 2026-01-03T19:44:20Z

Closing, since #668 is required to do this right, so I integrated this into that PR.

christiangnrd force-pushed the shfl_down branch from 17c7234 to ac11a2f Compare November 22, 2025 03:03

vchuravy reviewed Nov 22, 2025

christiangnrd marked this pull request as draft December 20, 2025 13:13

christiangnrd force-pushed the shfl_down branch 2 times, most recently from 530821a to 6852410 Compare January 2, 2026 20:16

christiangnrd and others added 15 commits January 2, 2026 18:37

Add test for kernels with multiple shared buffers

0ed8290

KernelIntrinsics Tweaks

e798981

Fix temporary AK compat

93753bb

Improve KI tests

300d432

Initial subgroups support

15597e4

[TEMP]

28b5884

[Temp] CI

6f09cac

Add oneAPI branch

59e6359

Adjust test

08a82d4

Add CI for not-yet-existing branches

bd4ad1c

Revert "[TEMP]"

9acda75

This reverts commit 956dc2e.

shfl_down intrinsics

77c6e20

Co-Authored-By: Anton Smirnov <tonysmn97@gmail.com>

Add note about need to synchronize

9e5f965

Fixup

f8692be

Fix shfl_down test

84730d2

christiangnrd force-pushed the shfl_down branch from 6852410 to 84730d2 Compare January 2, 2026 22:38

1.10 fix in docstring

cdb07c3

christiangnrd closed this Jan 3, 2026
