adding atomic support with atomix #299
tkf
left a comment
KernelAbstractions.jl doesn't have to depend on UnsafeAtomicsLLVM.jl (and LLVM.jl)
Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>
vchuravy
left a comment
Looks great!
Probably needs docs as well as AMDGPU support.
Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
|
Hi! As for 3. What about atomic primitives like atomic_add!(...), I'd like to say that I have several kernels that use Also I'm curious if it will support things like: @atomic max(x[i], v) |
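For readers unfamiliar with the call form asked about here, the following is a minimal sketch using Base Julia's own `@atomic` on a struct field (Julia 1.7 or later), not Atomix's array-element version. The `Cell` type and the `splat_max!` helper are hypothetical names used only for illustration. In Base, the call form returns an `old => new` pair; whether the KernelAbstractions macro ends up behaving identically on `x[i]` is an assumption.

```julia
# Sketch only: Base.@atomic on a field, standing in for an array-element atomic.
mutable struct Cell
    @atomic value::Int
end

# Atomically replace cell.value with max(cell.value, v).
# Base returns the pair old => updated, which destructures into two values.
function splat_max!(cell::Cell, v::Int)
    old, updated = @atomic max(cell.value, v)
    return old, updated
end

cell = Cell(3)
r1 = splat_max!(cell, 10)   # value grows from 3 to 10
r2 = splat_max!(cell, 5)    # 5 is smaller, so the value stays at 10
final = @atomic cell.value  # atomic read of the field
```

The old value is useful when a kernel needs to know whether its contribution actually changed the stored maximum.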
Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>
|
I don't mind reworking this PR and #282 so we get both the macro and better ordering support from Atomix and also the I figure most people will want to use the macro, but some people will prefer the |
|
Let's merge this for now and then you can open a second PR? |
|
This one is not ready to be merged |
|
Oops. I got excited that it passed tests :) |
|
It was missing docs and tests, at least... I will add them when I get the chance. To be fair, atomix should have all the necessary tests, I just wanted to double check here. Documentation does not need to be long, but having a section for atomics with an example would go a long way. |
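As a rough idea of what a short documentation example could look like, here is a CPU-only sketch built on `Base.Threads` atomics rather than on a KernelAbstractions kernel. The `histogram` function and its arguments are hypothetical; the point is only that several threads may update the same bin, so the increment has to be an atomic read-modify-write.

```julia
using Base.Threads: Atomic, atomic_add!, @threads

# Count how often each bin index occurs. Several threads may hit the same bin,
# so a plain `bins[k] += 1` would be a data race.
function histogram(indices::Vector{Int}, nbins::Int)
    bins = [Atomic{Int}(0) for _ in 1:nbins]
    @threads for i in eachindex(indices)
        atomic_add!(bins[indices[i]], 1)   # atomic read-modify-write
    end
    return [b[] for b in bins]             # unwrap the atomic boxes
end

indices = [mod1(i, 4) for i in 1:1000]     # each of the 4 bins is hit 250 times
counts = histogram(indices, 4)
```

A kernel version would replace the threaded loop with a per-work-item index and the `atomic_add!` call with whatever atomic form the package settles on.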
|
I was just waiting to add docs until we settled the atomic "primitive" discussion. |
|
I've tried this PR and it looks like on CPU it only supports integer types.
Error:
ERROR: LoadError: InvalidIRError: compiling kernel #gpu_splat!(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to modify!)
Stacktrace:
[1] modify!
@ ~/.julia/packages/Atomix/F9VIX/src/core.jl:33
[2] macro expansion
@ ~/code/a.jl:28
[3] gpu_splat!
@ ~/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80
[4] gpu_splat!
@ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
[1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}, args::LLVM.Module)
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/validation.jl:139
[2] macro expansion
@ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:409 [inlined]
[3] macro expansion
@ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
[4] macro expansion
@ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:407 [inlined]
[5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
[6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
[7] #224
@ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
[8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
[9] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
[10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
[11] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
[12] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:293
[13] macro expansion
@ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
[14] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
@ CUDAKernels ~/.julia/packages/CUDAKernels/4VLF4/src/CUDAKernels.jl:272
[15] main()
@ Main ~/code/a.jl:40
[16] top-level scope
@ ~/code/a.jl:42
in expression starting at /home/pxl-th/code/a.jl:42
MWE code:
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic
CUDA.allowscalar(false)
n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512
Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)
to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)
@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    @atomic max(grid[idx], mlp_out[i])
end
function main()
    #device = CPU()
    device = CUDADevice()
    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n) # errors on CPU with Float32
    grid = rand(device, Int64, n) # errors on CPU with Float32
    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()
|
I was unable to replicate this error by running the provided code with 1.7.1 and 1.8.0-beta3 (just pulled from git). What OS are you using? Also, could you show the outputs of |
|
I'm on Ubuntu 22.04
|
|
I've just updated the MWE code; the version I posted before did not actually error :)
|
Right, I see the comments now, sorry! try |
|
Yes, that works, thanks! Although there is another issue, which is not critical for me, but might be worth mentioning:
MWE code:
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic
CUDA.allowscalar(false)
const NERF_STEPS = UInt32(1024)
const MIN_CONE_STEPSIZE = √3f0 / NERF_STEPS
n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512
Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)
Base.zeros(::CPU, T, shape) = zeros(T, shape)
Base.zeros(::CUDADevice, T, shape) = CUDA.zeros(T, shape)
to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)
@inline density_activation(x) = exp(x)
@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    old, new = @atomic max(grid[idx], mlp_out[i])
    @atomic grid[idx] = old
end
function main()
    # device = CPU()
    device = CUDADevice()
    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n)
    grid = zeros(device, Int64, n)
    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()
Error:
ERROR: LoadError: LLVM error: Cannot select: 0x77adc60: ch = AtomicStore<(store seq_cst (s64) into %ir.41, addrspace 1)> 0x4926e08:1, 0x6f629c8, 0x4926e08, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:245 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:201 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:11 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:14 @[ /home/pxl-th/code/a.jl:29 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
0x4926cd0: i64 = Register %0
0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x66aca88: i64 = Register %9
0x66ac0c8: i32 = Constant<3>
0x4926720: i64 = Constant<-8>
0x4926e08: i64,ch = AtomicLoadMax<(load store seq_cst (s64) on %ir.39, addrspace 1)> 0x7195c50:1, 0x6f629c8, 0x7195c50, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:374 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:18 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ]
0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
0x4926cd0: i64 = Register %0
0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x66aca88: i64 = Register %9
0x66ac0c8: i32 = Constant<3>
0x4926720: i64 = Constant<-8>
0x7195c50: i64,ch = llvm.nvvm.ldg.global.i<(load (s64) from %ir.34, addrspace 1)> 0x65cc278, TargetConstant:i64<5104>, 0x6f62b68, Constant:i32<8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:219 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:40 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ] ] ]
0x49271b0: i64 = TargetConstant<5104>
0x6f62b68: i64 = add 0x4926b98, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x4926b98: i64 = add 0x6f62278, 0x4926ed8, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x6f62278: i64,ch = CopyFromReg 0x65cc278, Register:i64 %4, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x7803a58: i64 = Register %4
0x4926ed8: i64 = shl nuw nsw 0x71ce2d8, Constant:i32<3>, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x71ce2d8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %8, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
0x77ae620: i64 = Register %8
0x66ac0c8: i32 = Constant<3>
0x4926720: i64 = Constant<-8>
0x7196880: i32 = Constant<8>
In function: _Z21julia_gpu_splat__430516CompilerMetadataI10StaticSizeI5_16__E12DynamicCheckvv7NDRangeILi1ES0_I4_1__ES0_I6_512__EvvEE13CuDeviceArrayI5Int64Li1ELi1EES3_I6UInt32Li1ELi1EES3_IS4_Li1ELi1EE
Stacktrace:
[1] handle_error(reason::Cstring)
@ LLVM ~/.julia/packages/LLVM/YSJ2s/src/core/context.jl:105
[2] LLVMTargetMachineEmitToMemoryBuffer
@ ~/.julia/packages/LLVM/YSJ2s/lib/13/libLLVM_h.jl:947 [inlined]
[3] emit(tm::LLVM.TargetMachine, mod::LLVM.Module, filetype::LLVM.API.LLVMCodeGenFileType)
@ LLVM ~/.julia/packages/LLVM/YSJ2s/src/targetmachine.jl:45
[4] mcgen(job::GPUCompiler.CompilerJob, mod::LLVM.Module, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/mcgen.jl:74
[5] macro expansion
@ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
[6] macro expansion
@ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:421 [inlined]
[7] macro expansion
@ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
[8] macro expansion
@ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:418 [inlined]
[9] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
[10] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
[11] #224
@ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
[12] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
[13] cufunction_compile(job::GPUCompiler.CompilerJob)
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
[14] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
@ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
[15] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
[16] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
@ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:292
[17] macro expansion
@ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
[18] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.StaticSize{(16,)}, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Nothing, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
@ CUDAKernels ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:273
[19] Kernel
@ ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:268 [inlined]
[20] main()
@ Main ~/code/a.jl:42
[21] top-level scope
@ ~/code/a.jl:45
in expression starting at /home/pxl-th/code/a.jl:45
|
Ah, I can replicate this error, but I am not sure if it is an Atomix or KernelAbstractions issue. It seems like the CPU version works fine, so maybe it's with UnsafeAtomicsLLVM? Would you be willing to open up a new issue either here or on Atomix (https://github.com/JuliaConcurrent/Atomix.jl) and ping @tkf? |
|
It's an LLVM issue, but it can be worked around at the level of (e.g.) CUDA.jl. See: JuliaConcurrent/Atomix.jl#33
|
@pxl-th, if you are still having trouble with Atomix, I created a separate PR with the atomic support from Core.Intrinsics and CUDA directly in #306. I also added the pkg commands to load in the subdirectory of CUDAKernels in a comment so you can just use it for now if you need. I've been struggling to get things to work as well, so I also added testing infrastructure for Atomix in #308. Hopefully we can iron out all the details there and get all this sorted. If you have run into any issues, please document them there! |
After some discussions on #282, we decided to use Atomix for atomic support in KA.
A few quick questions:
… @atomic macro, we need to specify that we are using the Atomix.@atomic macro in code that needs atomic operations. Should we overdub any @atomic macros in KA to specifically use Atomix?
What about atomic primitives like atomic_add!(...), and atomic_sub!(...) from Atomic attempts #282? These come from either CUDA or Core.Intrinsics. Maybe it's a good idea to use Atomix on top of Atomic attempts #282? I don't know how many people will use the primitives over the macro, to be honest.
Note, this should not be merged until JuliaRegistries/General#61002 is automerged.
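To make the primitive-versus-macro question concrete, here is a small CPU-only sketch that contrasts the two styles using Base Julia alone. It does not use the `atomic_add!`/`atomic_sub!` from #282, CUDA, or Atomix; `Threads.atomic_add!`, `Threads.atomic_sub!`, and `Base.@atomic` are stand-ins, and the `Counter` type is a hypothetical name. One visible difference in Base is the return value: the function-style primitives return the old value, while the macro's `+=` form returns the new one.

```julia
using Base.Threads: Atomic, atomic_add!, atomic_sub!

# Function-style primitives operate on an Atomic{T} box and return the OLD value.
counter = Atomic{Int}(10)
old_add = atomic_add!(counter, 5)   # counter becomes 15
old_sub = atomic_sub!(counter, 3)   # counter becomes 12

# Macro style operates on a field declared @atomic and returns the NEW value.
mutable struct Counter
    @atomic n::Int
end
c = Counter(10)
new_add = @atomic c.n += 5          # c.n becomes 15
new_sub = @atomic c.n -= 3          # c.n becomes 12
```

If both styles are offered, documenting which value each one returns would likely save users from subtle off-by-one-update bugs.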