[Do not merge] Switch to GPUArrays.jl reduction implementation #628
christiangnrd wants to merge 1 commit into main
Your PR requires formatting changes to meet the project's style guidelines.
diff --git a/perf/runbenchmarks.jl b/perf/runbenchmarks.jl
index ba5e0d40..1d7901c5 100644
--- a/perf/runbenchmarks.jl
+++ b/perf/runbenchmarks.jl
@@ -1,6 +1,6 @@
# benchmark suite execution and codespeed submission
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
using Metal
diff --git a/test/runtests.jl b/test/runtests.jl
index 4ee51134..fb376e4f 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -6,7 +6,7 @@ import REPL
using Test
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
# Quit without erroring if Metal loaded without issues on unsupported platforms
if !Sys.isapple()
Leaving the current
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #628 +/- ##
=======================================
Coverage 80.63% 80.63%
=======================================
Files 61 61
Lines 2722 2722
=======================================
Hits 2195 2195
Misses 527 527
Metal Benchmarks
| Benchmark suite | Current: c0eddd1 | Previous: 1942968 | Ratio |
|---|---|---|---|
| latency/precompile | 9830015416 ns | 9844653958 ns | 1.00 |
| latency/ttfp | 3989128875 ns | 3972040229 ns | 1.00 |
| latency/import | 1281988208 ns | 1275530958.5 ns | 1.01 |
| integration/metaldevrt | 830312.5 ns | 828500 ns | 1.00 |
| integration/byval/slices=1 | 1532291.5 ns | 1536750 ns | 1.00 |
| integration/byval/slices=3 | 8864917 ns | 9632625 ns | 0.92 |
| integration/byval/reference | 1535333 ns | 1543583 ns | 0.99 |
| integration/byval/slices=2 | 2554083 ns | 2621958.5 ns | 0.97 |
| kernel/indexing | 582792 ns | 567792 ns | 1.03 |
| kernel/indexing_checked | 577208 ns | 569292 ns | 1.01 |
| kernel/launch | 9042 ns | 9208 ns | 0.98 |
| array/construct | 6125 ns | 6625 ns | 0.92 |
| array/broadcast | 579250 ns | 583375 ns | 0.99 |
| array/random/randn/Float32 | 821167 ns | 784333 ns | 1.05 |
| array/random/randn!/Float32 | 622625 ns | 623250 ns | 1.00 |
| array/random/rand!/Int64 | 555395.5 ns | 547458 ns | 1.01 |
| array/random/rand!/Float32 | 584125 ns | 585291 ns | 1.00 |
| array/random/rand/Int64 | 777375 ns | 771250 ns | 1.01 |
| array/random/rand/Float32 | 628375 ns | 622687 ns | 1.01 |
| array/accumulate/Int64/1d | 1261292 ns | 1277104.5 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 1800500 ns | 1868333 ns | 0.96 |
| array/accumulate/Int64/dims=2 | 2165958.5 ns | 2183625 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 11643104 ns | 11737104 ns | 0.99 |
| array/accumulate/Int64/dims=2L | 9718917 ns | 9771416.5 ns | 0.99 |
| array/accumulate/Float32/1d | 1141375 ns | 1142833 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 1562333.5 ns | 1570458 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 1865875 ns | 1931625 ns | 0.97 |
| array/accumulate/Float32/dims=1L | 9890916.5 ns | 9864375 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 7298500 ns | 7308021 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 1077583 ns | 1373353.5 ns | 0.78 |
| array/reductions/reduce/Int64/dims=1 | 987500 ns | 1069291.5 ns | 0.92 |
| array/reductions/reduce/Int64/dims=2 | 935145.5 ns | 1193292 ns | 0.78 |
| array/reductions/reduce/Int64/dims=1L | 2350750 ns | 2113062.5 ns | 1.11 |
| array/reductions/reduce/Int64/dims=2L | 2815291 ns | 3456458 ns | 0.81 |
| array/reductions/reduce/Float32/1d | 1029750 ns | 971625 ns | 1.06 |
| array/reductions/reduce/Float32/dims=1 | 956125 ns | 808458 ns | 1.18 |
| array/reductions/reduce/Float32/dims=2 | 870375 ns | 768979 ns | 1.13 |
| array/reductions/reduce/Float32/dims=1L | 1659354.5 ns | 1739041 ns | 0.95 |
| array/reductions/reduce/Float32/dims=2L | 2781167 ns | 1772125 ns | 1.57 |
| array/reductions/mapreduce/Int64/1d | 1000375 ns | 1456146 ns | 0.69 |
| array/reductions/mapreduce/Int64/dims=1 | 936083 ns | 1074875 ns | 0.87 |
| array/reductions/mapreduce/Int64/dims=2 | 873500 ns | 1206417 ns | 0.72 |
| array/reductions/mapreduce/Int64/dims=1L | 2346562.5 ns | 2119292 ns | 1.11 |
| array/reductions/mapreduce/Int64/dims=2L | 2844729 ns | 3444375 ns | 0.83 |
| array/reductions/mapreduce/Float32/1d | 1045959 ns | 990792 ns | 1.06 |
| array/reductions/mapreduce/Float32/dims=1 | 947959 ns | 810062.5 ns | 1.17 |
| array/reductions/mapreduce/Float32/dims=2 | 868041.5 ns | 761104 ns | 1.14 |
| array/reductions/mapreduce/Float32/dims=1L | 1668167 ns | 1740812.5 ns | 0.96 |
| array/reductions/mapreduce/Float32/dims=2L | 2815354.5 ns | 1781292 ns | 1.58 |
| array/private/copyto!/gpu_to_gpu | 636791 ns | 651375 ns | 0.98 |
| array/private/copyto!/cpu_to_gpu | 795791 ns | 805542 ns | 0.99 |
| array/private/copyto!/gpu_to_cpu | 811292 ns | 817667 ns | 0.99 |
| array/private/iteration/findall/int | 1657000 ns | 1646500 ns | 1.01 |
| array/private/iteration/findall/bool | 1451937.5 ns | 1444584 ns | 1.01 |
| array/private/iteration/findfirst/int | 2074750 ns | 1754958.5 ns | 1.18 |
| array/private/iteration/findfirst/bool | 1635145.5 ns | 1703625 ns | 0.96 |
| array/private/iteration/scalar | 5542583.5 ns | 4772500 ns | 1.16 |
| array/private/iteration/logical | 2734958 ns | 2536917 ns | 1.08 |
| array/private/iteration/findmin/1d | 1870167 ns | 1815666 ns | 1.03 |
| array/private/iteration/findmin/2d | 1891583.5 ns | 1431750 ns | 1.32 |
| array/private/copy | 573791.5 ns | 538167 ns | 1.07 |
| array/shared/copyto!/gpu_to_gpu | 83750 ns | 86375 ns | 0.97 |
| array/shared/copyto!/cpu_to_gpu | 82625 ns | 86583 ns | 0.95 |
| array/shared/copyto!/gpu_to_cpu | 91458 ns | 84833 ns | 1.08 |
| array/shared/iteration/findall/int | 1643437.5 ns | 1609874.5 ns | 1.02 |
| array/shared/iteration/findall/bool | 1471812.5 ns | 1464354 ns | 1.01 |
| array/shared/iteration/findfirst/int | 1830375 ns | 1377750 ns | 1.33 |
| array/shared/iteration/findfirst/bool | 1385917 ns | 1319166 ns | 1.05 |
| array/shared/iteration/scalar | 206917 ns | 217500 ns | 0.95 |
| array/shared/iteration/logical | 2750042 ns | 2288708.5 ns | 1.20 |
| array/shared/iteration/findmin/1d | 1607895.5 ns | 1421750 ns | 1.13 |
| array/shared/iteration/findmin/2d | 1917291.5 ns | 1430854.5 ns | 1.34 |
| array/shared/copy | 251042 ns | 248666 ns | 1.01 |
| array/permutedims/4d | 2442208 ns | 2438438 ns | 1.00 |
| array/permutedims/2d | 1184291.5 ns | 1193250 ns | 0.99 |
| array/permutedims/3d | 1737625 ns | 1768458 ns | 0.98 |
| metal/synchronization/stream | 19667 ns | 19916 ns | 0.99 |
| metal/synchronization/context | 20292 ns | 20375 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
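For readers checking the numbers, the Ratio column is consistent with the current time divided by the previous time, rounded to two decimals, so values above 1 mean the PR is slower. The sketch below recomputes it for four of the reduction rows. The `ratio` helper and the 1.10 cutoff are hypothetical, chosen only for illustration; they are not taken from the benchmark workflow.

```julia
# Timings in nanoseconds, copied from the reduction rows of the benchmark table.
# Each entry is (benchmark name, current, previous).
rows = [
    ("array/reductions/reduce/Int64/1d",           1077583.0, 1373353.5),
    ("array/reductions/reduce/Float32/dims=2L",    2781167.0, 1772125.0),
    ("array/reductions/mapreduce/Int64/1d",        1000375.0, 1456146.0),
    ("array/reductions/mapreduce/Float32/dims=2L", 2815354.5, 1781292.0),
]

# Hypothetical cutoff for illustration; not a setting taken from the workflow.
regression_threshold = 1.10

# Ratio as it appears in the table: current time divided by previous time.
ratio(current, previous) = round(current / previous; digits = 2)

for (name, current, previous) in rows
    r = ratio(current, previous)
    flag = r > regression_threshold ? "regression" : "ok"
    println(rpad(name, 46), r, "  ", flag)
end
```

With this cutoff, the two Float32 `dims=2L` rows are flagged and the two Int64 `1d` rows are not.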
I think I'd rather we do it in one pass, because the change needs to be made across back-ends.
In any case, despite some regressions the overall performance seems better here than over in CUDA.jl. |
Don't remove the file yet, to avoid a merge conflict with #627