[Do not merge] Switch to GPUArrays.jl accumulate implementation #625
christiangnrd wants to merge 3 commits into main
Metal Benchmarks
| Benchmark suite | Current: 84f519a | Previous: de3fd23 | Ratio |
|---|---|---|---|
| latency/precompile | 25018084416 ns | 25055876416 ns | 1.00 |
| latency/ttfp | 2129990500 ns | 2125052000 ns | 1.00 |
| latency/import | 1225508166 ns | 1219352833 ns | 1.01 |
| integration/metaldevrt | 956625 ns | 968354.5 ns | 0.99 |
| integration/byval/slices=1 | 1644625 ns | 1660375 ns | 0.99 |
| integration/byval/slices=3 | 10295687.5 ns | 8945875 ns | 1.15 |
| integration/byval/reference | 1633875 ns | 1638208 ns | 1.00 |
| integration/byval/slices=2 | 2747625 ns | 2721062.5 ns | 1.01 |
| kernel/indexing | 692437.5 ns | 703875 ns | 0.98 |
| kernel/indexing_checked | 681375 ns | 694208 ns | 0.98 |
| kernel/launch | 13020.5 ns | 12875 ns | 1.01 |
| array/construct | 6292 ns | 6083 ns | 1.03 |
| array/broadcast | 660542 ns | 670666.5 ns | 0.98 |
| array/random/randn/Float32 | 849938 ns | 879916 ns | 0.97 |
| array/random/randn!/Float32 | 621917 ns | 639812.5 ns | 0.97 |
| array/random/rand!/Int64 | 554792 ns | 567000 ns | 0.98 |
| array/random/rand!/Float32 | 589083 ns | 602916 ns | 0.98 |
| array/random/rand/Int64 | 752104.5 ns | 754292 ns | 1.00 |
| array/random/rand/Float32 | 545291 ns | 574541 ns | 0.95 |
| array/accumulate/Int64/1d | 2378188 ns | 1336875 ns | 1.78 |
| array/accumulate/Int64/dims=1 | 2295312.5 ns | 1912291.5 ns | 1.20 |
| array/accumulate/Int64/dims=2 | 2555417 ns | 2256916.5 ns | 1.13 |
| array/accumulate/Int64/dims=1L | 6595145.5 ns | 11644666.5 ns | 0.57 |
| array/accumulate/Int64/dims=2L | 18580062.5 ns | 9900979.5 ns | 1.88 |
| array/accumulate/Float32/1d | 1685084 ns | 1245625 ns | 1.35 |
| array/accumulate/Float32/dims=1 | 2124459 ns | 1630541.5 ns | 1.30 |
| array/accumulate/Float32/dims=2 | 2386125 ns | 1968750 ns | 1.21 |
| array/accumulate/Float32/dims=1L | 5082146 ns | 9898709 ns | 0.51 |
| array/accumulate/Float32/dims=2L | 14983750 ns | 7337354 ns | 2.04 |
| array/reductions/reduce/Int64/1d | 1349687.5 ns | 1381500.5 ns | 0.98 |
| array/reductions/reduce/Int64/dims=1 | 1177333 ns | 1154562.5 ns | 1.02 |
| array/reductions/reduce/Int64/dims=2 | 1291041 ns | 1287541 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 2127500 ns | 2078000 ns | 1.02 |
| array/reductions/reduce/Int64/dims=2L | 3575749.5 ns | 3569083 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 1015125 ns | 1047333.5 ns | 0.97 |
| array/reductions/reduce/Float32/dims=1 | 885375 ns | 899875 ns | 0.98 |
| array/reductions/reduce/Float32/dims=2 | 800416 ns | 801708.5 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 1386084 ns | 1393042 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 1909291 ns | 1903875 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 1338875 ns | 1353375 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 1142416 ns | 1160042 ns | 0.98 |
| array/reductions/mapreduce/Int64/dims=2 | 1287270.5 ns | 1282979 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 2103667 ns | 2111146 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 3442541.5 ns | 3466062 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 975541 ns | 1083604 ns | 0.90 |
| array/reductions/mapreduce/Float32/dims=1 | 890208.5 ns | 902542 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 788042 ns | 819041.5 ns | 0.96 |
| array/reductions/mapreduce/Float32/dims=1L | 1385458.5 ns | 1404791.5 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 1918042 ns | 1904375 ns | 1.01 |
| array/private/copyto!/gpu_to_gpu | 642271 ns | 661417 ns | 0.97 |
| array/private/copyto!/cpu_to_gpu | 819417 ns | 827708 ns | 0.99 |
| array/private/copyto!/gpu_to_cpu | 818854.5 ns | 823833 ns | 0.99 |
| array/private/iteration/findall/int | 1746687.5 ns | 1654645.5 ns | 1.06 |
| array/private/iteration/findall/bool | 1575458 ns | 1502750 ns | 1.05 |
| array/private/iteration/findfirst/int | 1933875 ns | 2023208 ns | 0.96 |
| array/private/iteration/findfirst/bool | 1745458 ns | 1852750 ns | 0.94 |
| array/private/iteration/scalar | 4033375 ns | 5040709 ns | 0.80 |
| array/private/iteration/logical | 2604125 ns | 2707041 ns | 0.96 |
| array/private/iteration/findmin/1d | 1990291 ns | 2059979 ns | 0.97 |
| array/private/iteration/findmin/2d | 1640084 ns | 1638750 ns | 1.00 |
| array/private/copy | 580229.5 ns | 566958.5 ns | 1.02 |
| array/shared/copyto!/gpu_to_gpu | 80708 ns | 79375 ns | 1.02 |
| array/shared/copyto!/cpu_to_gpu | 79708 ns | 81333 ns | 0.98 |
| array/shared/copyto!/gpu_to_cpu | 80000 ns | 78750 ns | 1.02 |
| array/shared/iteration/findall/int | 1761916.5 ns | 1657354 ns | 1.06 |
| array/shared/iteration/findall/bool | 1683500 ns | 1507000 ns | 1.12 |
| array/shared/iteration/findfirst/int | 1539875 ns | 1648125 ns | 0.93 |
| array/shared/iteration/findfirst/bool | 1427729.5 ns | 1429542 ns | 1.00 |
| array/shared/iteration/scalar | 161459 ns | 159083 ns | 1.01 |
| array/shared/iteration/logical | 2442375 ns | 2359208 ns | 1.04 |
| array/shared/iteration/findmin/1d | 1511833 ns | 1598729.5 ns | 0.95 |
| array/shared/iteration/findmin/2d | 1630792 ns | 1642520.5 ns | 0.99 |
| array/shared/copy | 250604 ns | 253958 ns | 0.99 |
| array/permutedims/4d | 2465500 ns | 2460792 ns | 1.00 |
| array/permutedims/2d | 1249208.5 ns | 1249583.5 ns | 1.00 |
| array/permutedims/3d | 1743167 ns | 1743375 ns | 1.00 |
| metal/synchronization/stream | 14875 ns | 14875 ns | 1 |
| metal/synchronization/context | 15708 ns | 15500 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Your PR requires formatting changes to meet the project's style guidelines.
diff --git a/test/runtests.jl b/test/runtests.jl
index 9b6b0c3d..6d16c110 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -11,7 +11,7 @@ if parse(Bool, get(ENV, "BUILDKITE", "false"))
end
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
# Quit without erroring if Metal loaded without issues on unsupported platforms
if !Sys.isapple()
As expected, most accumulate benchmarks show small regressions, with a massive regression when accumulating along the rows of a 3x1000000 matrix.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
- Coverage   80.63%   80.35%   -0.29%
==========================================
  Files          61       60       -1
  Lines        2722     2678      -44
==========================================
- Hits         2195     2152      -43
+ Misses        527      526       -1
I don't see a massive slowdown?
@maleadt The accumulate
Oh OK, I didn't consider 2x a "massive slowdown" :-) Still something to look at, of course, but much less dramatic than the 7x regressions we saw, e.g., against CUDA.jl's reduction.
b296d15 to 84f519a
Opened to run benchmarks.
Todo: