Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
- Coverage   97.24%   97.21%   -0.03%
==========================================
  Files           4        4
  Lines         435      431       -4
==========================================
- Hits          423      419       -4
  Misses         12       12
```
Any chance you can produce timings before and after the PR for an example, to show the difference locally?
It's a very contrived example, which uses the size of the L2 cache (because fitting it in the L1 cache wasn't as significant as expected, probably because of runtime dispatch / allocations), and it doesn't give that great a speedup. I think the best argument for doing this is that it generally should be better, and it would probably be made more effective with type stability and thus less runtime dispatch and fewer allocations, which I tried to improve with an extra function barrier in a new commit.

Benchmark on PR:

```julia
julia> @benchmark mul!(out, p, M) setup=(mdims=(4,32768,10); M=randn(ComplexF64, mdims); out=similar(M); p=plan_fft(out, 2))
BenchmarkTools.Trial: 101 samples with 1 evaluation per sample.
 Range (min … max):  35.603 ms … 46.353 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.979 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.236 ms ±  1.384 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆▃▄▁▆ ▄▁▁▁▁█ ▆ ▆
  ▄▁▄▇█████▆██████▆█▁▄█▁▄▇▄▆▄▄▁▁▄▄▁▁▄▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▄ ▄
  35.6 ms         Histogram: frequency by time        41.8 ms <

 Memory estimate: 19.38 KiB, allocs estimate: 831.
```

Benchmark on main:

```julia
julia> @benchmark mul!(out, p, M) setup=(mdims=(4,32768,10); M=randn(ComplexF64, mdims); out=similar(M); p=plan_fft(out, 2))
BenchmarkTools.Trial: 70 samples with 1 evaluation per sample.
 Range (min … max):  58.159 ms … 64.638 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     58.988 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.194 ms ± 884.073 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█ ▁ ▃▃ ▁ ▁ ▃▁ ▁
  ▄▄▄▄▄▄▇▁▇██▇█▇██▇▇▁█▄▄█▄██▄▄▇▁▁▄█▇▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄ ▁
  58.2 ms         Histogram: frequency by time        61.2 ms <

 Memory estimate: 19.19 KiB, allocs estimate: 819.
```

The improvement is outside of the margin of error.

versioninfo:

```julia
julia> versioninfo()
Julia Version 1.12.4
Commit 01a2eadb04 (2026-01-06 16:56 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, tigerlake)
  GC: Built with stock GC
Threads: 8 default, 1 interactive, 8 GC (on 8 virtual cores)
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_DEPOT_PATH = Q:\.julia;
  JULIA_NUM_THREADS = auto
```

edit: actually the size of the first two dimensions of
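The "extra function barrier" mentioned above is a standard Julia pattern for containing runtime dispatch. A minimal sketch of the pattern (the names `sum_columns` and `_sum_columns` are illustrative, not from this PR):

```julia
# Function-barrier pattern: when a value's concrete type is only known at
# runtime, do the type-unstable access once in an outer function and hand the
# result to an inner kernel, which the compiler specializes for the concrete
# type it receives.

# Outer function: `data` is a Dict{String,Any}, so `v` has an abstract
# inferred type here and the hot loop would dispatch on every iteration
# if it lived in this function.
function sum_columns(data)
    v = data["matrix"]        # type-unstable access
    return _sum_columns(v)    # barrier: dispatch happens once, here
end

# Inner kernel: `A` is concretely typed inside this function, so the loop
# compiles to specialized code with no per-iteration dispatch or allocation.
function _sum_columns(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)   # column-major order
        s += A[i, j]
    end
    return s
end

data = Dict{String,Any}("matrix" => [1.0 2.0; 3.0 4.0])
sum_columns(data)   # 10.0
```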
Nevertheless, it makes sense. I'd bet it's technically better still if you used 16384 (2^14) instead of 32768 (2^15), just because we have a better power-of-four FFT, so the looping/memory-read bottleneck would matter more relative to arithmetic throughput. Obviously further contrived; I just wanted to see a bit of evidence that the memory throughput doesn't get washed out by the compiler or something.
dannys4 left a comment:
Overall, thanks for cleaning up the loops. It reads much better now. I'll approve as soon as gitignore is cleaned up.
This reverts commit e9410af.
Although imo it's a good idea to leave the
Ah I see, didn't consider that. I'll save it for another day.
As I understand it, it's generally slightly more performant to iterate so that the leftmost index varies fastest, since Julia is column-major, although there's going to be limited benefit here as these views are strided in memory. If there isn't a reason for the current iteration order, this order is a little bit better.
edit: oops, sticky fingers, accidentally closed this
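The column-major point can be illustrated with a small sketch (the function names are illustrative, not from the package):

```julia
# Julia stores arrays column-major: A[1,1], A[2,1], A[3,1], ... are adjacent
# in memory. Making the leftmost index the innermost loop variable walks
# memory sequentially; swapping the loop order strides through memory and
# wastes cache bandwidth on larger arrays.

function fill_fast!(A)
    for j in axes(A, 2), i in axes(A, 1)   # i innermost: contiguous access
        A[i, j] = i + j
    end
    return A
end

function fill_slow!(A)
    for i in axes(A, 1), j in axes(A, 2)   # j innermost: strided access
        A[i, j] = i + j
    end
    return A
end

# Both produce identical results; only the memory access pattern differs.
fill_fast!(zeros(3, 3)) == fill_slow!(zeros(3, 3))   # true
```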