Change iteration order #102

Merged
dannys4 merged 4 commits into JuliaMath:main from wheeheee:patch-1 on Feb 4, 2026

Conversation

@wheeheee
Contributor

@wheeheee wheeheee commented Feb 3, 2026

As I understand it, it's generally slightly more performant to iterate so that the leftmost index varies fastest, since Julia is column-major, although the benefit will be limited here because these views are strided in memory. If there isn't a reason for the current iteration order, this order is a little better.

edit: oops, sticky fingers, accidentally closed this

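To illustrate the point, here is a minimal sketch (a hypothetical helper, not this package's actual loop): in a column-major Array, the leftmost index addresses contiguous memory, so it belongs in the innermost loop.

# Minimal illustration of the loop-order point above; `colmajor_sum` is
# a hypothetical example, not code from this PR.
function colmajor_sum(A::AbstractArray{T,3}) where {T}
    s = zero(T)
    # rightmost index in the outermost loop, leftmost (contiguous) index innermost
    @inbounds for k in axes(A, 3), j in axes(A, 2), i in axes(A, 1)
        s += A[i, j, k]
    end
    return s
end

# e.g. colmajor_sum(randn(4, 32768, 10))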
@codecov

codecov bot commented Feb 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.21%. Comparing base (707282f) to head (de3eb54).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
- Coverage   97.24%   97.21%   -0.03%     
==========================================
  Files           4        4              
  Lines         435      431       -4     
==========================================
- Hits          423      419       -4     
  Misses         12       12              

☔ View full report in Codecov by Sentry.

@wheeheee wheeheee closed this Feb 3, 2026
@wheeheee wheeheee reopened this Feb 3, 2026
@dannys4
Collaborator

dannys4 commented Feb 3, 2026

Any chance you can produce timings before and after the PR for an example, to show the difference locally?

@wheeheee
Contributor Author

wheeheee commented Feb 3, 2026

It's a very contrived example that uses the size of the L2 cache (fitting it in the L1 cache wasn't as significant as expected, probably because of runtime dispatch / allocations), and it doesn't give that great a speedup. I think the best argument for doing this is that it generally should be better, and it would probably become more effective with type stability, and thus less runtime dispatch and fewer allocations, which I tried to improve with an extra function barrier in a new commit.
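For reference, a minimal sketch of the function-barrier idiom mentioned above (hypothetical names, not the code in this PR): when a container's element type isn't concrete, handing each element to an inner function lets Julia specialize the hot work on the runtime type.

# `process` can't be specialized on the element types of `data::Vector{Any}`;
# the call to `kernel` acts as a function barrier, so dynamic dispatch happens
# once per element and the work inside `kernel` runs fully specialized.
function process(data::Vector{Any})
    total = 0.0
    for x in data
        total += kernel(x)
    end
    return total
end

kernel(x) = sum(abs2, x)   # compiled separately for each concrete type of x

# e.g. process(Any[randn(4), randn(Float32, 8)])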

Benchmark on PR
julia> @benchmark mul!(out, p, M) setup=(mdims=(4,32768,10); M=randn(ComplexF64, mdims); out=similar(M); p=plan_fft(out, 2))
BenchmarkTools.Trial: 101 samples with 1 evaluation per sample.
 Range (min … max):  35.603 ms … 46.353 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.979 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.236 ms ±  1.384 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▆▃▄▁▆ ▄▁▁▁▁█ ▆  ▆
  ▄▁▄▇█████▆██████▆█▁▄█▁▄▇▄▆▄▄▁▁▄▄▁▁▄▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▄ ▄
  35.6 ms         Histogram: frequency by time        41.8 ms <

 Memory estimate: 19.38 KiB, allocs estimate: 831.
Benchmark on main
julia> @benchmark mul!(out, p, M) setup=(mdims=(4,32768,10); M=randn(ComplexF64, mdims); out=similar(M); p=plan_fft(out, 2))
BenchmarkTools.Trial: 70 samples with 1 evaluation per sample.
 Range (min … max):  58.159 ms …  64.638 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     58.988 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.194 ms ± 884.073 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃█ ▁ ▃▃   ▁  ▁ ▃▁      ▁
  ▄▄▄▄▄▄▇▁▇██▇█▇██▇▇▁█▄▄█▄██▄▄▇▁▁▄█▇▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄ ▁
  58.2 ms         Histogram: frequency by time         61.2 ms <

 Memory estimate: 19.19 KiB, allocs estimate: 819.

The improvement is outside of the margin of error.

versioninfo
julia> versioninfo()
Julia Version 1.12.4
Commit 01a2eadb04 (2026-01-06 16:56 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, tigerlake)
  GC: Built with stock GC
Threads: 8 default, 1 interactive, 8 GC (on 8 virtual cores)
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_DEPOT_PATH = Q:\.julia;
  JULIA_NUM_THREADS = auto

edit: actually, the first two dimensions of ComplexF64s don't come out to exactly my L2 cache size, because I forgot to account for the Complex part, but it still works, so whatever...
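For reference, a quick back-of-the-envelope check (not taken from the thread): each 4×32768 slice of ComplexF64 in the benchmark above occupies 2 MiB.

julia> 4 * 32768 * sizeof(ComplexF64)   # bytes per 4×32768 ComplexF64 slice
2097152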

@dannys4
Collaborator

dannys4 commented Feb 3, 2026

Nevertheless, it makes sense. I'd bet it's technically even better if you used 16384 (2^14) instead of 32768 (2^15), just because we have a better power-of-four FFT, so the looping/memory-read bottleneck would matter more relative to arithmetic throughput. Obviously that's further contrived; I just wanted to see a bit of evidence that the memory-throughput gain doesn't get washed out by the compiler or something.
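For concreteness, the suggested variant would be the same benchmark call with a power-of-four transform length (a sketch only, not run here):

julia> @benchmark mul!(out, p, M) setup=(mdims=(4,16384,10); M=randn(ComplexF64, mdims); out=similar(M); p=plan_fft(out, 2))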

Collaborator

@dannys4 dannys4 left a comment


Overall, thanks for cleaning up the loops. It reads much better now. I'll approve as soon as the gitignore is cleaned up.

@wheeheee
Contributor Author

wheeheee commented Feb 4, 2026

Although, IMO it's a good idea to leave the / off the Manifest.toml entry, since Manifest.toml files are also produced by running the benchmarks.
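A hedged sketch of the entry being discussed, assuming the / in question is a leading slash (this is not the repo's actual .gitignore):

# Without a leading slash, this pattern matches Manifest.toml in any
# directory, including the ones created by running the benchmarks.
Manifest.toml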

@dannys4
Collaborator

dannys4 commented Feb 4, 2026

Ah I see, didn't consider that. I'll save it for another day.

@dannys4 dannys4 added this pull request to the merge queue Feb 4, 2026
Merged via the queue into JuliaMath:main with commit f4c4fb8 Feb 4, 2026
8 checks passed
@wheeheee wheeheee deleted the patch-1 branch February 4, 2026 03:47
