Optimize xoshiro256 for NVIDIA Grace #2540
Conversation
Add compiler hints to enable vectorization. The memory layout change leads to better codegen on Neoverse V2.
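For context, here is a sketch of the kind of layout change meant here. The `idx` helper appears in the diff further down; the exact mapping shown is assumed, not quoted from the PR. The idea is to interleave the state words of the independent generators so the inner loop strides contiguously:

```cpp
#include <cstdint>

static constexpr uint64_t VecResCount = 16; // generators advanced in parallel
static constexpr uint64_t StateSize = 4;    // xoshiro256 has four state words

// Hypothetical indexing: state word `s` of generator `r` lives at
// s * VecResCount + r, so a loop over `r` for fixed `s` touches contiguous
// memory -- the unit-stride pattern the auto-vectorizer wants.
static constexpr uint64_t idx(uint64_t state_idx, uint64_t result_idx) {
  return state_idx * VecResCount + result_idx;
}
```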
I am going to be on vacation during the Christmas period; @rj-jesus is going to watch over this PR.

Perhaps we should work together on this. I'm working on similar performance improvements in #2512.
folly/random/xoshiro256pp.h
Outdated
```cpp
for (uint64_t result_count = 0; result_count < VecResCount; result_count++) {
  for (uint64_t state_count = 0; state_count < StateSize; state_count++) {
    state[idx(state_count, result_count)] = splitmix64(seed_val);
```
Suggested change:

```diff
-for (uint64_t result_count = 0; result_count < VecResCount; result_count++) {
-  for (uint64_t state_count = 0; state_count < StateSize; state_count++) {
-    state[idx(state_count, result_count)] = splitmix64(seed_val);
+for (uint64_t result_idx = 0; result_idx < VecResCount; result_idx++) {
+  for (uint64_t state_idx = 0; state_idx < StateSize; state_idx++) {
+    state[idx(state_idx, result_idx)] = splitmix64(seed_val);
```
```cpp
result_type res[ResultCount];
};
vector_type state[VecResCount][StateSize]{};
static constexpr uint64_t VecResCount = 16;
```
This, plus the elimination of the vector_type choice, means it's actually cutting the buffer size in half on x86_64, which we don't want.
Any idea what the auto-vectorization looks like for this new version on x86_64? IIRC the original was a bit annoying to get right when I first wrote this.
I don't know what autovectorisation looks like, but I believe Lukas had mentioned performance was better on x86_64 as well. I'll run some numbers to check and post back.
Would you rather we keep the old buffer size for x86_64 in any case?
We'll likely need to take another look at the buffer size after the final state of this diff lands. We should probably keep the old buffer size for x86_64 at least for now, since it's mostly going to be dependent on the width of the underlying vectors and how many vector ALU operations the CPU can execute per cycle.
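One way to keep the old x86_64 sizing while using the larger buffer on aarch64 would be a per-architecture constant. A sketch, with the non-aarch64 value shown only as a placeholder since the original sizing isn't quoted in this thread:

```cpp
#include <cstdint>

// Sketch: pick the result-buffer width per architecture. The aarch64 value
// comes from this PR; the x86_64 value below is a placeholder standing in
// for whatever main currently uses.
#if defined(__aarch64__)
static constexpr uint64_t VecResCount = 16; // tuned for Neoverse V2 in this PR
#else
static constexpr uint64_t VecResCount = 8; // placeholder: keep main's sizing
#endif
```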
```cpp
curState[3] = rotl(curState[3], 45);
// By default, the compiler will prefer to unroll the loop completely,
// deactivating vectorization.
#if defined(__clang__)
#pragma clang loop unroll(disable) vectorize_width(8)
```
Which versions of clang and GCC? And this probably will end up being different for different architectures, so may need to be guarded with a check for FOLLY_AARCH64.
Side-stepping the auto vectorizer entirely is part of the reason I wrote the original to use vector types directly, but I was also almost purely focused on x86_64 there, not aarch64.
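A sketch of guarding the hint as suggested, assuming folly's `FOLLY_AARCH64` portability macro (defined to 0 or 1 in `folly/Portability.h`); the loop body here is a placeholder, not the PR's code:

```cpp
#include <cstdint>
#include <folly/Portability.h> // for FOLLY_AARCH64

constexpr uint64_t VecResCount = 16;

void stepAll(uint64_t* acc) {
  // Only emit the vectorization hint on aarch64; elsewhere the compiler
  // keeps its default unroll/vectorize behaviour.
#if defined(__clang__) && FOLLY_AARCH64
#pragma clang loop unroll(disable) vectorize_width(8)
#endif
  for (uint64_t i = 0; i < VecResCount; ++i) {
    acc[i] += i; // placeholder standing in for the xoshiro step
  }
}
```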
This was with Clang 17.0.6 and GCC 13.3.0, I believe. If you'd rather avoid relying on autovectorisation, we should be able to use the Neon implementation Lukas posted above. Just let me know what you'd prefer.
I generally prefer auto-vectorization over explicitly using the intrinsics, since it means less architecture-specific code, but if the NEON version ends up faster, then that's the version we should be using. That said, as long as clang lets you do most of the basic operations on uint64x2_t without explicitly calling the intrinsics (like the prior code was doing for __v4du), it might be possible to get away with not calling the intrinsics at all, which would be far cleaner overall since you end up with far less platform-specific code.
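For what it's worth, clang and GCC do allow plain operators on generic vector types, mirroring how the old code used `__v4du`. A minimal sketch using the `vector_size` extension rather than NEON intrinsics (the alias name is illustrative):

```cpp
#include <cstdint>

// Two 64-bit lanes via the GCC/clang vector extension. Ordinary +, ^, |,
// << and >> operate lane-wise, so no arm_neon.h intrinsics are needed.
using u64x2 = uint64_t __attribute__((vector_size(16)));

inline u64x2 rotl(u64x2 x, int k) {
  return (x << k) | (x >> (64 - k));
}

inline u64x2 output(u64x2 a, u64x2 b) {
  return rotl(a + b, 23) + a; // the xoshiro256++ output transform, lane-wise
}
```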
Hi @kielfriedt, I think that's a good idea. I seem to see lower performance than expected with #2512. FWIW, just decreasing
I added your code changes to mine and got a small speed-up. I'm not seeing any negative impact on latest-gen x86. Testing:

Neoverse V2
- code from #2512:
- combined code using 4-pipeline switch:
- 4-pipeline V2 performance uplift:

Neoverse N2
- code from #2512: xoshiro256 2.13ns 469.39M
- combined code using 4-pipeline switch:
- 4-pipeline N2 performance uplift:
- combined code using 2-pipeline switch:
- 2-pipeline N2 performance uplift:

Intel Emerald Rapids
- original code, no modification:
- combined code:
- performance uplift:
```cpp
#elif defined(__GNUC__)
#pragma GCC unroll 4
#endif
for (int i = 0; i < VecResCount; ++i) {
```
We have -Wsign-compare enabled as an error internally, so this currently causes an error. Swapping i to unsigned fixes it.
Suggested change:

```diff
-for (int i = 0; i < VecResCount; ++i) {
+for (uint64_t i = 0; i < VecResCount; ++i) {
```
Hi @kielfriedt, what compiler version and flags are you using to build the sources? As I mentioned above, I see significantly better performance with upstream sources, but performance seems to be highly dependent on inlining decisions.
This PR optimizes the xoshiro256 implementation for AArch64 architectures, addressing performance bottlenecks and undefined behavior in the current code. We observed a ~2-3x performance increase in benchmarks by adjusting the memory layout, ensuring strict-aliasing compliance, and tuning the buffer sizing.
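As an illustration of the strict-aliasing point, here is a sketch under assumed names (not the PR's exact code): the well-defined way to move vector lanes into the scalar result buffer is `memcpy` rather than a `reinterpret_cast`.

```cpp
#include <cstdint>
#include <cstring>

// Two 64-bit lanes via the GCC/clang vector extension (alias name assumed).
using u64x2 = uint64_t __attribute__((vector_size(16)));

// Sketch: type-punning through memcpy has defined behaviour, unlike
// reinterpret_casting the buffer, and compilers lower it to a plain
// vector store.
inline void storeLanes(uint64_t* out, u64x2 v) {
  std::memcpy(out, &v, sizeof(v));
}
```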
Changes:
Benchmark Analysis
We achieve a speedup of 2X for xoshiro256_64 and 3X for xoshiro256 on NVIDIA Grace with Clang 17:
On this branch:
And on main:
These benchmarks evaluate the next() function and are highly unstable and heavily dependent on compiler inlining decisions and compiler version. To ensure reliable data that isolates the algorithm's performance, we benchmarked the calc() function with inlining explicitly disabled.
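A sketch of the inlining control described here; the attribute is standard clang/GCC, while `calc` and the surrounding type are assumed names for illustration:

```cpp
#include <cstdint>

struct Xoshiro256Bench {
  uint64_t state[4]{1, 2, 3, 4}; // toy seed, just for the sketch

  // noinline keeps the benchmark measuring calc() itself rather than
  // whatever the inliner decides to do at each call site.
  __attribute__((noinline)) uint64_t calc() {
    return state[0] + state[3]; // stand-in for the real generator step
  }
};
```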
As shown in the chart, the optimized auto-vectorized version (Buffer Size 16) achieves significantly higher throughput on Grace processors compared to the current implementation.
Implementation Choice:
We opted for an auto-vectorization approach using pragmas. This method yields performance equivalent to manual intrinsics while keeping the codebase cleaner, more maintainable, and portable for future hardware iterations. In addition, we can use one codebase for all CPUs.
Alternative Implementation:
For reference, we have drafted a version using manual NEON intrinsics. While we recommend the autovec version, the intrinsic version is included below.
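The drafted intrinsic version itself is not reproduced here. Purely for illustration (not the authors' drafted code), a minimal sketch of one xoshiro256++ step over two lanes with NEON intrinsics might look like this:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Lane-wise rotate-left; K must be a compile-time constant because the
// NEON shift-by-immediate intrinsics require immediates.
template <int K>
inline uint64x2_t rotl(uint64x2_t v) {
  return vorrq_u64(vshlq_n_u64(v, K), vshrq_n_u64(v, 64 - K));
}

// One xoshiro256++ step for two independent generators packed into s[0..3].
inline uint64x2_t next2(uint64x2_t s[4]) {
  uint64x2_t result = vaddq_u64(rotl<23>(vaddq_u64(s[0], s[3])), s[0]);
  uint64x2_t t = vshlq_n_u64(s[1], 17);
  s[2] = veorq_u64(s[2], s[0]);
  s[3] = veorq_u64(s[3], s[1]);
  s[1] = veorq_u64(s[1], s[2]);
  s[0] = veorq_u64(s[0], s[3]);
  s[2] = veorq_u64(s[2], t);
  s[3] = rotl<45>(s[3]);
  return result;
}
```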
Benchmark results for Grace at 3.2 GHz. The y-axis shows iterations of calc per second, multiplied by buffer size (equivalent to how many numbers we can compute per second). This is with inlining of calc explicitly disabled.
Versions are the following: