
Conversation

@krenzland
Contributor

This PR optimizes the xoshiro256 implementation for AArch64 architectures, addressing performance bottlenecks and undefined behavior in the current code. We observed a ~2-3x performance increase in benchmarks by adjusting the memory layout, making the code strict-aliasing compliant, and increasing the buffer size.

Changes:

  • Flat Memory Layout (1D vs 2D): We switched from a 2D array to a flattened 1D array. A flat memory layout is essential for Clang to reliably perform auto-vectorization, which fails with the original multi-dimensional structure.
  • Strict Aliasing Safety (Memcpy vs Union): We replaced the use of union for type punning with memcpy. Using unions for type punning invokes undefined behavior under C++ strict aliasing rules. memcpy is standard-compliant, safe, and generates identical machine code (a short sketch follows this list).
  • Increased Buffer Size (8 to 16): We increased the buffer size to 16. This aligns with the strategy used in the AVX implementation and significantly improves throughput.
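To make the second bullet concrete, here is a minimal, hypothetical sketch of the memcpy pattern; store_vector, VectorT, and ResultT are illustrative names, not the PR's actual code:

#include <cstdint>
#include <cstring>

// Hypothetical helper: copy a SIMD register's lanes into the result buffer
// with memcpy instead of reading them back through a punning union. For
// trivially copyable types the copy is well-defined, and compilers lower it
// to an ordinary vector store, so no performance is lost.
template <typename VectorT, typename ResultT>
inline void store_vector(const VectorT& v, ResultT* out) {
  static_assert(
      sizeof(VectorT) % sizeof(ResultT) == 0, "result type must divide vector size");
  std::memcpy(out, &v, sizeof(VectorT));
}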

Benchmark Analysis
We achieve a speedup of 2x for xoshiro256_64 and 3x for xoshiro256 on NVIDIA Grace with Clang 17:
On this branch:

$ numactl --cpunodebind=0 --membind=0 --physcpubind=0 ./random_benchmark
xoshiro256                                                312.98ps     3.20G
xoshiro256_64                                             625.72ps     1.60G

and on main

$ numactl --cpunodebind=0 --membind=0 --physcpubind=0 ./random_benchmark
xoshiro256                                                988.74ps     1.01G
xoshiro256_64                                               1.22ns   818.56M

These benchmarks evaluate the next() function and are highly unstable, depending heavily on compiler inlining decisions and the compiler version. To ensure reliable data that isolates the algorithm's performance, we benchmarked the calc() function with inlining explicitly disabled.
As shown in the chart below, the optimized auto-vectorized version (buffer size 16) achieves significantly higher throughput on Grace processors compared to the current implementation.
Implementation Choice:
We opted for an auto-vectorization approach using pragmas. This method yields performance equivalent to manual intrinsics while keeping the codebase cleaner, more maintainable, and portable for future hardware iterations. In addition, we can use one codebase for all CPUs.
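As a rough sketch of the idea (the names, the flat indexing scheme, and the exact pragma placement are assumptions based on the diff excerpts further down, not the verbatim PR code), the flat layout plus a Clang loop pragma looks roughly like this:

#include <cstdint>

constexpr int kStreams = 16; // buffer size from the PR description
uint64_t state[4 * kStreams]; // flat layout: state word j of stream i at [j * kStreams + i]
uint64_t res[kStreams];

inline uint64_t rotl(uint64_t x, int k) {
  return (x << k) | (x >> (64 - k));
}

void calc_autovec() {
#if defined(__clang__)
#pragma clang loop unroll(disable) vectorize_width(8)
#endif
  for (int i = 0; i < kStreams; ++i) {
    uint64_t s0 = state[0 * kStreams + i], s1 = state[1 * kStreams + i];
    uint64_t s2 = state[2 * kStreams + i], s3 = state[3 * kStreams + i];
    // xoshiro256++ output and state update for stream i
    res[i] = rotl(s0 + s3, 23) + s0;
    const uint64_t t = s1 << 17;
    s2 ^= s0;
    s3 ^= s1;
    s1 ^= s2;
    s0 ^= s3;
    s2 ^= t;
    s3 = rotl(s3, 45);
    state[0 * kStreams + i] = s0;
    state[1 * kStreams + i] = s1;
    state[2 * kStreams + i] = s2;
    state[3 * kStreams + i] = s3;
  }
}

With the state words of all streams stored contiguously, each statement in the loop body touches consecutive memory, which is what lets the vectorizer turn the whole body into straight-line SIMD code.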
Alternative Implementation:
For reference, we have drafted a version using manual NEON intrinsics. While we recommend the autovec version, the intrinsic version is included below.

void calc() noexcept {
  // Process 2 results at a time using NEON
  for (uint64_t i = 0; i < VecResCount; i += 2) {
    // Load state vectors for 2 consecutive results
    uint64x2_t s0 = vld1q_u64(&state[idx(0, i)]);
    uint64x2_t s1 = vld1q_u64(&state[idx(1, i)]);
    uint64x2_t s2 = vld1q_u64(&state[idx(2, i)]);
    uint64x2_t s3 = vld1q_u64(&state[idx(3, i)]);

    // Calculate result: res[i] = rotl(s0 + s3, 23) + s0
    uint64x2_t sum = vaddq_u64(s0, s3);
    uint64x2_t rotl23_sum = veorq_u64(vshlq_n_u64(sum, 23), vshrq_n_u64(sum, 41));
    uint64x2_t result = vaddq_u64(rotl23_sum, s0);
    // Memcpy in case vector_type != result_type.
    std::memcpy(&res[i * size_ratio], &result, sizeof(result));

    uint64x2_t t = vshlq_n_u64(s1, 17);

    s2 = veorq_u64(s2, s0);
    s3 = veorq_u64(s3, s1);
    s1 = veorq_u64(s1, s2);
    s0 = veorq_u64(s0, s3);
    s2 = veorq_u64(s2, t);
    s3 = veorq_u64(vshlq_n_u64(s3, 45), vshrq_n_u64(s3, 19));

    // Store updated state vectors
    vst1q_u64(&state[idx(0, i)], s0);
    vst1q_u64(&state[idx(1, i)], s1);
    vst1q_u64(&state[idx(2, i)], s2);
    vst1q_u64(&state[idx(3, i)], s3);
  }

  cur = 0;
}

Benchmark results for Grace at 3.2 GHz. The y-axis shows iterations of calc() per second multiplied by the buffer size (equivalent to how many numbers we can compute per second). Inlining of calc() is explicitly disabled.
The versions are as follows:

  • neon: Neon intrinsics + flat layout
  • autovec: Auto vectorization (with manual pragmas) + flat layout
  • no suffix: Version currently in Folly
[chart: calc() throughput for the neon, autovec, and current Folly versions]

  • Add compiler hints to enable vectorization
  • Memory layout change leads to better codegen on Neoverse V2.
meta-cla bot added the CLA Signed label Dec 9, 2025
@krenzland
Contributor Author

I am going to be on vacation during the Christmas period; @rj-jesus is going to watch over this PR.

@kielfriedt

Perhaps we should work together on this. I'm working on similar performance improvements in #2512

@meta-codesync

meta-codesync bot commented Dec 9, 2025

@Orvid has imported this pull request. If you are a Meta employee, you can view this in D88781116.

Comment on lines 73 to 75
for (uint64_t result_count = 0; result_count < VecResCount; result_count++) {
  for (uint64_t state_count = 0; state_count < StateSize; state_count++) {
    state[idx(state_count, result_count)] = splitmix64(seed_val);
Contributor

Suggested change
- for (uint64_t result_count = 0; result_count < VecResCount; result_count++) {
-   for (uint64_t state_count = 0; state_count < StateSize; state_count++) {
-     state[idx(state_count, result_count)] = splitmix64(seed_val);
+ for (uint64_t result_idx = 0; result_idx < VecResCount; result_idx++) {
+   for (uint64_t state_idx = 0; state_idx < StateSize; state_idx++) {
+     state[idx(state_idx, result_idx)] = splitmix64(seed_val);

result_type res[ResultCount];
};
vector_type state[VecResCount][StateSize]{};
static constexpr uint64_t VecResCount = 16;
Contributor

This, plus the elimination of the vector_type choice, means it's actually cutting the buffer size in half under x86_64, which we don't want.

Any idea what the auto-vectorization looks like for this new version on x86_64? IIRC the original was a bit annoying to get right when I first wrote this.
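For a concrete sense of the sizes involved (back-of-the-envelope only, using the constants visible in the code excerpts in this thread with result_type = uint64_t; the new layout is not fully shown here):

// __v4du (AVX2) packs 4 uint64_t lanes; uint64x2_t (NEON) packs 2.
// ResultCount = VecResCount * sizeof(vector_type) / sizeof(result_type), so:
static_assert(8 * 4 == 32, "x86_64: 8 __v4du vectors buffer 32 uint64_t results");
static_assert(16 * 2 == 32, "AArch64: 16 NEON vectors buffer 32 uint64_t results");
// A flat buffer of 16 plain uint64_t would hold only 16 results, i.e. half of
// the 32 the x86_64 build currently buffers, which is the concern raised above.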

Contributor

I don't know what autovectorisation looks like, but I believe Lukas had mentioned performance was better on x86_64 as well. I'll run some numbers to check and post back.

Would you rather we keep the old buffer size for x86_64 in any case?

Contributor

We'll likely need to take another look at the buffer size after the final state of this diff lands. We should probably keep the old buffer size for x86_64 at least for now, since it's mostly going to depend on the width of the underlying vectors and how many vector ALU operations the CPU can execute per cycle.

curState[3] = rotl(curState[3], 45);
// By default, the compiler will prefer to unroll the loop completely, deactivating vectorization.
#if defined(__clang__)
#pragma clang loop unroll(disable) vectorize_width(8)
Contributor

Which versions of clang and GCC? And this probably will end up being different for different architectures, so may need to be guarded with a check for FOLLY_AARCH64.

Side-stepping the auto vectorizer entirely is part of the reason I wrote the original to use vector types directly, but I was also almost purely focused on x86_64 there, not aarch64.

Contributor

This was with Clang 17.0.6 and GCC 13.3.0, I believe. If you'd rather avoid relying on autovectorisation, we should be able to use the Neon implementation Lukas posted above. Just let me know what you'd prefer.

Contributor

I generally prefer auto-vectorization over explicitly using the intrinsics, since it means less architecture-specific code, but if the NEON version ends up faster, then that's the version we should be using. That said, as long as clang lets you do most of the basic operations on uint64x2_t without explicitly calling the intrinsics (like the prior code was doing for __v4du), then it might be possible to get away with not explicitly calling the intrinsics, which would be far cleaner overall since you end up with far less platform-specific code.
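For illustration, a sketch of what that could look like, assuming the compiler accepts the usual arithmetic, shift, and bitwise operators on uint64x2_t (Clang and GCC generally do via their vector extensions); this is not code from the PR:

#include <arm_neon.h>
#include <cstdint>

// Rotate each 64-bit lane left by k using plain operators on the NEON type.
static inline uint64x2_t rotl_v(uint64x2_t x, int k) {
  return (x << k) | (x >> (64 - k));
}

// One xoshiro256++ step on two interleaved generators, written without
// explicit vaddq/veorq/vshlq intrinsics.
static inline uint64x2_t step(uint64x2_t s[4]) {
  const uint64x2_t out = rotl_v(s[0] + s[3], 23) + s[0];
  const uint64x2_t t = s[1] << 17;
  s[2] ^= s[0];
  s[3] ^= s[1];
  s[1] ^= s[2];
  s[0] ^= s[3];
  s[2] ^= t;
  s[3] = rotl_v(s[3], 45);
  return out;
}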

@rj-jesus
Contributor

> Perhaps we should work together on this. I'm working on similar performance improvements in #2512

Hi @kielfriedt, I think that's a good idea. I seem to see lower performance than expected with #2512:

xoshiro256                                                  1.65ns   606.19M
xoshiro256_64                                               1.84ns   542.15M

FWIW, just decreasing VecResCount from 8 to 4 seems to give a decent speedup on main (with Clang 17.0.6, compared to the numbers Lukas posted above):

xoshiro256                                                705.22ps     1.42G
xoshiro256_64                                             843.17ps     1.19G

@kielfriedt

kielfriedt commented Dec 12, 2025

I added your code changes to mine and got a small speedup. I'm not seeing any negative impact on the latest-gen x86.

testing

Neoverse V2

code from #2512:
xoshiro256 2.97ns 337.18M
xoshiro256_64 2.46ns 406.95M

combined code using 4 pipeline switch:
xoshiro256 1.81ns 553.01M
xoshiro256_64 1.56ns 641.65M

4 pipelines V2 performance uplift:
xoshiro256 64%
xoshiro256_64 57%

Neoverse N2

code from #2512:

xoshiro256 2.13ns 469.39M
xoshiro256_64 2.33ns 429.14M

combined code using 4 pipeline switch:
xoshiro256 2.08ns 481.69M
xoshiro256_64 2.29ns 435.99M

4 pipelines N2 performance uplift:
xoshiro256 2%
xoshiro256_64 2%

combined code using 2 pipeline switch:
xoshiro256 2.06ns 486.19M
xoshiro256_64 2.31ns 433.31M

2 pipelines N2 performance uplift:
xoshiro256 4%
xoshiro256_64 0%

Intel Emerald Rapids

Original code, no modification:
xoshiro256 892.43ps 1.12G
xoshiro256_64 1.67ns 598.07M

combined code:
xoshiro256 837.63ps 1.19G
xoshiro256_64 1.25ns 800.62M

performance uplift:
xoshiro256 6%
xoshiro256_64 33%

/*
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#pragma once

#include <array>
#include <cstdint>
#include <limits>
#include <ostream>
#include <random>

#include <folly/Likely.h>
#include <folly/Portability.h>

#if FOLLY_X64
#include <immintrin.h>
#endif

#if defined(__aarch64__)
#include <arm_neon.h>
// Allow selecting pipelines at compile time; default to 4 on AArch64.
#ifndef XOSHIRO_PIPELINES
#define XOSHIRO_PIPELINES 4
#endif
static_assert(
    XOSHIRO_PIPELINES == 2 || XOSHIRO_PIPELINES == 4,
    "XOSHIRO_PIPELINES must be 2 or 4 on AArch64");
#endif

namespace folly {

template <typename ResType>
class xoshiro256pp {
 public:
  using result_type = ResType;
  static constexpr result_type default_seed =
      static_cast<result_type>(0x8690c864c6e0b716ULL);

  // While this is not the actual size of the state, it is the size of the input
  // seed that we allow. Any uses of a larger state in the form of a seed_seq
  // will be ignored after the first small part of it.
  static constexpr size_t state_size = sizeof(uint64_t) / sizeof(result_type);

  static_assert(
      std::is_integral_v<result_type>, "ResType must be an integral type.");
  static_assert(
      std::is_unsigned_v<result_type>, "ResType must be an unsigned type.");

  xoshiro256pp(uint64_t pSeed = default_seed) noexcept : state{} {
    seed(pSeed);
  }

  explicit xoshiro256pp(std::seed_seq& seq) noexcept {
    uint64_t val{};
    seq.generate(&val, &val + 1);
    seed(val);
  }

  result_type operator()() noexcept { return next(); }
  static constexpr result_type min() noexcept {
    return std::numeric_limits<result_type>::min();
  }
  static constexpr result_type max() noexcept {
    return std::numeric_limits<result_type>::max();
  }

  void seed(uint64_t pSeed = default_seed) noexcept {
    uint64_t seedv = pSeed;
    for (uint64_t re = 0; re < VecResCount; re++) {
      for (uint64_t stat = 0; stat < StateSize; stat++) {
        state[re][stat] = seed_vec<vector_type>(seedv);
      }
    }
    cur = ResultCount;
  }

  void seed(std::seed_seq& seq) noexcept {
    std::array<uint64_t, 1> seeds{};
    seq.generate(seeds.begin(), seeds.end());
    seed(seeds[0]);
  }

 private:
#if defined(__AVX2__) && defined(__GNUC__)
  using vector_type = __v4du;        // x86_64 (GCC) vector of 4x u64
#elif defined(__aarch64__)
  using vector_type = uint64x2_t;    // AArch64 NEON vector of 2x u64
#else
  using vector_type = uint64_t;      // Portable scalar
#endif

  static constexpr uint64_t StateSize = 4;

#if defined(__aarch64__)
  static constexpr uint64_t VecResCount = XOSHIRO_PIPELINES; // 2 or 4
#else
  static constexpr uint64_t VecResCount = 8;
#endif

  static constexpr uint64_t ResultCount =
      VecResCount * (sizeof(vector_type) / sizeof(result_type));

  union {
    vector_type vecRes[VecResCount]{};
    result_type res[ResultCount];
  };
  vector_type state[VecResCount][StateSize]{};
  uint64_t cur = ResultCount;

  template <typename Size, typename CharT, typename Traits>
  friend std::basic_ostream<CharT, Traits>& operator<<(
      std::basic_ostream<CharT, Traits>& os, const xoshiro256pp<Size>& rng);

  template <typename T>
  static inline T seed_vec(uint64_t& seed) {
    if constexpr (sizeof(T) != sizeof(uint64_t)) {
      T sbase{};
      for (uint64_t i = 0; i < sizeof(T) / sizeof(uint64_t); i++) {
        sbase[i] = splitmix64(seed);
      }
      return sbase;
    } else {
      return T(splitmix64(seed));
    }
  }

  static inline uint64_t splitmix64(uint64_t& cur) noexcept {
    uint64_t z = (cur += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
  }

  // Scalar / generic rotate (works for scalar and GCC vector types)
  FOLLY_ALWAYS_INLINE static vector_type rotl(
      const vector_type x, int k) noexcept {
    return (x << k) | (x >> (64 - k));
  }

#if defined(__aarch64__)
  // AArch64-specific helpers

  // Fixed rotates used in the algorithm: 23 and 45 (use immediate-shift intrinsics)
  FOLLY_ALWAYS_INLINE static uint64x2_t rotl23(uint64x2_t x) noexcept {
    // (x << 23) | (x >> 41)
    return vorrq_u64(vshlq_n_u64(x, 23), vshrq_n_u64(x, 41));
  }
  FOLLY_ALWAYS_INLINE static uint64x2_t rotl45(uint64x2_t x) noexcept {
    // (x << 45) | (x >> 19)
    return vorrq_u64(vshlq_n_u64(x, 45), vshrq_n_u64(x, 19));
  }

  // Variable rotate helper (no immediate shifts required).
  // Note: vxarq_u64 (SHA3) takes the rotate amount as a compile-time immediate,
  // so it cannot be used with a runtime k; the variable-count vshlq_u64 path
  // below works unconditionally.
  FOLLY_ALWAYS_INLINE static uint64x2_t rotl_var(uint64x2_t x, int k) noexcept {
    const int s = k & 63;
    if (s == 0) {
      return x;
    }
    // vshlq_u64 uses signed counts: +n = LSL n, -n = LSR n (logical for unsigned)
    const int64x2_t shl = vdupq_n_s64(s);
    const int64x2_t shr = vdupq_n_s64(-(64 - s));
    return vorrq_u64(vshlq_u64(x, shl), vshlq_u64(x, shr));
  }
#endif // __aarch64__

  void calc() noexcept {
#if defined(__aarch64__)
#if XOSHIRO_PIPELINES == 4
    // Process 4 streams per iteration
    for (uint64_t i = 0; i < VecResCount; i += 4) {
      auto& s0 = state[i + 0];
      auto& s1 = state[i + 1];
      auto& s2 = state[i + 2];
      auto& s3 = state[i + 3];

      uint64x2_t s0_0 = s0[0], s0_1 = s0[1], s0_2 = s0[2], s0_3 = s0[3];
      uint64x2_t s1_0 = s1[0], s1_1 = s1[1], s1_2 = s1[2], s1_3 = s1[3];
      uint64x2_t s2_0 = s2[0], s2_1 = s2[1], s2_2 = s2[2], s2_3 = s2[3];
      uint64x2_t s3_0 = s3[0], s3_1 = s3[1], s3_2 = s3[2], s3_3 = s3[3];

      // output = rotl(s0 + s3, 23) + s0  (and same for the other lanes)
      vecRes[i + 0] = vaddq_u64(rotl23(vaddq_u64(s0_0, s0_3)), s0_0);
      vecRes[i + 1] = vaddq_u64(rotl23(vaddq_u64(s1_0, s1_3)), s1_0);
      vecRes[i + 2] = vaddq_u64(rotl23(vaddq_u64(s2_0, s2_3)), s2_0);
      vecRes[i + 3] = vaddq_u64(rotl23(vaddq_u64(s3_0, s3_3)), s3_0);

      // t = s1 << 17
      uint64x2_t t0 = vshlq_n_u64(s0_1, 17);
      uint64x2_t t1 = vshlq_n_u64(s1_1, 17);
      uint64x2_t t2 = vshlq_n_u64(s2_1, 17);
      uint64x2_t t3 = vshlq_n_u64(s3_1, 17);

      s0_2 = veorq_u64(s0_2, s0_0);
      s0_3 = veorq_u64(s0_3, s0_1);
      s0_1 = veorq_u64(s0_1, s0_2);
      s0_0 = veorq_u64(s0_0, s0_3);
      s0_2 = veorq_u64(s0_2, t0);
      s0_3 = rotl45(s0_3);

      s1_2 = veorq_u64(s1_2, s1_0);
      s1_3 = veorq_u64(s1_3, s1_1);
      s1_1 = veorq_u64(s1_1, s1_2);
      s1_0 = veorq_u64(s1_0, s1_3);
      s1_2 = veorq_u64(s1_2, t1);
      s1_3 = rotl45(s1_3);

      s2_2 = veorq_u64(s2_2, s2_0);
      s2_3 = veorq_u64(s2_3, s2_1);
      s2_1 = veorq_u64(s2_1, s2_2);
      s2_0 = veorq_u64(s2_0, s2_3);
      s2_2 = veorq_u64(s2_2, t2);
      s2_3 = rotl45(s2_3);

      s3_2 = veorq_u64(s3_2, s3_0);
      s3_3 = veorq_u64(s3_3, s3_1);
      s3_1 = veorq_u64(s3_1, s3_2);
      s3_0 = veorq_u64(s3_0, s3_3);
      s3_2 = veorq_u64(s3_2, t3);
      s3_3 = rotl45(s3_3);

      s0[0] = s0_0; s0[1] = s0_1; s0[2] = s0_2; s0[3] = s0_3;
      s1[0] = s1_0; s1[1] = s1_1; s1[2] = s1_2; s1[3] = s1_3;
      s2[0] = s2_0; s2[1] = s2_1; s2[2] = s2_2; s2[3] = s2_3;
      s3[0] = s3_0; s3[1] = s3_1; s3[2] = s3_2; s3[3] = s3_3;
    }
    cur = 0;
#elif XOSHIRO_PIPELINES == 2
    // Process 2 streams per iteration
    for (uint64_t i = 0; i < VecResCount; i += 2) {
      auto& s0 = state[i + 0];
      auto& s1 = state[i + 1];

      uint64x2_t s0_0 = s0[0], s0_1 = s0[1], s0_2 = s0[2], s0_3 = s0[3];
      uint64x2_t s1_0 = s1[0], s1_1 = s1[1], s1_2 = s1[2], s1_3 = s1[3];

      vecRes[i + 0] = vaddq_u64(rotl23(vaddq_u64(s0_0, s0_3)), s0_0);
      vecRes[i + 1] = vaddq_u64(rotl23(vaddq_u64(s1_0, s1_3)), s1_0);

      uint64x2_t t0 = vshlq_n_u64(s0_1, 17);
      s0_2 = veorq_u64(s0_2, s0_0);
      s0_3 = veorq_u64(s0_3, s0_1);
      s0_1 = veorq_u64(s0_1, s0_2);
      s0_0 = veorq_u64(s0_0, s0_3);
      s0_2 = veorq_u64(s0_2, t0);
      s0_3 = rotl45(s0_3);

      uint64x2_t t1 = vshlq_n_u64(s1_1, 17);
      s1_2 = veorq_u64(s1_2, s1_0);
      s1_3 = veorq_u64(s1_3, s1_1);
      s1_1 = veorq_u64(s1_1, s1_2);
      s1_0 = veorq_u64(s1_0, s1_3);
      s1_2 = veorq_u64(s1_2, t1);
      s1_3 = rotl45(s1_3);

      s0[0] = s0_0; s0[1] = s0_1; s0[2] = s0_2; s0[3] = s0_3;
      s1[0] = s1_0; s1[1] = s1_1; s1[2] = s1_2; s1[3] = s1_3;
    }
    cur = 0;
#else
# error "XOSHIRO_PIPELINES must be 2 or 4"
#endif // XOSHIRO_PIPELINES
#else
    // Portable / x86 path (unchanged)
    for (uint64_t i = 0; i < VecResCount; i++) {
      auto& curState = state[i];
      vecRes[i] = rotl(curState[0] + curState[3], 23) + curState[0];
      const auto t = curState[1] << 17;
      curState[2] ^= curState[0];
      curState[3] ^= curState[1];
      curState[1] ^= curState[2];
      curState[0] ^= curState[3];
      curState[2] ^= t;
      curState[3] = rotl(curState[3], 45);
    }
    cur = 0;
#endif
  }

  FOLLY_ALWAYS_INLINE result_type next() noexcept {
    if (FOLLY_UNLIKELY(cur == ResultCount)) {
      calc();
    }
    return res[cur++];
  }
};

template <typename Size, typename CharT, typename Traits>
std::basic_ostream<CharT, Traits>& operator<<(
    std::basic_ostream<CharT, Traits>& os, const xoshiro256pp<Size>& rng) {
  for (auto i2 : rng.res) {
    os << i2 << " ";
  }
  os << "cur: " << rng.cur;
  return os;
}

using xoshiro256pp_32 = xoshiro256pp<uint32_t>;
using xoshiro256pp_64 = xoshiro256pp<uint64_t>;

} // namespace folly
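To round out the picture, a hypothetical caller of the header above; the file name, include path, and presence of folly's headers on the include path are assumptions, and XOSHIRO_PIPELINES can be overridden on the compiler command line (e.g. -DXOSHIRO_PIPELINES=2) before the header is included:

#include <cstdint>
#include <iostream>
#include <random>

#include "xoshiro256pp.h" // assumed filename for the header pasted above

int main() {
  folly::xoshiro256pp_64 rng(12345); // fixed seed for reproducibility
  // The class keeps the standard URBG interface, so <random> distributions work.
  std::uniform_int_distribution<uint64_t> dist(0, 99);
  uint64_t sum = 0;
  for (int i = 0; i < 1000000; ++i) {
    sum += dist(rng);
  }
  std::cout << "sum = " << sum << '\n';
  return 0;
}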

#elif defined(__GNUC__)
#pragma GCC unroll 4
#endif
for (int i = 0; i < VecResCount; ++i) {
Contributor

We have -Wsign-compare enabled as an error internally, so this currently causes an error. Swapping i to unsigned fixes it.

Suggested change
- for (int i = 0; i < VecResCount; ++i) {
+ for (uint64_t i = 0; i < VecResCount; ++i) {


@rj-jesus
Contributor

> I added your code changes to mine and got a small speedup. I'm not seeing any negative impact on the latest-gen x86.
>
> testing
>
> Neoverse V2
>
> code from #2512:
> xoshiro256 2.97ns 337.18M
> xoshiro256_64 2.46ns 406.95M
>
> combined code using 4 pipeline switch:
> xoshiro256 1.81ns 553.01M
> xoshiro256_64 1.56ns 641.65M
>
> 4 pipelines V2 performance uplift:
> xoshiro256 64%
> xoshiro256_64 57%

Hi @kielfriedt, what compiler version and flags are you using to build the sources? As I mentioned above, I see significantly better performance with the upstream sources, but performance seems to be highly dependent on inlining decisions.
