Merging of RISC-V V1.0 support. #377
base: master
Conversation
This is a cleaned-up version of the riscv-v branch from rdolbeau/fftw3. There is no runtime scalability; instead, multiple sets of masked operations support every width that is a multiple of 128 bits, up to 16384 bits so as to cover the EPAC VPU design. The code now uses the official V1.0 intrinsics and was tested with gcc-14 and clang-18. It can be tested on QEMU, or on hardware with V1.0 support; it was thoroughly tested on a BananaPi F3 with a SpacemiT K1 SoC, which supports V1.0 with 256-bit registers.
This contains both the "standard" implementation (real and imaginary parts stored interleaved in a single vector register) and the "split" implementation (real and imaginary parts stored in two different registers). However, only the "standard" one is used for now; testing the split version requires the manual step of copying the file over. In theory, both could be enabled at the same time, but that would be a lot of codelets. The "split" version is very register-intensive and doesn't play well with larger codelets, hence the choice of "standard" by default. Feedback welcome.
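For readers unfamiliar with the two layouts, here is a minimal illustration (not code from this PR; it assumes double precision, LMUL=1 and the standard v1.0 intrinsics, and the PR's own header may load data differently) of how the same interleaved complex array could be brought into registers in each scheme:

#include <riscv_vector.h>

/* Illustration only: FFTW keeps complex data interleaved in memory as
 * r0, i0, r1, i1, ...  The "standard" kernels keep that interleaving in a
 * single register; the "split" kernels separate real and imaginary parts
 * into two registers (shown here with strided loads). */
static void load_both_layouts(const double *x, size_t vl)
{
    /* "standard": one register holds r0, i0, r1, i1, ... */
    vfloat64m1_t interleaved = __riscv_vle64_v_f64m1(x, 2 * vl);

    /* "split": two registers hold r0, r1, ... and i0, i1, ...
       (byte stride of one complex element = 2 * sizeof(double)) */
    vfloat64m1_t re = __riscv_vlse64_v_f64m1(x,     2 * sizeof(double), vl);
    vfloat64m1_t im = __riscv_vlse64_v_f64m1(x + 1, 2 * sizeof(double), vl);

    (void)interleaved; (void)re; (void)im;
}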
Test release (with all codelets pre-generated) available at https://github.com/rdolbeau/fftw3/releases/tag/r5v-test-release-005
Great work! This branch currently uses "r5v" as its arch tag. Might "rvv" be the more commonly used name? Or, following the "V" specification, would Zvl128b/256b... be more official? I notice that the arch tags in OpenBLAS use Zvl128b/256b.
return TYPE(vrgather_vv)(x, hidx, 2*VL);
}

#define VNEG(a) TYPE(vfsgnjn_vv)(a,a,2*VL)
Could be simplified to vfneg_v.
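For reference, a sketch of the suggested change, keeping the PR's TYPE() wrapper macro (assuming it pastes the type suffix for vfneg_v the same way it does for vfsgnjn_vv). vfneg is specified as a pseudo-instruction for vfsgnjn vd, vs, vs, so the generated code should be identical, just easier to read:

/* Possible simplification (sketch, not code from the PR): vfneg_v is the
 * v1.0 pseudo-intrinsic for vfsgnjn vd, vs, vs, so this should be
 * functionally identical to the macro above. */
#define VNEG(a) TYPE(vfneg_v)(a, 2*VL)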
Vint idx = TYPEINT(vid_v)(2*VL); // (0, 1, 2, 3, ...)
//Vint vnotone = TYPEINT(vmv_v_x)(DS(~1ull,~1), 2*VL);
Vint hidx = TYPEINT(vand_vx)(idx, DS(~1ull,~1), 2*VL); // (0, 0, 2, 2, ...)
return TYPE(vrgather_vv)(x, hidx, 2*VL);
The destination index of vrgather_vv depends on values held in a vector register, which makes it relatively unpredictable for the compiler and the microarchitecture. Maybe the vslide instructions would have more regular behaviour and better performance?
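A rough sketch of what a slide-based variant could look like, reusing the V/Vint types and the TYPE()/TYPEINT()/INT2MASK() wrappers from this file and assuming the v1.0 operand order for vslideup (destination first, so lane 0 is preserved) and vmerge (false operand, true operand, mask, vl). Whether it is actually faster than the gather will depend on the microarchitecture:

/* Hypothetical vslideup/vmerge replacement for the vrgather above (sketch,
 * not code from the PR).  Sliding x up by one lane moves every even-lane
 * (real) element into the following odd lane; merging on an odd-lane mask
 * then keeps the even lanes from the original x:
 *
 *   x      = (r0, i0, r1, i1, r2, i2, ...)
 *   slid   = (r0, r0, i0, r1, i1, r2, ...)   after vslideup by 1 (vd = x)
 *   result = (r0, r0, r1, r1, r2, r2, ...)   odd lanes taken from slid
 */
Vint idx  = TYPEINT(vid_v)(2*VL);              // (0, 1, 2, 3, ...)
Vint odd  = TYPEINT(vand_vx)(idx, 1, 2*VL);    // (0, 1, 0, 1, ...)
V    slid = TYPE(vslideup_vx)(x, x, 1, 2*VL);  // lane 0 kept from x
return TYPE(vmerge_vvm)(x, slid, INT2MASK(odd), 2*VL); // odd lanes from slid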
//Vint vone = TYPEINT(vmv_v_x)(1, 2*VL);
Vint hidx = TYPEINT(vand_vx)(idx, 1, 2*VL); // (0, 1, 0, 1, 0, 1)
//return TYPEMASK(vfsgnjn_vv)(x, x, x, INT2MASK(hidx), 2*VL);
return TYPEMERGEDMASK(vfsgnjn_vv)(INT2MASK(hidx), x, x, x, 2*VL);
Mask registers are a scarce resource in RVV, and the mask-undisturbed policy may hinder out-of-order execution, so an unmasked version may have benefits.
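For illustration, one way to drop the mask entirely here (a sketch, not code from the PR): shift the lane-parity vector into the sign-bit position and use sign-injection-xor. INT2FP() is an assumed reinterpret wrapper (around __riscv_vreinterpret) that does not exist in the original code; the other macros are the ones already used in this file, with DS() selecting 63 or 31 for double or single precision:

/* Hypothetical unmasked variant (sketch).  odd = (0, 1, 0, 1, ...); shifting
 * it into the sign-bit position gives (+0.0, -0.0, +0.0, -0.0, ...) when
 * viewed as floats, and vfsgnjx then flips the sign of the odd (imaginary)
 * lanes only, with no mask register and no undisturbed-policy constraint. */
Vint idx  = TYPEINT(vid_v)(2*VL);                    // (0, 1, 2, 3, ...)
Vint odd  = TYPEINT(vand_vx)(idx, 1, 2*VL);          // (0, 1, 0, 1, ...)
Vint sbit = TYPEINT(vsll_vx)(odd, DS(63, 31), 2*VL); // sign bit in odd lanes
return TYPE(vfsgnjx_vv)(x, INT2FP(sbit), 2*VL);      // negate odd lanes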
Can we take a more "practical" approach to the vector length, and only support lengths for which silicon actually exists, and maybe twice that for good measure? Or is there any actual 16384-bit silicon? If you could replace the vtw.h thing with the new macros it would be great.
There is, actually, though it's research silicon for now. The European Processor Initiative (EPI) test chips EPAC v1.0 and v1.5 contain a 16384-bit (VL=256 @ 64 bits) vector unit (VPU). Those only implement v0.7.1 of the specification, but the next tape-out, in a different European project, will support v1.0 instead. In fact, it's the hardware for which the code was originally written (way back, with EPI-specific intrinsics). There's also an FPGA-based Software Development Platform for EPI partners. That being said, I agree that the wider vectors are not likely to be useful to the general public and are bloating the libraries. A solution could be to add a "--max-scalable-width" option for SVE and RISC-V, probably defaulting to 512 (the A64FX width), that would only build the codelets up to that size. That would avoid the bloat of unused codelets. The extra files/directories needed to support the additional widths in the source code are quite small and very similar, so probably not an issue. Then users with specific requirements can rebuild with the appropriate value for more "exotic" hardware.
I'll try to do it ASAP, but I wanted to bring the code forward now that SVE is merged (so the principle seems sound) and hardware is actually available commercially, even if with limited width. I do still wonder whether scalable vectors, or in fact all SIMD with proper mask support, couldn't be used more efficiently. Given the current usage model of SIMD in FFTW3, in theory it should be possible to implement the codelets with the internal loop in scalable/masked mode, i.e. replace the fixed-VL inner loop (here from n1fv_16.c) by a dynamic-length equivalent. Alternatively, it could be two loops: an unmasked main loop plus a masked (or shorter-length) pass for the remaining iterations. However, VL is baked into other places used by the planner infrastructure, and I'm not sure how easy/achievable that would be. But it would be cleaner and would enable more use cases for SIMD, as this can be used to implement any number of iterations rather than just multiples of VL. AVX-512 (and newer) would also potentially benefit, as it can be masked just as well.
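To make the idea concrete, here is an illustrative strip-mined loop (not the actual codelet code; INT, R, v, ivs, ovs, ri/ii/ro/io and do_one_transform are placeholders standing in for the codelet's types, loop variables and body, and double precision with LMUL=1 is assumed). Instead of stepping by a fixed VL and requiring the vector count to be a multiple of it, vsetvl picks the active length on every pass, so the last pass simply runs shorter:

#include <riscv_vector.h>
#include <stddef.h>

typedef ptrdiff_t INT;   /* FFTW-style signed integer type (assumption)    */
typedef double R;        /* FFTW-style real type, double precision assumed */

/* Placeholder for a vl-wide codelet body; not a real FFTW function. */
extern void do_one_transform(R *ri, R *ii, R *ro, R *io, size_t vl);

static void loop_dynamic_vl(R *ri, R *ii, R *ro, R *io,
                            INT v, INT ivs, INT ovs)
{
    for (INT i = v; i > 0; ) {
        size_t vl = __riscv_vsetvl_e64m1((size_t)i); /* active elements this pass */
        do_one_transform(ri, ii, ro, io, vl);        /* vl-wide codelet body      */
        i  -= (INT)vl;
        ri += (INT)vl * ivs;  ii += (INT)vl * ivs;   /* advance by the actual vl  */
        ro += (INT)vl * ovs;  io += (INT)vl * ovs;
    }
}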
Re: 16384 bits. I would do something like --enable-sve enables <= 512 or whatever is a reasonable upper bound today, and --enable-large-sve enables <= 16384. Basically pretend it's another SIMD architecture. I don't feel strongly about this, though. Let's try to capture the practical common case + the exceptional case.

Re: dynamic VL. As you may have surmised, our SIMD theory is a bit long in the tooth. We hacked something when there was only SSE and 3DNow!, then SSE2 came and the rest is history. I would probably leave the existing SIMD stuff alone and create a new set of codelets. Thus we could have scalar/, simd/, and dynsimd/ codelets. The scalar/ codelets are already "vectorized" in the sense that they compute a vector of transforms, not just one. If I am reading what you say correctly, the dynsimd/ codelets would be the same as scalar, except perhaps with some restrictions on the strides. Every "genus" of codelets defines a function okp() that tells the planner whether or not the codelet is applicable, so this restriction would be easy to capture.
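A hypothetical sketch of the kind of applicability check a dynsimd genus might use (names, the signature and the stride restriction are placeholders, not FFTW's actual internals; the real okp() signatures differ per genus). The vector count v is deliberately left unconstrained, since a dynamic-length loop handles any count:

typedef ptrdiff_t INT;   /* FFTW-style signed integer type (assumption) */

/* Hypothetical okp() for a dynsimd codelet (sketch only). */
static int okp_dynsimd(INT is, INT os, INT ivs, INT ovs, INT v)
{
    (void)v;                     /* any vector count works with dynamic vl */
    (void)is; (void)os;
    /* Example restriction only: require contiguous complex vectors so the
       codelet can use unit-stride (or segment) loads. */
    return ivs == 2 && ovs == 2;
}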
The code was originally based on the SVE code and used EPI intrinsics. It was further refined on my BananaPi F3, and some improvements were inspired by the code from #279; overall it is faster on the F3 (see a sample graph in #371).
It's currently still using the generated vtw.h; it still needs to be switched to the same macro-based approach as upstream's merged SVE (see commit 126e3b9).