Merging of RISC-V V1.0 support. #377
base: master
Conversation
This is a cleaned-up version of the riscv-v branch from rdolbeau/fftw3. There is no runtime scalability; instead, multiple sets of masked operations support every width that is a multiple of 128 bits, up to 16384 bits so as to cover the EPAC VPU design. The code now uses the official V1.0 intrinsics and was tested with gcc-14 and clang-18. It can be tested on QEMU, or on hardware with V1.0 support; it was thoroughly tested on a BananaPi F3 with a SpacemiT K1 SoC, which supports V1.0 with 256-bit registers.
This contains both the "standard" implementation (real and imaginary parts stored interleaved in a single vector register) and the "split" implementation (real and imaginary parts stored in two different registers). However, only the "standard" one is used for now; testing the split version requires the manual step of copying the file over. In theory, both could be enabled at the same time, but that would be a lot of codelets. The "split" version is very register-intensive and doesn't play well with larger codelets, hence the choice of "standard" by default. Feedback welcome.
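For readers unfamiliar with the two layouts, here is a minimal illustration (not code from this PR; it assumes double precision, LMUL=1 and the standard v1.0 intrinsics, and the PR's own header may load data differently) of how the same interleaved complex array could be brought into registers in each scheme:

#include <riscv_vector.h>

/* Illustration only: FFTW keeps complex data interleaved in memory as
 * r0, i0, r1, i1, ...  The "standard" kernels keep that interleaving in a
 * single register; the "split" kernels separate real and imaginary parts
 * into two registers (shown here with strided loads). */
static void load_both_layouts(const double *x, size_t vl)
{
    /* "standard": one register holds r0, i0, r1, i1, ... */
    vfloat64m1_t interleaved = __riscv_vle64_v_f64m1(x, 2 * vl);

    /* "split": two registers hold r0, r1, ... and i0, i1, ...
       (byte stride of one complex element = 2 * sizeof(double)) */
    vfloat64m1_t re = __riscv_vlse64_v_f64m1(x,     2 * sizeof(double), vl);
    vfloat64m1_t im = __riscv_vlse64_v_f64m1(x + 1, 2 * sizeof(double), vl);

    (void)interleaved; (void)re; (void)im;
}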
Test release (with all codelets pre-generated) available at https://github.com/rdolbeau/fftw3/releases/tag/r5v-test-release-005
Great work! This branch currently uses "r5v" as its arch tag. Might "rvv" be the more commonly used name? Or, following the "V" specification, would Zvl128b/256b... be more official? I notice that the arch tags in OpenBLAS use Zvl128b/256b.
return TYPE(vrgather_vv)(x, hidx, 2*VL);
}

#define VNEG(a) TYPE(vfsgnjn_vv)(a,a,2*VL)
Could be simplified to vfneg_v.
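For reference, a sketch of the suggested change, keeping the PR's TYPE() wrapper macro (assuming it pastes the type suffix for vfneg_v the same way it does for vfsgnjn_vv). vfneg is specified as a pseudo-instruction for vfsgnjn vd, vs, vs, so the generated code should be identical, just easier to read:

/* Possible simplification (sketch, not code from the PR): vfneg_v is the
 * v1.0 pseudo-intrinsic for vfsgnjn vd, vs, vs, so this should be
 * functionally identical to the macro above. */
#define VNEG(a) TYPE(vfneg_v)(a, 2*VL)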
Vint idx = TYPEINT(vid_v)(2*VL); // (0, 1, 2, 3, ...)
//Vint vnotone = TYPEINT(vmv_v_x)(DS(~1ull,~1), 2*VL);
Vint hidx = TYPEINT(vand_vx)(idx, DS(~1ull,~1), 2*VL); // (0, 0, 2, 2, ...)
return TYPE(vrgather_vv)(x, hidx, 2*VL);
The destination index of vrgather_vv depends on values held in a vector register, which makes it relatively unpredictable for the compiler and the microarchitecture. Maybe the vslide instructions would have more regular behaviour and better performance?
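A rough sketch of what a slide-based variant could look like, reusing the V/Vint types and the TYPE()/TYPEINT()/INT2MASK() wrappers from this file and assuming the v1.0 operand order for vslideup (destination first, so lane 0 is preserved) and vmerge (false operand, true operand, mask, vl). Whether it is actually faster than the gather will depend on the microarchitecture:

/* Hypothetical vslideup/vmerge replacement for the vrgather above (sketch,
 * not code from the PR).  Sliding x up by one lane moves every even-lane
 * (real) element into the following odd lane; merging on an odd-lane mask
 * then keeps the even lanes from the original x:
 *
 *   x      = (r0, i0, r1, i1, r2, i2, ...)
 *   slid   = (r0, r0, i0, r1, i1, r2, ...)   after vslideup by 1 (vd = x)
 *   result = (r0, r0, r1, r1, r2, r2, ...)   odd lanes taken from slid
 */
Vint idx  = TYPEINT(vid_v)(2*VL);              // (0, 1, 2, 3, ...)
Vint odd  = TYPEINT(vand_vx)(idx, 1, 2*VL);    // (0, 1, 0, 1, ...)
V    slid = TYPE(vslideup_vx)(x, x, 1, 2*VL);  // lane 0 kept from x
return TYPE(vmerge_vvm)(x, slid, INT2MASK(odd), 2*VL); // odd lanes from slid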
//Vint vone = TYPEINT(vmv_v_x)(1, 2*VL);
Vint hidx = TYPEINT(vand_vx)(idx, 1, 2*VL); // (0, 1, 0, 1, 0, 1)
//return TYPEMASK(vfsgnjn_vv)(x, x, x, INT2MASK(hidx), 2*VL);
return TYPEMERGEDMASK(vfsgnjn_vv)(INT2MASK(hidx), x, x, x, 2*VL);
Mask registers are a scarce resource in RVV, and the mask-undisturbed policy may hinder out-of-order execution, so an unmasked version may have benefits.
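For illustration, one way to drop the mask entirely here (a sketch, not code from the PR): shift the lane-parity vector into the sign-bit position and use sign-injection-xor. INT2FP() is an assumed reinterpret wrapper (around __riscv_vreinterpret) that does not exist in the original code; the other macros are the ones already used in this file, with DS() selecting 63 or 31 for double or single precision:

/* Hypothetical unmasked variant (sketch).  odd = (0, 1, 0, 1, ...); shifting
 * it into the sign-bit position gives (+0.0, -0.0, +0.0, -0.0, ...) when
 * viewed as floats, and vfsgnjx then flips the sign of the odd (imaginary)
 * lanes only, with no mask register and no undisturbed-policy constraint. */
Vint idx  = TYPEINT(vid_v)(2*VL);                    // (0, 1, 2, 3, ...)
Vint odd  = TYPEINT(vand_vx)(idx, 1, 2*VL);          // (0, 1, 0, 1, ...)
Vint sbit = TYPEINT(vsll_vx)(odd, DS(63, 31), 2*VL); // sign bit in odd lanes
return TYPE(vfsgnjx_vv)(x, INT2FP(sbit), 2*VL);      // negate odd lanes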
Can we take a more "practical" approach to the vector length, and only support lengths for which silicon actually exists, and maybe twice that for good measure? Or is there any actual 16384-bit silicon? If you could replace the vtw.h thing with the new macros it would be great.
There is, actually, though it's research silicon for now. The European Processor Initiative (EPI) test chips EPAC v1.0 and v1.5 contain a 16384-bit (VL=256 @ 64 bits) vector unit (VPU). Those only implement v0.7.1 of the specification, but the next tape-out, in a different European project, will support v1.0 instead. In fact, it's the hardware for which the code was originally written (way back, with EPI-specific intrinsics). There's also an FPGA-based Software Development Platform for EPI partners. That being said, I agree that the wider vectors are not likely to be useful to the general public and are bloating the libraries. A solution could be to add a "--max-scalable-width" option for SVE and RISC-V, probably defaulting to 512 (the A64FX width), that would only build the codelets up to that size. That would avoid the bloat of unused codelets. The extra files/directories needed to support the additional widths in the source code are quite small and very similar, so probably not an issue. Then users with specific requirements can rebuild with the appropriate value for more "exotic" hardware.
I'll try to do it ASAP, but I wanted to bring the code forward now that SVE is merged (so the principle seems sound) and hardware is actually available commercially, even if with limited width. I do still wonder whether scalable vectors, or in fact all SIMD with proper mask support, couldn't be used more efficiently. Given the current usage model of SIMD in FFTW3, in theory it should be possible to implement the codelets with the internal loop in scalable/masked mode, i.e. replace the fixed-VL inner loop (here from n1fv_16.c) by a dynamic-length equivalent. Alternatively, it could be two loops: an unmasked main loop plus a masked (or shorter-length) pass for the remaining iterations. However, VL is baked into other places used by the planner infrastructure, and I'm not sure how easy/achievable that would be. But it would be cleaner and would enable more use cases for SIMD, as this can be used to implement any number of iterations rather than just multiples of VL. AVX-512 (and newer) would also potentially benefit, as it can be masked just as well.
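To make the idea concrete, here is an illustrative strip-mined loop (not the actual codelet code; INT, R, v, ivs, ovs, ri/ii/ro/io and do_one_transform are placeholders standing in for the codelet's types, loop variables and body, and double precision with LMUL=1 is assumed). Instead of stepping by a fixed VL and requiring the vector count to be a multiple of it, vsetvl picks the active length on every pass, so the last pass simply runs shorter:

#include <riscv_vector.h>
#include <stddef.h>

typedef ptrdiff_t INT;   /* FFTW-style signed integer type (assumption)    */
typedef double R;        /* FFTW-style real type, double precision assumed */

/* Placeholder for a vl-wide codelet body; not a real FFTW function. */
extern void do_one_transform(R *ri, R *ii, R *ro, R *io, size_t vl);

static void loop_dynamic_vl(R *ri, R *ii, R *ro, R *io,
                            INT v, INT ivs, INT ovs)
{
    for (INT i = v; i > 0; ) {
        size_t vl = __riscv_vsetvl_e64m1((size_t)i); /* active elements this pass */
        do_one_transform(ri, ii, ro, io, vl);        /* vl-wide codelet body      */
        i  -= (INT)vl;
        ri += (INT)vl * ivs;  ii += (INT)vl * ivs;   /* advance by the actual vl  */
        ro += (INT)vl * ovs;  io += (INT)vl * ovs;
    }
}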
Re: 16384 bits. I would do something like --enable-sve enables <= 512 or whatever is a reasonable upper bound today, and --enable-large-sve enables <= 16384. Basically pretend it's another SIMD architecture. I don't feel strongly about this, though. Let's try to capture the practical common case + the exceptional case.

Re: dynamic VL. As you may have surmised, our SIMD theory is a bit long in the tooth. We hacked something when there was only SSE and 3DNow!, then SSE2 came and the rest is history. I would probably leave the existing SIMD stuff alone and create a new set of codelets. Thus we could have scalar/, simd/, and dynsimd/ codelets. The scalar/ codelets are already "vectorized" in the sense that they compute a vector of transforms, not just one. If I am reading what you say correctly, the dynsimd/ codelets would be the same as scalar, except perhaps with some restrictions on the strides. Every "genus" of codelets defines a function okp() that tells the planner whether or not the codelet is applicable, so this restriction would be easy to capture.
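A hypothetical sketch of the kind of applicability check a dynsimd genus might use (names, the signature and the stride restriction are placeholders, not FFTW's actual internals; the real okp() signatures differ per genus). The vector count v is deliberately left unconstrained, since a dynamic-length loop handles any count:

typedef ptrdiff_t INT;   /* FFTW-style signed integer type (assumption) */

/* Hypothetical okp() for a dynsimd codelet (sketch only). */
static int okp_dynsimd(INT is, INT os, INT ivs, INT ovs, INT v)
{
    (void)v;                     /* any vector count works with dynamic vl */
    (void)is; (void)os;
    /* Example restriction only: require contiguous complex vectors so the
       codelet can use unit-stride (or segment) loads. */
    return ivs == 2 && ovs == 2;
}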
The code was originally based on the SVE code and used EPI intrinsics. It was further refined on my BananaPi F3, and some improvements were inspired by the code from #279; overall it is faster on the F3 (see a sample graph in #371).
It's currently still using the generated vtw.h; it still needs to be switched to the same macro-based approach as upstream's merged SVE (see commit 126e3b9).