Add skew-symmetric BLAS operations by devinamatthews · Pull Request #805 · flame/blis

devinamatthews · 2024-04-26T22:59:09Z

This PR adds a number of level-2 and level-3 skew-symmetric (and skew-hermitian) BLAS operations, defining the essential operations of a "Skew-BLAS" interface. These operations have been added as full 1st-class citizens of the BLIS API complete with testsuite and mixed-precision/mixed-domain support (level-3 only).

This operation negates a scalar (both real and imaginary parts).

- Add `bli_?negs/bli_?negris` to negate a scalar. - Add `bli_?setr0s` to zero out only the real part of a complex scalar. - Add (void) to silence unused variable warnings in several level-0 macros.

Add `mkskewsymm` and `mkskewherm` operations to explicit skew-symmetrize or skew-hermitize a matrix. For a skew-symmetric matrix, the diagonal is explicitly set to zero, while for a skew-hermitian matrix the real part of the diagonal is set to zero.

Add `BLIS_SKEW_SYMMETRIC` and `BLIS_SKEW_HERMITIAN` matrix structures along with associated help functions and macros. Note that this requires increasing the number of bits used to represent a `struc_t` in the `obj_t::info` member. A compile-time check has also been added to prevent against accidental bit overflow in the future.

This operation sets only the real part of a matrix diagonal to the given value.

The function signature for dotaxpyf has been changed to allow different `alpha` values for the dot and axpy sub-problems. This is needed to support skew-symmetric operations which differ in more than just conjugation of A and A^T.

Add `skmv` (skew-symmetric matrix times vector), `shmv` (skew-hermitian matrix times vector), `skr2` (skew-symmetric rank-2 update), and `shr2` (skew-hermitian rank-2 update) operations. Note that a rank-1 skew-symmetric update is not possible, and a rank-1 skew-hermitian update is not particularly useful.

The reference packing kernels have been updated to support skew-symmetric and skew-hermitian matrix structures. No updates to the dense reference packing kernel (`bli_?packm_ckx_<arch>_ref`) or to any optimized packing kernels, since `bli_?packm_struc_cxk` handles the negation of the unstored region by modifying `kappa`.

Add `skmm` (skew-symmetric matrix times dense matrix), `shmm` (skew-hermitian matrix times dense matrix), `skr2k` (skew-symmetric rank-2k update), and `shr2k` (skew-hermitian rank-2k update) operations. Note that a rank-k skew-symmetric update is not possible, and a rank-k skew-hermitian update is not particularly useful.

[ci skip]

devinamatthews · 2024-04-26T23:02:27Z

@myeh01 @nick-knight @Aaron-Hutchinson can the SiFive team please review commit b986782? I had to delve into the RISC-V assembly there and I'm only ~80% sure I did it right.

devinamatthews · 2024-04-26T23:03:06Z

@fgvanzee again the Travis CI build failed to trigger...

fgvanzee · 2024-04-26T23:05:03Z

@fgvanzee again the Travis CI build failed to trigger...

I don't remember if there was anything we could do to fix it on our end. Might be that we just have to wait and then make a dummy commit to try to trigger again?

devinamatthews · 2024-04-26T23:06:57Z

I don't remember either but it is annoying

devinamatthews · 2024-04-26T23:10:56Z

There we go

nick-knight · 2024-04-26T23:11:34Z

@devinamatthews Confirming I got your message. It looks like the register allocation in dotxaxpyf changed: thanks for taking a stab at it. I think @myeh01 wrote this one, I'll ask him to review it. (Michael, it looks like what's changed is the microkernel now has separate alpha values for the "dot" and "axpy" parts.)

devinamatthews · 2024-04-27T04:47:41Z

Travis CI failed for x280 so I guess I did do something wrong.

myeh01 · 2024-04-27T08:02:52Z

Running the testsuite, it looks like s and d are passing while c and z are failing. For c and z, when I change the register allocation from fa4, fa5 to fa0, fa1 for alphaw, it passes the dotxaxpyf tests locally, so I think Devin modified the code correctly. My guess is the inline assembly I wrote is not generating the code I want it to (maybe the compiler is overwriting some floating-point registers in between asm blocks). I'll need some time to study the objdump to see what's going on.

myeh01 · 2024-04-29T10:57:40Z

After looking at the objdump, it looks like the compiler is using fa4 and fa5 in some branches involving floating point comparisons. To prevent this issue from reappearing, I would like to go back through all the inline assembly files I wrote and replace any manually allocated floating point (and integer) registers with C variables. I should be able to get it done this week. What would be the best way to get these changes merged into the repo? Should I submit a PR to this branch or master?

nick-knight · 2024-04-29T17:56:23Z

@myeh01 A quicker fix might be to add clobbers. We should be using these anyway whenever we use inline asm with explicit register allocation, whether X-, F-, or V-registers.

Going the other direction, I think we might be better off using generic C for all the scalar stuff, and only using inline asm for the vector stuff (when intrinsics don't suffice). I don't think we'll lose much in performance, and it would make the code much more maintainable and retargetable.

myeh01 · 2024-04-29T20:11:40Z

@nick-knight I originally tried just adding the output register to the clobber list of any floating-point load, e.g.

__asm__(FLT_LOAD "fa5, %1(%0)" : : "r"(alphaw), "I"(FLT_SIZE) : "fa5");

But the compiler still uses fa5 for floating-point comparisons. Does the clobber only apply to that block of inline asm? Or maybe I did this incorrectly and should add clobbers in other places as well. There's also this, but then we would have to use C variables anyways.

Edit: Yeah, I think replacing the scalar stuff with generic C would be more robust. I started doing this for some parts of cdotxaxpyf and it wasn't too hard.

Edit 2: After looking through some of the code again, I think I would also like to replace some inline asm code with intrinsics.

nick-knight · 2024-04-29T20:38:41Z

Does the clobber only apply to that block of inline asm?

Correct. If you don't want the compiler to overwrite fa5 between this asm statement and a subsequent one, you'll have to merge them into the same statement. This tends to snowball, necessitating rewriting conditionals and loops in assembly, etc.

There's also this, but then we would have to use C variables anyways.

Correct.

I think replacing the scalar stuff with generic C would be more robust.

I agree.

Our code still has lurking risks related to our explicit allocation of V-registers: we are trusting the compiler not to generate any vector code between each pair of asm statements with a RAW dependence through a V-register. Hardening this will probably snowball in the manner described above, so I don't propose we attempt to tackle it here.

myeh01 · 2024-04-29T21:54:50Z

@devinamatthews How would you like to proceed? There are a few short-term solutions we discussed above. Longer-term, I'd like to rewrite the inline assembly files to be more robust (probably using intrinsics where it won't significantly impact performance). Please let me know what I can do to help.

nick-knight · 2024-04-29T22:31:52Z

One perspective is that this ukernel interface change exposed a bug in SiFive's implementation of the legacy ukernel interface. To proceed, the defective ukernel implementation could be deleted in this PR --- reverting to a generic implementation --- and then reintroduced, upgraded and corrected, in a subsequent PR. This might be the cleanest way forward.

devinamatthews · 2024-04-29T22:45:43Z

I'm not in a huge rush to merge. If it takes say a month or less to fix it properly then I can wait. Otherwise yes we could revert to generic and fix later. This wouldn't require deleting anything, just commenting out the kernel registration.

myeh01 · 2024-04-30T00:08:07Z

Mind if we sync up in a week or two? I'll start working on it this week and hopefully by then I'll have a sense of how much more time it would take.

devinamatthews · 2024-04-30T04:00:53Z

Sure.

myeh01 · 2024-05-10T23:11:39Z

@devinamatthews I'm steadily working through cleaning up all the kernels, but I don't think I'll be able to finish it in the next two weeks. I'm also trying to balance this work with other projects I need to work on, so it may take a few more weeks. It may be best to follow Nick's suggestion and temporarily disable the ~~sifive_x280 config~~ failing microkernel (sorry, misread Nick's comment form earlier) until the clean up is complete.

nick-knight · 2024-05-10T23:14:37Z

It may be best to follow Nick's suggestion and temporarily disable the sifive_x280 config until the clean up is complete.

I don't propose disabling the whole configuration, just removing the one ukernel that's causing issues. IIUC, this will cause BLIS to default to a generic/reference implementation.

devinamatthews · 2024-05-14T19:47:58Z

@myeh01 We've decided not to include this PR in the next release so there's not much time pressure.

myeh01 · 2024-10-02T00:01:35Z

@devinamatthews I opened a PR (#822) converting all our inline assembly to intrinsics, not sure if you've seen it yet.

@myeh01

Details: - Replace all assembly kernels in the `sifive_x280` kernel set with intrinsic versions. - Fixes bug encountered in #805. - Update the RISC-V toolchain used in CI testing. - Special thanks to Michael Yeh (@myeh01) and SiFive.

devinamatthews added 10 commits April 25, 2024 21:45

Add a negsc level-0 scalar operation.

6740b1f

This operation negates a scalar (both real and imaginary parts).

Add/modify level-0 scalar macros.

9c0cec8

- Add `bli_?negs/bli_?negris` to negate a scalar. - Add `bli_?setr0s` to zero out only the real part of a complex scalar. - Add (void) to silence unused variable warnings in several level-0 macros.

Add a setrd level-1d operation.

3a52b71

This operation sets only the real part of a matrix diagonal to the given value.

Change dotaxpyf microkernel signature.

b986782

The function signature for dotaxpyf has been changed to allow different `alpha` values for the dot and axpy sub-problems. This is needed to support skew-symmetric operations which differ in more than just conjugation of A and A^T.

Typo fix in configure

06829d0

[ci skip]

devinamatthews requested a review from fgvanzee April 26, 2024 23:02

Trigger CI build

129ce04

myeh01 mentioned this pull request Aug 1, 2024

Use intrinsics for all sifive_x280 kernels #822

Merged

devinamatthews added 2 commits November 28, 2024 18:22

Merge branch 'master' into skew-blas

8b2d49f

Attempt to modify new intrinsic SiFive x280 kernels.

8a8bf40

Conversation

devinamatthews commented Apr 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devinamatthews commented Apr 26, 2024

Uh oh!

devinamatthews commented Apr 26, 2024

Uh oh!

fgvanzee commented Apr 26, 2024

Uh oh!

devinamatthews commented Apr 26, 2024

Uh oh!

devinamatthews commented Apr 26, 2024

Uh oh!

nick-knight commented Apr 26, 2024

Uh oh!

devinamatthews commented Apr 27, 2024

Uh oh!

myeh01 commented Apr 27, 2024

Uh oh!

myeh01 commented Apr 29, 2024

Uh oh!

nick-knight commented Apr 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

myeh01 commented Apr 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nick-knight commented Apr 29, 2024

Uh oh!

myeh01 commented Apr 29, 2024

Uh oh!

nick-knight commented Apr 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devinamatthews commented Apr 29, 2024

Uh oh!

myeh01 commented Apr 30, 2024

Uh oh!

devinamatthews commented Apr 30, 2024

Uh oh!

myeh01 commented May 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nick-knight commented May 10, 2024

Uh oh!

devinamatthews commented May 14, 2024

Uh oh!

myeh01 commented Oct 2, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

devinamatthews commented Apr 26, 2024 •

edited

Loading

nick-knight commented Apr 29, 2024 •

edited

Loading

myeh01 commented Apr 29, 2024 •

edited

Loading

nick-knight commented Apr 29, 2024 •

edited

Loading

myeh01 commented May 10, 2024 •

edited

Loading