This operation negates a scalar (both real and imaginary parts).
- Add `bli_?negs/bli_?negris` to negate a scalar.
- Add `bli_?setr0s` to zero out only the real part of a complex scalar.
- Add (void) to silence unused variable warnings in several level-0 macros.
Add `mkskewsymm` and `mkskewherm` operations to explicitly skew-symmetrize or skew-hermitize a matrix. For a skew-symmetric matrix, the diagonal is explicitly set to zero, while for a skew-hermitian matrix only the real part of the diagonal is set to zero.
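As a rough illustration of what explicit skew-symmetrization means, here is a minimal sketch. It assumes column-major storage with the lower triangle stored. The function names and the `zsketch_t` type are hypothetical and are not the BLIS API or its object-based interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex type for this sketch only (not a BLIS type). */
typedef struct { double re, im; } zsketch_t;

/* Sketch: fill the upper triangle with A(j,i) = -A(i,j) and zero the diagonal.
   Assumes column-major storage, i.e. A(i,j) lives at a[i + j*lda]. */
void mk_skew_symm_lower(size_t n, double *a, size_t lda)
{
    assert(lda >= n);
    for (size_t j = 0; j < n; ++j) {
        a[j + j * lda] = 0.0;                  /* skew-symmetric: diagonal is zero */
        for (size_t i = j + 1; i < n; ++i)
            a[j + i * lda] = -a[i + j * lda];  /* reflect with a sign flip */
    }
}

/* Sketch: fill the upper triangle with A(j,i) = -conj(A(i,j)) and zero only
   the real part of the diagonal (the imaginary part is kept). */
void mk_skew_herm_lower(size_t n, zsketch_t *a, size_t lda)
{
    assert(lda >= n);
    for (size_t j = 0; j < n; ++j) {
        a[j + j * lda].re = 0.0;               /* diagonal is purely imaginary */
        for (size_t i = j + 1; i < n; ++i) {
            a[j + i * lda].re = -a[i + j * lda].re;  /* -conj(z) = -re + i*im */
            a[j + i * lda].im =  a[i + j * lda].im;
        }
    }
}
```

The asymmetry between the two cases follows from the definitions: A = -A^T forces every diagonal entry to equal its own negation, while A = -A^H only forces the real part of each diagonal entry to vanish.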
Add `BLIS_SKEW_SYMMETRIC` and `BLIS_SKEW_HERMITIAN` matrix structures along with associated helper functions and macros. Note that this requires increasing the number of bits used to represent a `struc_t` in the `obj_t::info` member. A compile-time check has also been added to guard against accidental bit overflow in the future.
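The kind of compile-time check described here could look roughly like the sketch below. The enum, the shift, and the field width are made up for illustration and do not reflect the actual `obj_t::info` bit layout. Only the technique (a `static_assert` tying the number of enum values to the field width) is the point.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical structure enum; the real BLIS struc_t values differ. */
typedef enum {
    MY_GENERAL = 0,
    MY_HERMITIAN,
    MY_SYMMETRIC,
    MY_TRIANGULAR,
    MY_SKEW_HERMITIAN,
    MY_SKEW_SYMMETRIC,
    MY_NUM_STRUCS          /* must stay last: counts the values above */
} my_struc_t;

/* Hypothetical position and width of the structure field in a 32-bit word. */
#define MY_STRUC_SHIFT    27u
#define MY_STRUC_NUM_BITS 3u
#define MY_STRUC_BITS \
    (((UINT32_C(1) << MY_STRUC_NUM_BITS) - 1u) << MY_STRUC_SHIFT)

/* Fails to compile if a new structure no longer fits in the field. */
static_assert(MY_NUM_STRUCS <= (1u << MY_STRUC_NUM_BITS),
              "struc field too narrow for the number of structures");
static_assert(MY_STRUC_SHIFT + MY_STRUC_NUM_BITS <= 32u,
              "struc field runs past the end of the info word");

/* Store a structure value in the field, leaving all other bits untouched. */
uint32_t my_set_struc(uint32_t info, my_struc_t s)
{
    return (info & ~MY_STRUC_BITS) | ((uint32_t)s << MY_STRUC_SHIFT);
}

/* Read the structure value back out of the field. */
my_struc_t my_get_struc(uint32_t info)
{
    return (my_struc_t)((info & MY_STRUC_BITS) >> MY_STRUC_SHIFT);
}
```

With a check of this shape, adding a ninth enum value without widening the field would become a build error rather than a silent corruption of neighboring bits.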
This operation sets only the real part of a matrix diagonal to the given value.
The function signature for dotaxpyf has been changed to allow different `alpha` values for the dot and axpy sub-problems. This is needed to support skew-symmetric operations which differ in more than just conjugation of A and A^T.
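To make the motivation concrete, the sketch below shows a plain real-valued fused dot/axpy loop with two separate scalars. The function name, parameter names, and argument order are assumptions for illustration and are not the actual BLIS `dotaxpyf` signature. It computes y := beta*y + alpha_dot * A^T w and z := z + alpha_axpy * A x for a column-major m-by-b block A.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference loop, not the BLIS kernel interface.
   A is m x b, column-major with leading dimension lda.
   w and z have length m; x and y have length b. */
void dotaxpyf_two_alpha(size_t m, size_t b,
                        double alpha_dot, double alpha_axpy,
                        const double *a, size_t lda,
                        const double *w, const double *x,
                        double beta, double *y, double *z)
{
    assert(lda >= m);
    for (size_t j = 0; j < b; ++j) {
        const double *aj = a + j * lda;       /* column j of A */
        const double chi = alpha_axpy * x[j]; /* scaled x element for the axpy part */
        double rho = 0.0;                     /* accumulator for the dot part */
        for (size_t i = 0; i < m; ++i) {
            rho  += aj[i] * w[i];             /* dot: (A^T w)_j */
            z[i] += chi * aj[i];              /* axpy: z += chi * A(:,j) */
        }
        y[j] = beta * y[j] + alpha_dot * rho;
    }
}
```

In a symmetric matrix-vector product the same block serves as both A and A^T with one scalar. For a skew-symmetric matrix the reflected block is -A^T, so a caller would presumably pass `alpha_dot = -alpha_axpy`, which a single shared `alpha` cannot express.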
Add `skmv` (skew-symmetric matrix times vector), `shmv` (skew-hermitian matrix times vector), `skr2` (skew-symmetric rank-2 update), and `shr2` (skew-hermitian rank-2 update) operations. Note that a rank-1 skew-symmetric update is not possible, and a rank-1 skew-hermitian update is not particularly useful.
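The remark about rank-1 updates can be seen from a small sketch of a rank-2 update. This assumes the update has the form A := A + alpha * (x y^T - y x^T) on a column-major lower triangle. The exact convention and signature used by `skr2` in BLIS may differ, and the function name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference loop: skew-symmetric rank-2 update of the strictly
   lower triangle, A(i,j) += alpha * (x[i]*y[j] - y[i]*x[j]) for i > j.
   The diagonal term x[i]*y[i] - y[i]*x[i] is identically zero, so the
   diagonal is never touched. */
void skr2_lower(size_t n, double alpha,
                const double *x, const double *y,
                double *a, size_t lda)
{
    assert(lda >= n);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = j + 1; i < n; ++i)
            a[i + j * lda] += alpha * (x[i] * y[j] - y[i] * x[j]);
}
```

Setting y equal to x makes every term cancel, which is the rank-1 case: x x^T is symmetric, so it has no skew-symmetric part to add.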
The reference packing kernels have been updated to support skew-symmetric and skew-hermitian matrix structures. No updates were needed to the dense reference packing kernel (`bli_?packm_ckx_<arch>_ref`) or to any optimized packing kernels, since `bli_?packm_struc_cxk` handles the negation of the unstored region by modifying `kappa`.
Add `skmm` (skew-symmetric matrix times dense matrix), `shmm` (skew-hermitian matrix times dense matrix), `skr2k` (skew-symmetric rank-2k update), and `shr2k` (skew-hermitian rank-2k update) operations. Note that a rank-k skew-symmetric update is not possible, and a rank-k skew-hermitian update is not particularly useful.
[ci skip]
|
@myeh01 @nick-knight @Aaron-Hutchinson can the SiFive team please review commit b986782? I had to delve into the RISC-V assembly there and I'm only ~80% sure I did it right. |
|
@fgvanzee again the Travis CI build failed to trigger... |
I don't remember if there was anything we could do to fix it on our end. Might be that we just have to wait and then make a dummy commit to try to trigger again? |
|
I don't remember either but it is annoying |
|
There we go |
|
@devinamatthews Confirming I got your message. It looks like the register allocation in |
|
Travis CI failed for x280 so I guess I did do something wrong. |
|
Running the testsuite, it looks like |
|
After looking at the objdump, it looks like the compiler is using |
|
@myeh01 A quicker fix might be to add clobbers. We should be using these anyway whenever we use inline asm with explicit register allocation, whether X-, F-, or V-registers. Going the other direction, I think we might be better off using generic C for all the scalar stuff, and only using inline asm for the vector stuff (when intrinsics don't suffice). I don't think we'll lose much in performance, and it would make the code much more maintainable and retargetable. |
|
@nick-knight I originally tried just adding the output register to the clobber list of any floating-point load, e.g.
But the compiler still uses
Edit: Yeah, I think replacing the scalar stuff with generic C would be more robust. I started doing this for some parts of
Edit 2: After looking through some of the code again, I think I would also like to replace some inline asm code with intrinsics. |
Correct. If you don't want the compiler to overwrite
Correct.
I agree. Our code still has lurking risks related to our explicit allocation of V-registers: we are trusting the compiler not to generate any vector code between each pair of |
|
@devinamatthews How would you like to proceed? There are a few short-term solutions we discussed above. Longer-term, I'd like to rewrite the inline assembly files to be more robust (probably using intrinsics where it won't significantly impact performance). Please let me know what I can do to help. |
|
One perspective is that this ukernel interface change exposed a bug in SiFive's implementation of the legacy ukernel interface. To proceed, the defective ukernel implementation could be deleted in this PR --- reverting to a generic implementation --- and then reintroduced, upgraded and corrected, in a subsequent PR. This might be the cleanest way forward. |
|
I'm not in a huge rush to merge. If it takes say a month or less to fix it properly then I can wait. Otherwise yes we could revert to generic and fix later. This wouldn't require deleting anything, just commenting out the kernel registration. |
|
Mind if we sync up in a week or two? I'll start working on it this week and hopefully by then I'll have a sense of how much more time it would take. |
|
Sure. |
|
@devinamatthews I'm steadily working through cleaning up all the kernels, but I don't think I'll be able to finish it in the next two weeks. I'm also trying to balance this work with other projects I need to work on, so it may take a few more weeks. It may be best to follow Nick's suggestion and temporarily disable the |
I don't propose disabling the whole configuration, just removing the one ukernel that's causing issues. IIUC, this will cause BLIS to default to a generic/reference implementation. |
|
@myeh01 We've decided not to include this PR in the next release so there's not much time pressure. |
|
@devinamatthews I opened a PR (#822) converting all our inline assembly to intrinsics, not sure if you've seen it yet. |
This PR adds a number of level-2 and level-3 skew-symmetric (and skew-hermitian) BLAS operations, defining the essential operations of a "Skew-BLAS" interface. These operations have been added as full 1st-class citizens of the BLIS API complete with testsuite and mixed-precision/mixed-domain support (level-3 only).