Wave Transform to generate SSA Exec mask manipulation instrs #789

lalaniket8 · 2025-12-08T11:26:04Z

Since we are moving Wave Transform to the middle of Register Allocation after PHI-elimination, the Exec Mask Manipulation instructions added to the code by Wave Transform should not be in SSA.
This PR contains code changes to support this.
We remove the SSAUpdater that was originally used and instead use a single Accumulator register to capture contributions from the Thread-level CFG predecessors of a basic block. This Accumulator is used to set the appropriate EXEC mask. The Reset Flag semantics of GCNLaneMaskUpdater are retained and used to reset the Accumulator at the correct points in the code.
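
As a rough illustration of the accumulator idea described above, here is a toy C++ model, not the PR's LLVM code. It assumes a lane mask held in a 64-bit integer and assumes contributions are merged with a bitwise OR; `LaneMask`, `Accumulator`, `merge`, `reset` and `execForBlock` are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Toy model, not LLVM code: one bit per lane (64 lanes assumed).
using LaneMask = std::uint64_t;

// Hypothetical stand-in for the single accumulator register.
struct Accumulator {
  LaneMask Value = 0;

  // A thread-level CFG predecessor contributes the lanes it sends to the
  // target block. Merging by OR is an assumption of this sketch.
  void merge(LaneMask Contribution) { Value |= Contribution; }

  // Models the move-of-zero placed wherever the reset flag applies.
  void reset() { Value = 0; }
};

// EXEC for the target block is read from the accumulator once every
// predecessor has contributed.
inline LaneMask execForBlock(const Accumulator &Acc) { return Acc.Value; }
```

Because there is only one register, a stale value from an earlier path would leak into the next merge unless it is cleared, which is why the reset points matter in this model.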

z1-cciauto · 2025-12-08T11:26:34Z

Failed to trigger build:

github-actions · 2025-12-08T11:26:36Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

cdevadas

Have you done the clang-format? Felt like at some places the format wasn't good.

llvm/lib/CodeGen/MachineRegisterInfo.cpp

llvm/include/llvm/CodeGen/MachineRegisterInfo.h

llvm/lib/Target/AMDGPU/GCNLaneMaskUtils.cpp

z1-cciauto · 2025-12-09T11:00:41Z

Failed to trigger build:

lalaniket8 · 2025-12-09T11:06:44Z

Have you done the clang-format? Felt like at some places the format wasn't good.

Addressed in latest commit

cmc-rep · 2025-12-09T17:34:01Z

I have started reviewing the code change. In the meantime,
we have provided multiple tests under llvm/test/CodeGen/AMDGPU; could we update those tests?

For those ll files, we want to add the run-line to stop after wave-transform, and check generated MIR.
For those MIR files, manually update them to Non-SSA form, and run wave-transform pass.

Actually, for those ll files, I would suggest that we first have a separate PR to add those run-lines and check-results showing what the MIR looks like right before wave-transform. Hopefully those tests are easy to add, so they can get merged before this PR.
This PR will then update those tests with the result after wave-transform. This way, we can compare the MIR before and after wave-transform during code review.

cmc-rep · 2025-12-09T22:02:18Z

llvm/lib/CodeGen/MachineRegisterInfo.cpp

-  Register Reg, MachineBasicBlock &MBB, MachineBasicBlock::iterator I) const {
-  if(I == MBB.begin()) return MBB.end();
+    Register Reg, MachineBasicBlock &MBB, MachineBasicBlock::iterator I) const {
+  if (I == MBB.begin())


This code can be simplified into a while (I != MBB.begin()) { .... } loop, right? Also, returning a pointer to the MachineInstr seems more straightforward.

Yep. Turn it into a for/while loop and add this condition check as the loop terminator. I second that idea of returning a MachineInstr*. All the machinery of checking I != MBB.end() at the call-sites (after this function returns) can be simplified with a !MI.

Address this comment.

We can write a while (I != MBB.begin()) { .... } loop, but its body has to start with I--;, since we want to search from one instruction before I. This will be equivalent to a do{I--; ...}while(I != MBB.begin()) loop guarded by the I == MBB.begin() check.

Working with iterators here instead of MachineInstr* actually makes calling this function simpler, since the call sites work with iterators and can pass end() or begin(). We implement the logic for handling end() once within this function, instead of in multiple places (at each call site).

A well-chosen function name conveys what the function is meant to do and what it returns.
Just like getVRegDef, getDomVRegDefInBasicBlock gives the impression that it returns the defining instruction (MI*, or nullptr if none is found).
I see only one recursive call (inside GCNLaneMaskAnalysis::isSubsetOfExec). If the function returned MI*, MI->getIterator() would easily give the iterator to pass to that recursive call.
The check if (I != MBB.end()) can easily become if (MI) at all the call sites after the function returns.
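
To make the two suggestions above concrete, here is a minimal sketch of a backward search that starts one instruction before the given position and returns a pointer (or nullptr) rather than an iterator. It uses std::list as a stand-in for a basic block; `Instr`, `Block` and `findDefBefore` are hypothetical names, and this is not the actual getDomVRegDefInBasicBlock implementation.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical stand-ins: an instruction that defines one register, and a
// basic block modelled as a linked list of such instructions.
struct Instr {
  int DefReg;
};
using Block = std::list<Instr>;

// Search backwards from I (exclusive) to the start of the block for an
// instruction defining Reg. Returns nullptr when nothing is found, so
// callers test the result directly instead of comparing against end().
const Instr *findDefBefore(const Block &B, Block::const_iterator I, int Reg) {
  // Folding the begin() check into the loop condition also covers the
  // case where the caller passes begin() itself.
  while (I != B.begin()) {
    --I;
    if (I->DefReg == Reg)
      return &*I;
  }
  return nullptr;
}
```

A caller in this model can still pass begin() or end() as the starting position; only the return type changes.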

cmc-rep · 2025-12-09T22:09:45Z

Also please make sure all the code comments are up to date with the code changes. For example, any comment mentioning PHI node is likely out of date.

Also, I feel that we should clean up LaneMaskUtil code that does not really get used. For example, for our application, we always assume accumulating == true. If we are not going to maintain the code that assumes accumulating == false, we may want to delete it. I personally would prefer to keep the code as simple as possible.

llvm/lib/Target/AMDGPU/GCNLaneMaskUtils.cpp

cdevadas · 2025-12-10T04:29:51Z

I have started reviewing the code change . In the meantime, We have provided multiple tests under llvm/test/CodeGen/AMDGPU, could we update those tests?

I thought about that initially. Once this patch gets merged, the next patch will be to enable wave-transform by default, and that would cover all lit tests in the new pipeline. At the moment, most lit tests would break if wave transform is force-enabled as the original implementation depends on the SSAUpdater and introduces PHI nodes. However, it makes sense to add some selected tests to verify the new wave-transform changes.

For those ll files, we want to add the run-line to stop after wave-transform, and check generated MIR.

Better to select some control-flow tests involving loops and if-else and stop-after wave-transform pass. @lalaniket8 can you identify some tests and pre-commit the new changes?

lalaniket8 · 2025-12-10T04:36:25Z

Also please make sure all the code comments are up to date with the code changes. For example, any comment mentioning PHI node is likely out of date.

Should we also remove the SSAReconstructor class in AMDGPUWaveTransform.cpp since that is not needed anymore?

Also, I feel that we should clean up LaneMaskUtil code that does not really get used. For example, for our application, we always assume accumulating == true. If we are not going to maintain the code that assumes accumulating == false, we may want to delete it. I personally would prefer to keep the code as simple as possible.

Yes, I think it is a good idea to remove the Default mode and keep only Accumulating mode; it will simplify the code a lot.
Should I open another PR for cleaning up this part, or add a commit to this PR?

cdevadas · 2025-12-10T06:11:35Z

You can add the clean up in this PR itself.

cdevadas · 2025-12-10T06:23:29Z

Also please make sure all the code comments are up to date with the code changes. For example, any comment mentioning PHI node is likely out of date.

Should we also remove the SSAReconstructor class in AMDGPUWaveTransform.cpp since that is not needed anymore?

How about the second part of the SSAReconstructor that deals with the dominance relation between defs and their respective uses? Keep it for now. Anyway, we have disabled the SSAReconstructor.run() invocation for now. Let's see if any fixup is needed later when we turn on the wave-transform pipeline by default.

z1-cciauto · 2025-12-10T09:27:36Z

Failed to trigger build:

cmc-rep · 2025-12-10T16:02:00Z

Cleanup looks good to me.

In terms of testing, I was suggesting that we first add run-lines for those LL tests to STOP-BEFORE wave-transform in a separate PR. I expect that should work (no crashes). If we can add more tests for more control-flow situations, that would be even better.

In this PR, we should try to turn those STOP-BEFORE lines into STOP-AFTER. We have multiple people here to examine those test results to ensure correctness, which should be a good and healthy exercise.

llvm/lib/CodeGen/MachineRegisterInfo.cpp

llvm/include/llvm/CodeGen/MachineRegisterInfo.h

vg0204 · 2025-12-11T08:53:21Z

llvm/lib/Target/AMDGPU/GCNLaneMaskUtils.cpp

+        I--;
+      BuildMI(*B, I, {}, TII->get(LMU.getLaneMaskConsts().MovOpc), ACC)
+          .addImm(0);
+    }


Can't you insert all resets one after another once you find the right place, rather than searching for the right insertion place for every accumulator to reset? Seems a bit expensive!

No, they need to be inserted at the end of the basic blocks right before the first branch instruction.
When we first identify the inserts in the process() function, more instructions are yet to be added by later iterations in the basic block.
Doing it separately at the end saves us from iterating to the correct insertion point and is a cleaner and less expensive approach.

Makes sense!

The idea of having write to EXEC as MovTermOpc breaks the new flow (non-SSA) as we need to insert the accumulator at the end of the BB, before the actual terminator instructions. The better approach would be to delay the insertion of EXEC write alongside the ACC reset routine. There could be challenges as we might not reset ACC all the time. However, if you knew earlier about the need for ACC reset in the block, we could handle them specially, and it can still be done without introducing MovTermOpc.
For now, we can continue with this choice of Inserting MovTermOpc early and then later changing it to MovOpc while resetting ACC.
This innermost loop in your code currently identifies the insertion point for each ACC. That code should be moved outside the loop. Secondly, once you get the first terminator, there is no need for the while loop to identify the branch instruction. You only need to skip the instruction that writes to the EXEC mask. With the following, we can change MovTermOpc to MovOpc here as well.
for (auto &Entry : AccumulatorResetBlocks) {
  ...
  MachineBasicBlock::iterator I = B->getFirstTerminator();
  if (I is write to EXEC with a MovTermOpc) {
    I->setDesc(TII->get(LMC.MovOpc)); // Demote the terminator MOV to a plain MOV.
    I++;
  }
  // The right insertion point is now identified. Insert the ACC reset for all accumulators.
  for (Register ACC : Accumulators) {
    BuildMI(*B, I, {}, TII->get(LMU.getLaneMaskConsts().MovOpc), ACC)
        .addImm(0);
  }
}
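
The suggestion above can be tried out on a toy model without LLVM. The following is only a sketch under stated assumptions: a block is a std::list of instructions, and `Op`, `Instr`, `isTerminator`, `insertAccResets` and `opcodes` are hypothetical names that merely mimic MovTermOpc, MovOpc and the reset move.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Hypothetical opcodes standing in for the real AMDGPU ones.
enum class Op { Salu, ExecMovTerm, ExecMov, AccReset, Branch };

struct Instr {
  Op Opcode;
  int Reg; // accumulator id for AccReset, -1 otherwise
};
using Block = std::list<Instr>;

bool isTerminator(const Instr &I) {
  return I.Opcode == Op::ExecMovTerm || I.Opcode == Op::Branch;
}

// Find the insertion point once per block, demote a leading terminator
// EXEC write to a plain move, then emit every accumulator reset there.
void insertAccResets(Block &B, const std::vector<int> &Accumulators) {
  auto I = std::find_if(B.begin(), B.end(), isTerminator);
  if (I != B.end() && I->Opcode == Op::ExecMovTerm) {
    I->Opcode = Op::ExecMov; // no longer a terminator
    ++I;                     // resets go after the EXEC write
  }
  // std::list::insert places the new element before I and keeps I valid.
  for (int Acc : Accumulators)
    B.insert(I, Instr{Op::AccReset, Acc});
}

// Helper for inspecting the resulting instruction order.
std::vector<Op> opcodes(const Block &B) {
  std::vector<Op> Out;
  for (const Instr &I : B)
    Out.push_back(I.Opcode);
  return Out;
}
```

In this model the resets end up after the EXEC write and before the branch, and the search for the insertion point runs once per block instead of once per accumulator.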

The idea of having write to EXEC as MovTermOpc breaks the new flow (non-SSA) as we need to insert the accumulator at the end of the BB, before the actual terminator instructions.

I don't understand this part, writing to EXEC as MovTermOpc seems independent from writing to the accumulator

The better approach would be to delay the insertion of EXEC write alongside the ACC reset routine. There could be challenges as we might not reset ACC all the time. However, if you knew earlier about the need for ACC reset in the block, we could handle them specially, and it can still be done without introducing MovTermOpc.

EXEC insertions happen in 2 stages: first for all divergent incoming BBs, then for secondary BBs (creating rejoin masks). In the 2nd stage, when we set EXEC to the computed rejoin masks, the getSaluInsertionAtEnd() function finds the insertion point by iterating from the first terminator (the MovTermOpc introduced by stage 1).
So this MovTermOpc acts as an anchor point, so that the EXEC=rejoinmask instruction(s) are added before the first EXEC write.
ACC reset instructions should come after all EXEC set instructions in any BB.
The ACCs to be reset are identified while processing stages 1 and 2 in the process() function, so retaining them in a separate data structure, AccumulatorResetBlocks, and adding the resets after both stages are complete is the cleanest approach.

I don't understand this part, writing to EXEC as MovTermOpc seems independent from writing to the accumulator

That's true, but the order of instructions breaks the verifier, since it sees ACC reset instructions after a MovTermOpc:

$exec = rejoin_mask
...
$exec = MOV_TERM %Acc
SI_WAVE_CF_EDGE implicit-def $scc
%Acc = S_MOV_B32 0 // Scalar operation after a TERM operation is invalid
S_CBRANCH_EXECZ %bb.x, implicit $exec
S_BRANCH %bb.y
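
The ordering rule behind that failure can be sketched as a small check: once a terminator has been seen in a block, every later instruction must also be a terminator. This is a simplified illustration and not the MachineVerifier code; `Instr` and `firstMisplacedInstr` are hypothetical names, and which instructions count as terminators is supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction record: printable text plus a terminator flag.
struct Instr {
  std::string Text;
  bool IsTerminator;
};

// Returns the index of the first non-terminator that follows a terminator,
// or -1 if all terminators form a suffix of the block.
int firstMisplacedInstr(const std::vector<Instr> &Block) {
  bool SeenTerminator = false;
  for (std::size_t Idx = 0; Idx < Block.size(); ++Idx) {
    if (Block[Idx].IsTerminator)
      SeenTerminator = true;
    else if (SeenTerminator)
      return static_cast<int>(Idx);
  }
  return -1;
}
```

Under this rule, a reset move placed after a terminator-form EXEC write is flagged, while the same sequence with the EXEC write demoted to a plain move passes.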

This innermost loop in your code currently identifies the insertion point for each ACC. That code should be moved outside the loop. Secondly, once you get the first terminator, there is no need for the while loop to identify the branch instruction. You only need to skip the instruction that writes to the EXEC mask. With the following, we can change MovTermOpc to MovOpc here as well.

Yes, this is a better approach, will incorporate this.

skganesan008 · 2025-12-12T21:23:15Z

!PSDB

z1-cciauto · 2025-12-12T21:23:37Z

PSDB Build Link: http://mlse-bdc-20dd129:8065/#/builders/10/builds/34

lalaniket8 · 2025-12-09T09:04:41Z

llvm/lib/Target/AMDGPU/GCNLaneMaskUtils.cpp

-    SSAUpdater.AddAvailableValue(
-        Info.Block,
-        (Info.Value && !(Info.Flags & ResetAtEnd)) ? Info.Merged : ZeroReg);
+    if(!Info.Value || (Info.Flags & ResetAtEnd))


I want to discuss further optimization here.

GCNLaneMaskUpdater::process() will process the BlockInfo for the following blocks:
X - The block for which we are computing EXEC mask
R - Set of preds of X in Reconverged CFG
T - Set of preds of X in Thread-level CFG

Info.Value is set for all blocks in T (via GCNLaneMaskUpdater::addAvailable() called from ControlFlowRewriter::rewrite())
ResetAtEnd is set for all blocks in R (via GCNLaneMaskUpdater::addReset() called from ControlFlowRewriter::rewrite())

The SSAUpdater marks the ZeroReg or MergedReg as available on the condition:
(Info.Value && !(Info.Flags & ResetAtEnd)) ? Info.Merged : ZeroReg

which translates to:
SSAUpdater.AddAvailableValue(x, MergedReg) for x \in T and x \notin R
SSAUpdater.AddAvailableValue(x, ZeroReg) for x \in R UNION (x \notin R and x \notin T)

The non-SSA approach uses a single Accumulator register to store the contributions from each block in T, instead of defining multiple Merged registers. This Accumulator is reset at the end of the blocks where SSAUpdater originally marked ZeroReg as available.

Therefore, we add instructions that reset the Accumulator to 0 at the end of each block x in: (x \in R) UNION (x \notin R and x \notin T)

I believe we can reduce this set further to just x \in R.
This should work because (x \notin R and \notin T) when not empty, corresponds to block X such that X \notin R and X \notin T.

X is directly preceded by blocks in R in the reconverged CFG.
Blocks in R will have Accumulator reset instruction at their end.
Therefore adding Accumulator reset instruction at end of X is redundant.
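
To make the two candidate reset sets easier to compare, here is a small set-based sketch. It assumes blocks are identified by integers and that the processed blocks are exactly X together with the members of R and T; `BlockSet`, `resetBlocksOriginal` and `resetBlocksProposed` are hypothetical names, and the sketch says nothing about whether dropping the reset at X is safe inside loops.

```cpp
#include <cassert>
#include <set>

// Blocks are identified by hypothetical integer ids.
using BlockSet = std::set<int>;

// Reset set mirroring the SSAUpdater condition: among the processed blocks
// (X, R and T), reset at every block that is in R or is not in T.
BlockSet resetBlocksOriginal(int X, const BlockSet &R, const BlockSet &T) {
  BlockSet Processed = R;
  Processed.insert(T.begin(), T.end());
  Processed.insert(X);

  BlockSet Out;
  for (int B : Processed)
    if (R.count(B) || !T.count(B))
      Out.insert(B);
  return Out;
}

// Proposed reduction: reset only at the reconverged-CFG predecessors.
BlockSet resetBlocksProposed(const BlockSet &R) { return R; }
```

With X = 5, R = {1, 2} and T = {2, 3}, the original set in this model is {1, 2, 5} and the proposed set is {1, 2}; the two differ only by X itself.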

Kindly let me know if this logic seems sound.

I think we may need to reset at the end of X when X is in the loop. I am not sure.

llvm/lib/Target/AMDGPU/GCNLaneMaskUtils.cpp

lalaniket8 · 2025-12-11T11:00:32Z

llvm/lib/Target/AMDGPU/GCNLaneMaskUtils.cpp

+        I--;
+      BuildMI(*B, I, {}, TII->get(LMU.getLaneMaskConsts().MovOpc), ACC)
+          .addImm(0);
+    }


No, they need to be inserted at the end of the basic blocks right before the first branch instruction.
When we first identify the inserts in the process() function, more instructions are yet to be added by later iterations in the basic block.
Doing it separately at the end saves us from iterating to the correct insertion point and is a cleaner and less expensive approach.

llvm/lib/CodeGen/MachineRegisterInfo.cpp

llvm/include/llvm/CodeGen/MachineRegisterInfo.h

z1-cciauto · 2025-12-16T12:46:15Z

PSDB Build Link: http://mlse-bdc-20dd129:8065/#/builders/10/builds/37

llvm/lib/Target/AMDGPU/AMDGPUWaveTransform.cpp

cdevadas · 2025-12-16T15:04:53Z

llvm/lib/Target/AMDGPU/AMDGPUWaveTransform.cpp

+      // Turning off this copy-chain optimization to retain the Accumulator as
+      // the PrimaryExec
+
+      // MachineInstr *PrimaryExecDef;


Commented-out code doesn't look good; better to clean it all up. What is the significance of adding the above comment here? Are you planning to implement a similar optimization for the ACC-based non-SSA form? If yes, leave a strong note mentioning that (you still need to clean up the commented-out code). Otherwise, remove the comment as well.

cdevadas · 2025-12-16T15:05:08Z

llvm/lib/Target/AMDGPU/AMDGPUWaveTransform.cpp

-        else if (PrimaryExecDef->getOperand(2).getReg() == LMC.ExecReg)
-          Rejoin = PrimaryExecDef->getOperand(1).getReg();
-      }
+      // Turning off this XOR optimiztion since buildMergeLaneMasks() will not


llvm/lib/Target/AMDGPU/AMDGPUWaveTransform.cpp

cdevadas · 2025-12-16T16:18:47Z

llvm/lib/CodeGen/MachineRegisterInfo.cpp

-  Register Reg, MachineBasicBlock &MBB, MachineBasicBlock::iterator I) const {
-  if(I == MBB.begin()) return MBB.end();
+    Register Reg, MachineBasicBlock &MBB, MachineBasicBlock::iterator I) const {
+  if (I == MBB.begin())


Working with iterators here instead of MachineInstr* actually makes calling this function simpler, since the call sites work with iterators and can pass end() or begin(). We implement the logic for handling end() once within this function, instead of in multiple places (at each call site).

A well-chosen function name conveys what the function is meant to do and what it returns.
Just like getVRegDef, getDomVRegDefInBasicBlock gives the impression that it returns the defining instruction (MI*, or nullptr if none is found).
I see only one recursive call (inside GCNLaneMaskAnalysis::isSubsetOfExec). If the function returned MI*, MI->getIterator() would easily give the iterator to pass to that recursive call.
The check if (I != MBB.end()) can easily become if (MI) at all the call sites after the function returns.

cdevadas · 2025-12-16T16:25:11Z

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

+  // Iterate backwards from I (exclusive) to the beginning of the basic block
+  do {
+    --I;
+    if (I->definesRegister(Reg, TRI))


Remove the additional argument TRI passed to the function. TRI can be null here. Like we talked about earlier, we see the full definition of the register in the instructions we are interested in. Or, in the worst case, you can pass this pointer, which is itself a TargetRegisterInfo*.

Wave Transform should generate non SSA Exec mask manipulation instrs

10c69e3

lalaniket8 changed the title ~~Wave Transform should generate non SSA Exec mask manipulation instrs~~ Wave Transform to generate SSA Exec mask manipulation instrs Dec 9, 2025

lalaniket8 marked this pull request as ready for review December 9, 2025 05:20

lalaniket8 requested review from cdevadas and nhaehnle December 9, 2025 05:21

cdevadas reviewed Dec 9, 2025

lalaniket8 requested review from cmc-rep, jmmartinez and vg0204 December 9, 2025 10:47

sanitized with git clang format and minor fixes

54b5f3b

cmc-rep reviewed Dec 9, 2025

cmc-rep reviewed Dec 10, 2025

Removed default mode and cleanup

7a31f7f

vg0204 reviewed Dec 11, 2025

This was referenced Dec 14, 2025

[CLOSED] Pre-commit preliminary tests to check for Non-SSA Exec mask instructions llvm/llvm-project#172201

Closed

Pre-commit preliminary tests to check for Non-SSA Exec mask instructions #845

Merged

lalaniket8 commented Dec 15, 2025

Move getDomVRegDefInBasicBlock() into SIRegisterInfo.cpp, replace MovTermOpc operations with MovOpc

35f7d2c

lalaniket8 commented Dec 16, 2025

cdevadas reviewed Dec 16, 2025

lalaniket8 added a commit that referenced this pull request Dec 18, 2025

Pre-commit preliminary tests to check for Non-SSA Exec mask instructions (#845) Pre-commit to check for exec mask instruction changes caused by #789

3f8e199
