
@ronlieb ronlieb commented Jan 9, 2026

No description provided.

Stylie777 and others added 30 commits January 9, 2026 10:34
Following on from the work to implement MLIR -> LLVM IR Translation for
Taskloop, this adds support for the following clauses to be used
alongside taskloop:
- if
- grainsize
- num_tasks
- untied
- nogroup
- final
- mergeable
- priority

These clauses work directly through the relevant OpenMP runtime
functions, so their information only needed to be collected from the
relevant location and passed through to the appropriate runtime
function.

Remaining clauses retain their TODO message as they have not yet been
implemented.
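
For illustration, a minimal C++ sketch, not taken from the patch, that combines several of the now-supported clauses on a taskloop (grainsize and num_tasks are mutually exclusive, so only grainsize appears; all names are hypothetical):
```cpp
#include <cstddef>

// Scale an array, splitting the iterations into tasks of ~64 iterations.
void scale(double *a, std::size_t n, double s, bool par) {
#pragma omp taskloop if(par) grainsize(64) untied nogroup \
    final(n < 128) mergeable priority(2)
  for (std::size_t i = 0; i < n; ++i)
    a[i] *= s;
}
```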
…m#172056)

Due to a previous PR (llvm#171227),
operations like `_mm_ceil_sd` compile to suboptimal assembly:
```asm
roundsd xmm1, xmm1, 10
blendpd xmm0, xmm1, 1
```
This PR introduces a rewrite pattern to mitigate this, and fuse the corresponding operations.
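
For context, a small usage of the intrinsic in question (SSE4.1); with the new rewrite pattern this should ideally lower to a single `roundsd` without the trailing `blendpd`:
```cpp
#include <smmintrin.h>  // SSE4.1

// Low lane: ceil(b[0]); high lane: passed through from a.
__m128d ceil_low(__m128d a, __m128d b) {
  return _mm_ceil_sd(a, b);
}
```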
…vm#174738)

The current check for whether we're allowed to use a NEON copy is based on
the function attributes, which works most of the time. However, in one
particular case where a normal function calls a streaming one, there's a
window of time where we enable SM at the call site and then emit a copy for
an outgoing parameter. This copy was lowered to a NEON move, which is illegal.

There's also another case where we could end up generating these,
related to zero cycle move tuning features.

Both of these cases are fixed in this patch by walking back from the copy
to look for any streaming-mode changes (within the current block). I know
this is pretty ugly, but I don't have a better solution right now.

rdar://167439642
…lvm#174588)

Don't allocate a task context structure if none of the private variables
needed it. This was already skipped when there were no private variables
at all.
Introducing the notion of a minimum header version has multiple
benefits. It allows us to merge a bunch of ABI macros into a single one.
This makes configuring the library significantly easier, since, for a
stable ABI, you only need to know which version you started distributing
the library with, instead of checking which ABI flags have been
introduced at what point. For platforms which have a moving window of
the minimum version a program has been compiled against, this also makes
it simple to remove symbols from the dylib when they can't be used by
any program anymore.
The variant benchmarks are currently very slow to compile and run. This
is because they are extremely exhaustive. That is usually a good thing,
but here the exhaustiveness makes it prohibitive to actually run the
benchmarks. Even the new, heavily reduced set still takes almost
40 seconds just to compile on my system.
… types (llvm#162438)

Currently, `deque` and `vector`'s `append_range` is implemented in terms
of `insert_range`. The problem with that is that `insert_range` has more
preconditions, resulting in us rejecting valid code.

This also significantly improves performance for `deque` in some cases.
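
For reference, a minimal C++23 usage of the member in question:
```cpp
#include <ranges>
#include <vector>

int main() {
  std::vector<int> v{1, 2, 3};
  v.append_range(std::views::iota(4, 7));  // v == {1, 2, 3, 4, 5, 6}
}
```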
…tribute macro (llvm#174964)

Currently `_LIBCPP_OVERRIDABLE_FUNCTION` takes the return type, function
name and argument list, but simply constructs the function and adds
attributes without modifying the signature in any way. We can replace
this with a normal attribute macro, which makes the signature easier to
read and makes it clearer what's actually going on. Since it's an
internal macro, we can also drop the `_LIBCPP_` prefix.
The original crash was observed in Chromium [1]. The problem occurred in
elf::isAArch64BTILandingPad because it didn't handle synthetic sections,
which can have a nullptr buf, so it crashed while trying to read
that buf.

After fixing that, a second issue occurs: when the patched code grows
too much, it gets far away from the short jump, and the current
implementation assumes an R_AARCH64_JUMP26 will be enough.

This PR changes the implementation to:
(a) in isAArch64BTILandingPad, check whether a section is synthetic and
assume that it will NOT contain a landing pad, avoiding the buffer check;
(b) suppress the size rounding for thunks that precede a section
(making the situation less likely to happen);
(c) reimplement the patch using an R_AARCH64_ABS64 in case the
patched code is still far away.

[1] https://issues.chromium.org/issues/440019454

---------

Co-authored-by: Tarcisio Fischer <tarcisio.fischer@arm.com>
…74956)

`__builtin_mul_overflow` does the right thing, even for `char` and
`short`, so the overloads for these types can simply be dropped. We can
also merge the remaining two overloads into a single one now, since we
don't do any dispatching for `char` and `short` anymore.
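
A small sketch of why the narrow-type overloads are unnecessary:
```cpp
// Returns true on overflow. __builtin_mul_overflow computes the product
// in infinite precision and checks whether it fits the destination
// type, so short works correctly despite integer promotion.
bool mul_checked(short a, short b, short *out) {
  return __builtin_mul_overflow(a, b, out);
}
```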
…#175148)

Test is failing intermittently after llvm#174944. The issue this time is that
the `WSeqPair`/`XSeqPair` tests fail if the same pair is used, since there
are fewer MOVs.

The test was expecting:
```
  0000000000000000 <foo>:
         0: f81e0ffb      str     x27, [sp, #-0x20]!
         4: a90163fa      stp     x26, x24, [sp, #0x10]
         8: d2800006      mov     x6, #0x0                // =0
         c: d2800007      mov     x7, #0x0                // =0
        10: d280001a      mov     x26, #0x0               // =0
        14: d280001b      mov     x27, #0x0               // =0
        18: d2800018      mov     x24, #0x0               // =0
        1c: 48267f1a      casp    x6, x7, x26, x27, [x24]
```
but this can occur:
```
  0000000000000000 <foo>:
         0: f81e0ffb      str     x27, [sp, #-0x20]!
         4: a90153f5      stp     x21, x20, [sp, #0x10]
         8: d2800014      mov     x20, #0x0               // =0
         c: d2800015      mov     x21, #0x0               // =0
        10: d280001b      mov     x27, #0x0               // =0
        14: 48347f74      casp    x20, x21, x20, x21, [x27]
```
…/minnum (llvm#174806)

Fixes llvm#173270

For the x86 SSE/AVX floating-point MAX/MIN intrinsics, attempt to simplify
them into `Intrinsic::maxnum` and `Intrinsic::minnum`, given that we
can verify that the inputs are only PosNormal, NegNormal, or PosZero.
This PR uses `llvm::computeKnownFPClass` to generate the FPClass
bitset and verify that the inputs cannot be any of the other FP classes
(NaN, Inf, Subnormal, NegZero).
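
A hedged sketch of a source pattern where the fold could apply; whether it actually fires depends on what `computeKnownFPClass` can prove about the operands:
```cpp
#include <cmath>
#include <xmmintrin.h>

// fabs rules out negative values and -0.0; NaN, Inf, and subnormal
// inputs would additionally have to be excluded (which is what the
// computeKnownFPClass query checks) for the relaxation to be legal.
float max_abs(float x, float y) {
  __m128 a = _mm_set_ss(std::fabs(x));
  __m128 b = _mm_set_ss(std::fabs(y));
  return _mm_cvtss_f32(_mm_max_ss(a, b));
}
```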
Implementation follows exactly what is done for omp.wsloop and omp.task.
See llvm#137841.

The change to the operation verifier is to allow a taskgroup
cancellation point inside of a taskloop. This was already allowed for
omp.cancel.
Tests fail to link when using the LLVM C++ library. Disable the tests
until they can be investigated and the underlying cause identified and
fixed.
…172829)

Generalise the Hexagon cmdline options that control whether memset, memcpy, or memmove intrinsics should be inlined versus calling library functions, so they can be used by all backends:

- -max-store-memset
- -max-store-memcpy
- -max-store-memmove

These flags override the target-specific defaults set in TargetLowering (e.g., MaxStoresPerMemcpy) and allow fine-tuning of the inlining threshold for performance analysis and optimization.

The optsize variants (-max-store-memset-Os, -max-store-memcpy-Os, -max-store-memmove-Os) from the Hexagon backend were removed, and now the above options control both.

The threshold is specified as a number of store operations, which is backend-specific. Operations requiring more stores than the threshold will call the corresponding library function instead of being inlined.
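
As an illustration, whether a fixed-size copy like the following is expanded inline or lowered to a libcall is now governed by the generic threshold (function name is hypothetical):
```cpp
#include <cstring>

// A fixed 64-byte copy is typically 8 x i64 stores; moving the
// -max-store-memcpy threshold across that count flips this between an
// inline store sequence and a call to the memcpy library function.
void copy64(void *dst, const void *src) {
  std::memcpy(dst, src, 64);
}
```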
Avoid confusion with upcoming generic clmul intrinsic handling
When emitting a warning about a homonymous generic interface and
procedure, the source locations of the interface and the procedure were
being compared to find the one that occurred later in the source file.

The problem is that they could be in different source/module files,
which makes the comparison invalid.

Fix it by using parser::AllCookedSources::Precedes() instead, which
correctly handles names in different source files.
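
A minimal sketch of the corrected comparison, with hypothetical names:
```cpp
#include "flang/Parser/provenance.h"

// Compare two names' source positions; valid even when they come from
// different source/module files, unlike raw position comparison.
static bool ComesFirst(const Fortran::parser::AllCookedSources &cooked,
                       Fortran::parser::CharBlock a,
                       Fortran::parser::CharBlock b) {
  return cooked.Precedes(a, b);
}
```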
llvm#173280)

Update `NVVMTargetAttr` builder in `NVVMOps.td` to use `$_get` instead
of `Base::get`.

Now the auto-generated parser calls `getChecked`, allowing graceful
error handling for invalid parameters (e.g., `O=4`) instead of crashing
with an assertion failure.

Add a regression test in
`mlir/test/Dialect/LLVMIR/nvvm-target-invalid.mlir`.

Fixes: llvm#130014
Fixes an attribute mismatch error in `AllocTokenPass` that occurs during
ThinLTO builds at OptimizationLevel::O0.

The `getTokenAllocFunction` in `AllocTokenPass` was incorrectly copying
attributes from the instrumented function (`Callee`) to an *existing*
`void()` alloc-token function retrieved by `Mod.getOrInsertFunction`.
This resulted in arg attributes being added to a function with no
parameters, causing `VerifyPass` to fail with "Attribute after last
parameter!".

The fix modifies `getTokenAllocFunction` to pass the `Callee`'s
attributes directly to the `Mod.getOrInsertFunction` overload. This
ensures attributes are only applied when the alloc-token function is
*newly inserted*, preventing unintended attribute modifications on
already existing function declarations.

See https://g-issues.chromium.org/issues/474289092 for detailed
reproduction steps and analysis.

Co-authored-by: Ayumi Ono <ayumiohno@google.com>
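
A simplified sketch of the shape of the fix (not the exact pass code; the helper name is abbreviated):
```cpp
#include "llvm/IR/Module.h"

using namespace llvm;

// Hand the attributes to getOrInsertFunction so they are applied only
// if the declaration is newly inserted, never copied onto a
// pre-existing declaration with a different signature.
static FunctionCallee getTokenAllocFn(Module &M, StringRef Name,
                                      FunctionType *FTy,
                                      AttributeList Attrs) {
  return M.getOrInsertFunction(Name, FTy, Attrs);
}
```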
…ve' from g++ compiler and remove unsupported floating-point data. (llvm#174915)

When building the flang-rt project with the g++ compiler on a Linux-X86_64
machine, the compiler gives the following warning:

```
llvm-project/flang-rt/lib/runtime/extensions.cpp:455:26: warning: left shift count is negative [-Wshift-count-negative]
   455 |     mask = ~(unsigned)0u << ((8 - digits) * 4 + 1);
       |            ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~

```

For the full discussion, see:
llvm#173955

Co-authored-by: liao jun <liaojun@ultrarisc.com>
Signed-off-by: Jonas Rickert <jonas.rickert@amd.com>
At the time ParseAttributeArgumentList is called, the first argument
of an attribute may have already been parsed. We need to take this into
account when accessing the ParsedAttributeArgumentsProperties mask, which
specifies which of the attribute arguments are string literals.

Pull Request: llvm#171017
…vm#173983)

Remove a helper function and query the `RegionBranchOpInterface`
instead (which does the same thing). Also add a TODO for a bug in the
implementation of `SliceWalk.cpp`. (The bug is not fixed yet.)
Includes one manual fix to add -filetype=null to a RUN line in
test/MC/AMDGPU/gfx1250_asm_sop1.s. Everything else is autogenerated.
…from being marked as partial maps (llvm#175133)

The following test was triggering a runtime crash **on the host before
launching the kernel**:
```fortran
program test_omp_target_map_bug_v5
  implicit none
  type nested_type
    real, allocatable :: alloc_field(:)
  end type nested_type

  type nesting_type
    integer :: int_field
    type(nested_type) :: derived_field
  end type nesting_type

  type(nesting_type) :: config

  allocate(config%derived_field%alloc_field(1))

  !$OMP TARGET ENTER DATA MAP(TO:config, config%derived_field%alloc_field)

  !$OMP TARGET
  config%derived_field%alloc_field(1) = 1.0
  !$OMP END TARGET

  deallocate(config%derived_field%alloc_field)
end program test_omp_target_map_bug_v5
```

In particular, the runtime was producing a segmentation fault when the
test was compiled with any optimization level > 0; when compiled with
-O0, the sample ran fine.

After debugging the runtime, it turned out the crash was happening at
the point where the runtime calls the default mapper emitted by the
compiler for `nesting_type`; in particular at this point in the runtime:
https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/offload/libomptarget/omptarget.cpp#L307.

Bisecting the optimization pipeline using `-mllvm -opt-bisect-limit=N`,
the first pass that triggered the issue on `O1` was the `instcombine`
pass. Debugging this further, the issue narrowed down to the
canonicalization of `getelementptr` instructions from using struct types
(in this case the `nesting_type` in the sample above) to using byte
addressing (`i8`). In particular, at `O0`, you would see something like this:
```llvm
define internal void @.omp_mapper._QQFnesting_type_omp_default_mapper(ptr noundef %0, ptr noundef %1, ptr noundef %2, i64 noundef %3, i64 noundef %4, ptr noundef %5) #6 {
entry:
  %6 = udiv exact i64 %3, 56
  %7 = getelementptr %_QFTnesting_type, ptr %2, i64 %6
  ....
}
```

At `O1`, after `instcombine` runs, the same mapper is canonicalized to byte addressing:
```llvm
define internal void @.omp_mapper._QQFnesting_type_omp_default_mapper(ptr noundef %0, ptr noundef %1, ptr noundef %2, i64 noundef %3, i64 noundef %4, ptr noundef %5) #6 {
entry:
  %6 = getelementptr i8, ptr %2, i64 %3
  ....
}
```

The `udiv exact` instruction emitted by the OMP IR Builder (see:
https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp#L9154)
allows `instcombine` to assume that `%3` is divisible by the struct size
(here `56`) and, therefore, to replace the result of the division with a
direct GEP on `i8` rather than the struct type.

However, the runtime was calling
`@.omp_mapper._QQFnesting_type_omp_default_mapper` not with `56` (the
proper struct size) but with `48`!

Debugging this further, I found that the size of the `omp.map.info`
operation to which the default mapper is attached comes out as `48`
because we mark the map as partial (see:
https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/flang/lib/Optimizer/OpenMP/MapInfoFinalization.cpp#L1146
and
https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp#L4501-L4512).

However, I think this is incorrect: emitted mappers (and user-defined
mappers in general) are defined on the whole struct type and should
never be marked as partial. Hence the fix in this PR.
StringMap duplicates the option name into a new allocation for every
option, which is unnecessary. Instead, we can key a DenseMap with the
same StringRef that the Option already uses. This reduces the number
of allocations when loading libLLVM.
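
Schematically, assuming a registry keyed by option name (simplified; not the exact code):
```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/Support/CommandLine.h"

// Key the registry by the StringRef the Option already owns (ArgStr)
// instead of a StringMap, which copies every key into its own allocation.
static llvm::DenseMap<llvm::StringRef, llvm::cl::Option *> Registry;

void registerOption(llvm::cl::Option *O) {
  Registry[O->ArgStr] = O;  // no per-option string duplication
}
```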
…ec (llvm#175152)

In ScalarExprEmitter::EmitScalarPrePostIncDec we create ConstantInt
values that are either 1 or -1. There is a special case when the type is
i1 (e.g. for unsigned _BitInt(1)), where we need to be able to create an
"i1 true" value for both inc and dec.

To avoid triggering the assertions added by the pull request llvm#171456 we
now treat the ConstantInt as unsigned for increments and as signed for
decrements.
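
For illustration, the degenerate type involved, using Clang's `_BitInt` extension:
```cpp
// In one bit, +1 and -1 are the same pattern, so both operations end up
// adding "i1 true" in IR. The constant is created as unsigned for the
// increment and signed for the decrement.
void bump(unsigned _BitInt(1) &b) { ++b; }
void drop(unsigned _BitInt(1) &b) { --b; }
```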
RKSimon and others added 22 commits January 9, 2026 14:48
… omp.simd (llvm#174916)

This PR adds additional checks and tests for the linear clause on omp.wsloop
and omp.simd (both standalone and composite). For composite simd
constructs, the translation to LLVMIR uses the same
`LinearClauseProcessor` under `convertOmpSimd`, as already present in
previous PRs like llvm#150386 and
llvm#139386
Co-authored-by: Pranav Kant <prka@google.com>
Summary:
llvm#174862 and
llvm#174655 provided the intrinsics
required to get the fundamental operations working for these. This patch
sets up the basic support (as far as I know).

This should be the first step towards allowing SPIR-V to build things
like the LLVM libc and the OpenMP Device Runtime Library. The
implementations here are intentionally inefficient, such as not using
the dedicated SPIR-V opcode for read firstlane. This is just a starting
point, so that we can begin testing things later.

Would appreciate someone more familiar with the backend double-checking
these.
They have been deprecated for more than five years in favor of !getdagop
and !setdagop. See https://reviews.llvm.org/D89814.
We haven't yet decided what we want the `optional::iterator` type to be
in the end, so let's make it experimental for now so that we don't
commit to an ABI yet.
In the `= {"foo"}` case, we don't have an array filler we can use, and we
need to explicitly zero the remaining elements.
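
Presumably the shape of initializer involved (a hypothetical reduction):
```cpp
// The string literal covers only the first four elements; with no
// array filler available, elements 4..7 must be explicitly zeroed.
char buf[8] = {"foo"};
```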
…m#175176)

getType() returns just int for those instead of an array type, so the
previous condition resulted in the array index being missing from the
APValue's LValuePath.
This PR reduces outliers in runtime performance by asking the
OS to prefetch memory-mapped input files in advance, as early as
possible. I have implemented the Linux side as well; however, I have only
tested this on Windows 11 version 24H2, with an active security stack
enabled. The machine is an AMD Threadripper PRO 3975WX 32c/64t with 128
GB of RAM and a Samsung 990 PRO SSD.

I have used an Unreal Engine-based game to profile the link times. Here's
a quick summary of the input data:
```
                                    Summary
--------------------------------------------------------------------------------
               4,169 Input OBJ files (expanded from all cmd-line inputs)
      26,325,429,114 Size of all consumed OBJ files (non-lazy), in bytes
                   9 PDB type server dependencies
                   0 Precomp OBJ dependencies
         350,516,212 Input debug type records
      18,146,407,324 Size of all input debug type records, in bytes
          15,709,427 Merged TPI records
           4,747,187 Merged IPI records
              56,408 Output PDB strings
          23,410,278 Global symbol records
          45,482,231 Module symbol records
           1,584,608 Public symbol records
```

In normal conditions (meaning all the pages are already in RAM), this
PR has no noticeable effect:
```
>hyperfine "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.689 s ±  0.550 s    [User: 259.873 s, System: 37.936 s]
  Range (min … max):   29.026 s … 30.880 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     29.594 s ±  0.342 s    [User: 261.434 s, System: 62.259 s]
  Range (min … max):   29.209 s … 30.171 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.00 ± 0.02 times faster than before\lld-link.exe @Game.exe.rsp
```

However, in production conditions, we're typically working with the
Unreal Engine Editor, with external DCC tools like Maya and Houdini; we have
several instances of Visual Studio open, VSCode with Rust Analyzer, etc.
All this means that between code-change iterations, most of the input
OBJ files might have already been evicted from the Windows RAM cache.
Consequently, in the following test, I've simulated the worst-case
condition by evicting all data from RAM with
[RAMMap64](https://learn.microsoft.com/en-us/sysinternals/downloads/rammap)
(i.e. `RAMMap64.exe -E[wsmt0]` with a 5-sec sleep at the end to ensure
the System thread actually has time to evict the pages):
```
>hyperfine -p cleanup.bat "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     48.124 s ±  1.770 s    [User: 269.031 s, System: 41.769 s]
  Range (min … max):   46.023 s … 50.388 s    10 runs

Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
  Time (mean ± σ):     34.192 s ±  0.478 s    [User: 263.620 s, System: 40.991 s]
  Range (min … max):   33.550 s … 34.916 s    10 runs

Summary
  with_pr\lld-link.exe @Game.exe.rsp ran
    1.41 ± 0.06 times faster than before\lld-link.exe @Game.exe.rsp
```

This is similar to the work done in MachO in
llvm#157917
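
On the Linux side this presumably amounts to an `madvise` readahead hint on the mapping; a minimal sketch, assuming the file is already mapped:
```cpp
#include <cstddef>
#include <sys/mman.h>

// Hint the kernel to start paging in the mapped file asynchronously,
// so first-touch faults during linking hit already-resident pages.
void prefetchMapping(void *base, std::size_t len) {
  (void)madvise(base, len, MADV_WILLNEED);
}
```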
…17543)

For some targets, it is required to identify the COPY instructions that
correspond to RA-inserted live-range splits. Add the new
flag `MachineInstr::LRSplit` to serve this purpose.
The COPYs inserted for live-range splitting during the sgpr-regalloc
pipeline currently break the BB prolog during the subsequent
vgpr-regalloc phase while spilling and/or splitting the vector
live ranges. This patch fixes it by correctly including the
LR-split instructions from the sgpr-regalloc and wwm-regalloc
pipelines in the BB prolog.
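
Schematically, the flag travels on the instruction like the existing MI flags (usage simplified; only the `LRSplit` enumerator is from the patch):
```cpp
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

// Tag a COPY inserted by RA for a live-range split, and detect it
// later (e.g. when deciding what belongs to the BB prolog).
void tagSplitCopy(MachineInstr &CopyMI) {
  CopyMI.setFlag(MachineInstr::LRSplit);
}
bool isSplitCopy(const MachineInstr &MI) {
  return MI.getFlag(MachineInstr::LRSplit);
}
```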
These tests redirected stderr to stdout, but never actually checked for
any errors.
Summary:
This is only really meaningful for the NVPTX target. Not all build
environments support host LTO, and these are redundant tests, so just clean
this up and make it run faster.
Changes this to getSigned() to match the signedness of the calculation.
However, we still need to allow truncation because the addition
result may overflow, and the operation is specified to truncate
in that case.

Fixes llvm#175159.
This patch teaches clang-tblgen to start emitting ABI lowering pattern
declarations.
…ic is the only user (llvm#172723)

Closes llvm#172176.

Previously, `FoldOpIntoSelect` wouldn't fold multi-use selects unless
`MultiUse` was explicitly true. This prevented useful folding when the
select is used multiple times in the same intrinsic call. Similar to
what is done in `foldOpIntoPhi`, we now check that all of the uses
come from a single user, rather than checking that there is only one
use.
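
An illustrative source pattern (hypothetical) where the select has two uses but a single user:
```cpp
#include <cmath>

// Schematic IR:
//   %s = select i1 %c, double %x, double %y
//   %r = call double @llvm.fma.f64(double %s, double %s, double 1.0)
// Two uses of %s, but a single user, so the fold can now fire.
double fold_candidate(bool c, double x, double y) {
  double s = c ? x : y;
  return std::fma(s, s, 1.0);
}
```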
The current Zilsd optimizer only supports a base operand that is in a register,
but many use cases are essentially stack loads/stores.
Signed-off-by: Philip Wilkinson <philip.wilkinson@arm.com>
@ronlieb ronlieb requested review from a team and dpalermo January 9, 2026 17:02
@z1-cciauto z1-cciauto merged commit 91809a4 into amd-staging Jan 9, 2026
41 of 43 checks passed
@z1-cciauto z1-cciauto deleted the amd/merge/upstream_merge_20260109103130 branch January 9, 2026 22:07