merge main into amd-staging #1050
Merged: z1-cciauto merged 54 commits into amd-staging from amd/merge/upstream_merge_20260109103130 on Jan 9, 2026
Conversation
Following on from the work to implement MLIR -> LLVM IR translation for taskloop, this adds support for the following clauses to be used alongside taskloop:
- if
- grainsize
- num_tasks
- untied
- nogroup
- final
- mergeable
- priority

These clauses work directly through the relevant OpenMP runtime functions, so their information just needed to be collected from the relevant location and passed through to the appropriate runtime function. The remaining clauses retain their TODO messages as they have not yet been implemented.
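For reference, a minimal C/C++ illustration of the now-supported clauses on a taskloop construct (the PR itself covers the Flang/MLIR path; the clause values here are arbitrary):
```cpp
// Minimal illustration of the clauses now translated for taskloop; values
// are arbitrary and the equivalent Fortran directive behaves the same.
void scale(float *a, int n, bool cond) {
#pragma omp taskloop if(cond) grainsize(64) untied mergeable priority(3) nogroup
  for (int i = 0; i < n; ++i)
    a[i] *= 2.0f;
}
```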
…m#172056) Due to a previous PR (llvm#171227), operations like `_mm_ceil_sd` compile to suboptimal assembly:
```asm
roundsd xmm1, xmm1, 10
blendpd xmm0, xmm1, 1
```
This PR introduces a rewrite pattern to mitigate this and fuse the corresponding operations.
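For reference, a reproducer sketch for the affected intrinsic (SSE4.1; the function name is illustrative):
```cpp
#include <smmintrin.h>

// _mm_ceil_sd(a, b): ceil of the low lane of b, upper lane taken from a.
// With the fused pattern this should lower to a single
//   roundsd xmm0, xmm1, 10
// rather than a roundsd + blendpd pair.
__m128d ceil_low(__m128d a, __m128d b) { return _mm_ceil_sd(a, b); }
```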
…vm#174738) The current check for whether we're allowed to use a NEON copy works based on the function attributes, which works most of the time. However, in one particular case where a normal function calls a streaming one, there's a window of time where we enable SM at the call site and then emit a copy for an outgoing parameter. This copy was lowered to a NEON move, which is illegal. There's also another case where we could end up generating these, related to zero cycle move tuning features. Both of these cases are fixed in this patch by walking back from the copy to look for any streaming mode changes (within the current block). I know this is pretty ugly, but I don't have a better solution right now. rdar://167439642
…lvm#174588) Don't allocate a task context structure if none of the private variables needed it. This was already skipped when there were no private variables at all.
Introducing the notion of a minimum header version has multiple benefits. It allows us to merge a bunch of ABI macros into a single one. This makes configuring the library significantly easier, since, for a stable ABI, you only need to know which version you started distributing the library with, instead of checking which ABI flags have been introduced at what point. For platforms which have a moving window of the minimum version a program has been compiled against, this also makes it simple to remove symbols from the dylib when they can't be used by any program anymore.
The variant benchmarks are currently very slow to compile and run because they are extremely exhaustive. Exhaustiveness is usually a good thing, but here it makes it prohibitive to actually run the benchmarks. Even the new, heavily reduced set still takes almost 40 seconds just to compile on my system.
… types (llvm#162438) Currently, `deque` and `vector`'s `append_range` is implemented in terms of `insert_range`. The problem with that is that `insert_range` has more preconditions, resulting in us rejecting valid code. This also significantly improves performance for `deque` in some cases.
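For reference, the affected member function in use (assuming a C++23 standard library):
```cpp
#include <deque>
#include <vector>

int main() {
  std::vector<int> src{1, 2, 3};
  std::deque<int> d{0};
  // append_range has fewer preconditions than insert_range and is no
  // longer implemented in terms of it after this change.
  d.append_range(src); // d == {0, 1, 2, 3}
  return 0;
}
```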
…tribute macro (llvm#174964) Currently `_LIBCPP_OVERRIDABLE_FUNCTION` takes the return type, function name and argument list, but simply constructs the function and adds attributes without modifying the signature in any way. We can replace this with a normal attribute macro, making the signature easier to read and clarifying what's actually going on. Since it's an internal macro, we can also drop the `_LIBCPP_` prefix.
Original crash was observed in Chromium, in [1]. The problem occurs in elf::isAArch64BTILandingPad because it didn't handle synthetic sections, which can have a nullptr buf, so it crashed while trying to read that buf. After fixing that, a second issue occurs: when the patched code grows too much, it gets far away from the short jump, and the current implementation assumes a R_AARCH64_JUMP26 will be enough. This PR changes the implementation to: (a) in isAArch64BTILandingPad, check whether a section is synthetic and assume that it will NOT contain a landing pad, avoiding the buffer check; (b) suppress the size rounding for thunks that precede a section (making the situation less likely to happen); (c) reimplement the patch by using a R_AARCH64_ABS64 in case the patched code is still far away. [1] https://issues.chromium.org/issues/440019454 --------- Co-authored-by: Tarcisio Fischer <tarcisio.fischer@arm.com>
…74956) `__builtin_mul_overflow` does the right thing, even for `char` and `short`, so the overloads for these types can simply be dropped. We can also merge the remaining two overloads into a single one now, since we don't do any dispatching for `char` and `short` anymore.
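For reference, the builtin already handles the narrow types correctly despite integer promotion; a small example:
```cpp
#include <cstdio>

int main() {
  short a = 30000, b = 30000;
  short out;
  // __builtin_mul_overflow computes the infinitely precise product and
  // reports whether it fits in the result type, so it is correct even
  // for char and short operands.
  if (__builtin_mul_overflow(a, b, &out))
    std::puts("overflowed short");
  return 0;
}
```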
…#175148) Test is failing intermittently after llvm#174944. The issue this time is that the `WSeqPair`/`XSeqPair` tests fail if the same pair is used, as there are fewer MOVs. The test was expecting:
```
0000000000000000 <foo>:
       0: f81e0ffb      str x27, [sp, #-0x20]!
       4: a90163fa      stp x26, x24, [sp, #0x10]
       8: d2800006      mov x6, #0x0 // =0
       c: d2800007      mov x7, #0x0 // =0
      10: d280001a      mov x26, #0x0 // =0
      14: d280001b      mov x27, #0x0 // =0
      18: d2800018      mov x24, #0x0 // =0
      1c: 48267f1a      casp x6, x7, x26, x27, [x24]
```
but this can occur:
```
0000000000000000 <foo>:
       0: f81e0ffb      str x27, [sp, #-0x20]!
       4: a90153f5      stp x21, x20, [sp, #0x10]
       8: d2800014      mov x20, #0x0 // =0
       c: d2800015      mov x21, #0x0 // =0
      10: d280001b      mov x27, #0x0 // =0
      14: 48347f74      casp x20, x21, x20, x21, [x27]
```
…/minnum (llvm#174806) Fixes llvm#173270 For x86 SSE/AVX floating point MAX/MIN intrinsics, attempt to generalize them down into `Intrinsic::maxnum` and `Intrinsic::minnum`, given that we can verify that the inputs are one of (PosNormal, NegNormal, PosZero). This PR uses `llvm::computeKnownFPClass` to generate the FPClass bitset and verify that the inputs are not of the other FP types (NaN, Inf, Subnormal, NegZero).
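A scalar sketch of why those constraints are needed (illustrative names):
```cpp
#include <cassert>
#include <cmath>

// Scalar model of x86 MAXSD: "a > b ? a : b". The else branch also covers
// NaN and max(+0.0, -0.0), so the result depends on operand order, unlike
// llvm.maxnum, which always prefers the non-NaN operand. That is why the
// fold must first prove the inputs are PosNormal/NegNormal/PosZero.
static double x86_max(double a, double b) { return a > b ? a : b; }

int main() {
  assert(x86_max(NAN, 1.0) == 1.0);      // NaN in first operand: b wins
  assert(std::isnan(x86_max(1.0, NAN))); // NaN in second operand: b wins
  assert(std::fmax(1.0, NAN) == 1.0);    // maxnum semantics: number wins
  return 0;
}
```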
Implementation follows exactly what is done for omp.wsloop and omp.task. See llvm#137841. The change to the operation verifier is to allow a taskgroup cancellation point inside of a taskloop. This was already allowed for omp.cancel.
Tests fail to link when using the LLVM C++ library. Disabling the tests until they can be investigated and the underlying cause identified and fixed.
…172829) Generalise the Hexagon cmdline options that control whether memset, memcpy, or memmove intrinsics should be inlined versus calling library functions, so they can be used by all backends:
- -max-store-memset
- -max-store-memcpy
- -max-store-memmove

These flags override the target-specific defaults set in TargetLowering (e.g., MaxStoresPerMemcpy) and allow fine-tuning of the inlining threshold for performance analysis and optimization. The optsize variants (-max-store-memset-Os, -max-store-memcpy-Os, -max-store-memmove-Os) from the Hexagon backend were removed; the above options now control both. The threshold is specified as a number of store operations, which is backend-specific. Operations requiring more stores than the threshold will call the corresponding library function instead of being inlined.
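A small illustration of what the threshold controls (names are illustrative; exact store counts are backend-specific):
```cpp
#include <cstring>

struct Pair { long a, b; }; // 16 bytes on a typical 64-bit target

// A 16-byte copy typically lowers to two 8-byte stores. With
// -max-store-memcpy=2 (or higher) the call below may be inlined as
// stores; with -max-store-memcpy=1 it should remain a memcpy libcall.
void copyPair(Pair *dst, const Pair *src) {
  std::memcpy(dst, src, sizeof(Pair));
}
```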
Avoid confusion with upcoming generic clmul intrinsic handling
When emitting a homonymous generic interface and procedure warning, the source locations of the interface and the procedure were being compared to find the one that occurred later in the source file. The problem is that they could be in different source/module files, which makes the comparison invalid. Fix it by using parser::AllCookedSources::Precedes() instead, which correctly handles names in different source files.
llvm#173280) Update `NVVMTargetAttr` builder in `NVVMOps.td` to use `$_get` instead of `Base::get`. Now the auto-generated parser calls `getChecked`, allowing graceful error handling for invalid parameters (e.g., `O=4`) instead of crashing with an assertion failure. Add a regression test in `mlir/test/Dialect/LLVMIR/nvvm-target-invalid.mlir`. Fixes: llvm#130014
Fixes an attribute mismatch error in `AllocTokenPass` that occurs during ThinLTO builds at OptimizationLevel::O0. The `getTokenAllocFunction` in `AllocTokenPass` was incorrectly copying attributes from the instrumented function (`Callee`) to an *existing* `void()` alloc-token function retrieved by `Mod.getOrInsertFunction`. This resulted in arg attributes being added to a function with no parameters, causing `VerifyPass` to fail with "Attribute after last parameter!". The fix modifies `getTokenAllocFunction` to pass the `Callee`'s attributes directly to the `Mod.getOrInsertFunction` overload. This ensures attributes are only applied when the alloc-token function is *newly inserted*, preventing unintended attribute modifications on already existing function declarations. See https://g-issues.chromium.org/issues/474289092 for detailed reproduction steps and analysis. Co-authored-by: Ayumi Ono <ayumiohno@google.com>
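A rough sketch of the shape of the fix (not the exact patch; names abbreviated), using the `Module::getOrInsertFunction` overload that takes an `AttributeList`:
```cpp
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Module.h"

// Sketch only: this overload applies the attributes only when the
// declaration is newly created, so an already existing declaration
// keeps its own attributes instead of inheriting the callee's.
llvm::FunctionCallee getTokenAllocFn(llvm::Module &M, llvm::StringRef Name,
                                     llvm::FunctionType *FTy,
                                     llvm::AttributeList CalleeAttrs) {
  return M.getOrInsertFunction(Name, FTy, CalleeAttrs);
}
```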
…ve' from g++ compiler and remove unsupported floating-point data. (llvm#174915) When building the flang-rt project with the g++ compiler on a Linux x86-64 machine, the compiler gives the following warning:
```
llvm-project/flang-rt/lib/runtime/extensions.cpp:455:26: warning: left shift count is negative [-Wshift-count-negative]
  455 |       mask = ~(unsigned)0u << ((8 - digits) * 4 + 1);
      |              ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
```
For the full discussion, see llvm#173955. Co-authored-by: liao jun <liaojun@ultrarisc.com>
Signed-off-by: Jonas Rickert <jonas.rickert@amd.com>
This was introduced in 7c402b8.
Reorder some code to make it less confusing.
At the time ParseAttributeArgumentList is called, the first argument of an attribute may have already been parsed. We need to take this into account when accessing the ParsedAttributeArgumentsProperties mask, which specifies which of the attribute arguments are string literals. Pull Request: llvm#171017
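A hypothetical sketch of the indexing concern (helper name and signature are illustrative, not clang's API):
```cpp
#include <cstdint>

// If one argument was already consumed before ParseAttributeArgumentList
// runs, bit 0 of the string-literal mask describes that argument, so
// lookups for the remaining arguments must be offset accordingly.
bool isStringLiteralArg(std::uint32_t StringLiteralsMask, unsigned ArgIdx,
                        unsigned NumArgsAlreadyParsed) {
  return (StringLiteralsMask >> (ArgIdx + NumArgsAlreadyParsed)) & 1u;
}
```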
…vm#173983) Remove a helper function and query the `RegionBranchOpInterface` instead. (Which does the same thing.) Also add a TODO for a bug in the implementation of `SliceWalk.cpp`. (The bug is not fixed yet.)
Includes one manual fix to add -filetype=null to a RUN line in test/MC/AMDGPU/gfx1250_asm_sop1.s. Everything else is autogenerated.
…from being marked as partial maps (llvm#175133) The following test was triggering a runtime crash **on the host before launching the kernel**:
```fortran
program test_omp_target_map_bug_v5
  implicit none

  type nested_type
    real, allocatable :: alloc_field(:)
  end type nested_type

  type nesting_type
    integer :: int_field
    type(nested_type) :: derived_field
  end type nesting_type

  type(nesting_type) :: config

  allocate(config%derived_field%alloc_field(1))

  !$OMP TARGET ENTER DATA MAP(TO:config, config%derived_field%alloc_field)
  !$OMP TARGET
  config%derived_field%alloc_field(1) = 1.0
  !$OMP END TARGET

  deallocate(config%derived_field%alloc_field)
end program test_omp_target_map_bug_v5
```
In particular, the runtime was producing a segmentation fault when the test is compiled with any optimization level > 0; with -O0 the sample runs fine. After debugging the runtime, it turned out the crash was happening at the point where the runtime calls the default mapper emitted by the compiler for `nesting_type`; in particular at this point in the runtime: https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/offload/libomptarget/omptarget.cpp#L307. Bisecting the optimization pipeline using `-mllvm -opt-bisect-limit=N`, the first pass that triggered the issue on `O1` was the `instcombine` pass. Debugging this further, the issue narrows down to canonicalizing `getelementptr` instructions from using struct types (in this case the `nesting_type` in the sample above) to using byte addressing (`i8`). In particular, at `O0` you would see something like this:
```llvm
define internal void @.omp_mapper._QQFnesting_type_omp_default_mapper(ptr noundef %0, ptr noundef %1, ptr noundef %2, i64 noundef %3, i64 noundef %4, ptr noundef %5) #6 {
entry:
  %6 = udiv exact i64 %3, 56
  %7 = getelementptr %_QFTnesting_type, ptr %2, i64 %6
  ....
}
```
while after `instcombine` this becomes:
```llvm
define internal void @.omp_mapper._QQFnesting_type_omp_default_mapper(ptr noundef %0, ptr noundef %1, ptr noundef %2, i64 noundef %3, i64 noundef %4, ptr noundef %5) #6 {
entry:
  %6 = getelementptr i8, ptr %2, i64 %3
  ....
}
```
The `udiv exact` instruction emitted by the OMP IR Builder (see: https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp#L9154) allows `instcombine` to assume that `%3` is divisible by the struct size (here `56`) and, therefore, replace the result of the division with a direct GEP on `i8` rather than the struct type. However, the runtime was calling `@.omp_mapper._QQFnesting_type_omp_default_mapper` not with `56` (the proper struct size) but with `48`! Debugging this further, I found that the size of the `omp.map.info` operation to which the default mapper is attached computes the value of `48` because we set the map to partial (see: https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/flang/lib/Optimizer/OpenMP/MapInfoFinalization.cpp#L1146 and https://github.com/llvm/llvm-project/blob/c62cd2877cc25a0d708ad22a70c2a57590449c4d/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp#L4501-L4512). However, I think this is incorrect, since the emitted mapper (and user-defined mappers in general) is defined on the whole struct type and should never be marked as partial. Hence the fix in this PR.
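To illustrate the invariant that `udiv exact` lets instcombine exploit (the stand-in type below is hypothetical):
```cpp
#include <cstddef>

// Stand-in for the 56-byte struct in the reproducer (hypothetical layout).
struct NestingType { char bytes[56]; };

// `udiv exact` licenses the assumption Size % sizeof(NestingType) == 0,
// letting the optimizer rewrite the typed GEP (first function) into plain
// byte addressing (second function). The two only agree under that
// assumption, so passing Size == 48 for a 56-byte struct breaks it.
NestingType *typedGep(NestingType *P, size_t Size) {
  return P + Size / sizeof(NestingType); // %6 = udiv exact i64 %3, 56
}
NestingType *byteGep(NestingType *P, size_t Size) {
  return reinterpret_cast<NestingType *>(
      reinterpret_cast<char *>(P) + Size); // gep i8, ptr %2, i64 %3
}
```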
StringMap duplicates the option name into a new allocation for every option, which is not necessary. Instead, we can use the same StringRef that the Option already uses, inside a DenseMap. This reduces the number of allocations when loading libLLVM.
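A minimal sketch of the replacement, assuming the registry keys on the StringRef the Option already owns (e.g. its argument string):
```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/CommandLine.h"

// Keying a DenseMap by the Option's own StringRef avoids the per-entry
// string copy that StringMap makes; the Option just has to outlive the map.
using OptionRegistry = llvm::DenseMap<llvm::StringRef, llvm::cl::Option *>;
```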
…ec (llvm#175152) In ScalarExprEmitter::EmitScalarPrePostIncDec we create ConstantInt values that are either 1 or -1. There is a special case when the type is i1 (e.g. for unsigned _BitInt(1)) when we need to be able to create a "i1 true" value for both inc and dec. To avoid triggering the assertions added by the pull request llvm#171456 we now treat the ConstantInt as unsigned for increments and as signed for decrements.
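A sketch of the idea (not the exact clang change; helper name is illustrative):
```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"

// Build the step value as unsigned for increments and signed for
// decrements. For i1 both calls yield "i1 true", and neither trips the
// signedness assertions from llvm#171456, because 1 fits an unsigned i1
// and -1 fits a signed i1.
llvm::ConstantInt *getIncDecStep(llvm::IntegerType *Ty, bool IsInc) {
  return IsInc ? llvm::ConstantInt::get(Ty, 1, /*IsSigned=*/false)
               : llvm::ConstantInt::get(Ty, -1, /*IsSigned=*/true);
}
```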
… omp.simd (llvm#174916) This PR adds additional checks and tests for linear clause on omp.wsloop and omp.simd (both standalone and composite). For composite simd constructs, the translation to LLVMIR uses the same `LinearClauseProcessor` under `convertOmpSimd`, as already present in previous PRs like llvm#150386 and llvm#139386
Co-authored-by: Pranav Kant <prka@google.com>
Summary: llvm#174862 and llvm#174655 provided the intrinsics required to get the fundamental operations working for these. This patch sets up the basic support (as far as I know). This should be the first step towards allowing SPIR-V to build things like the LLVM libc and the OpenMP Device Runtime Library. The implementations here are intentionally inefficient, such as not using the dedicated SPIR-V opcode for read firstlane. This is just a starting point so we can hopefully begin testing things later. Would appreciate someone more familiar with the backend double-checking these.
They have been deprecated for more than five years in favor of !getdagop and !setdagop. See https://reviews.llvm.org/D89814.
We haven't yet decided what we want the `optional::iterator` type to be in the end, so let's make it experimental for now so that we don't commit to an ABI yet.
In the `= {"foo"}` case, we don't have an array filler we can use, so we need to explicitly zero the remaining elements.
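For instance, in a minimal case of this pattern (variable name is illustrative):
```cpp
// `{"foo"}` provides only 4 of the 8 elements ('f','o','o','\0'); there is
// no array filler covering the tail, so elements 4..7 must be zeroed
// explicitly when building the constant.
char buf[8] = {"foo"};
```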
…m#175176) getType() returns just int for those instead of an array type, so the previous condition resulted in the array index missing in the APValue's LValuePath.
This PR reduces outliers in terms of runtime performance by asking the
OS to prefetch memory-mapped input files in advance, as early as
possible. I have implemented the Linux aspect; however, I have only
tested this on Windows 11 version 24H2, with an active security stack
enabled. The machine is an AMD Threadripper PRO 3975WX 32c/64t with 128
GB of RAM and Samsung 990 PRO SSD.
I have used a Unreal Engine-based game to profile the link times. Here's
a quick summary of the input data:
```
Summary
--------------------------------------------------------------------------------
4,169 Input OBJ files (expanded from all cmd-line inputs)
26,325,429,114 Size of all consumed OBJ files (non-lazy), in bytes
9 PDB type server dependencies
0 Precomp OBJ dependencies
350,516,212 Input debug type records
18,146,407,324 Size of all input debug type records, in bytes
15,709,427 Merged TPI records
4,747,187 Merged IPI records
56,408 Output PDB strings
23,410,278 Global symbol records
45,482,231 Module symbol records
1,584,608 Public symbol records
```
In normal conditions - meaning all the pages are already in RAM - this
PR has no noticeable effect:
```
>hyperfine "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 29.689 s ± 0.550 s [User: 259.873 s, System: 37.936 s]
Range (min … max): 29.026 s … 30.880 s 10 runs
Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 29.594 s ± 0.342 s [User: 261.434 s, System: 62.259 s]
Range (min … max): 29.209 s … 30.171 s 10 runs
Summary
with_pr\lld-link.exe @Game.exe.rsp ran
1.00 ± 0.02 times faster than before\lld-link.exe @Game.exe.rsp
```
However, in production conditions, we're typically working with the
Unreal Engine Editor, with external DCC tools like Maya or Houdini; we have
several instances of Visual Studio open, VSCode with Rust analyzer, etc.
All this means that between code change iterations, most of the input
OBJ files might have already been evicted from the Windows RAM cache.
Consequently, in the following test, I've simulated the worst case
condition by evicting all data from RAM with
[RAMMap64](https://learn.microsoft.com/en-us/sysinternals/downloads/rammap)
(i.e. `RAMMap64.exe -E[wsmt0]` with a 5-sec sleep at the end to ensure
the System thread actually has time to evict the pages)
```
>hyperfine -p cleanup.bat "before\lld-link.exe @Game.exe.rsp" "with_pr\lld-link.exe @Game.exe.rsp"
Benchmark 1: before\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 48.124 s ± 1.770 s [User: 269.031 s, System: 41.769 s]
Range (min … max): 46.023 s … 50.388 s 10 runs
Benchmark 2: with_pr\lld-link.exe @Game.exe.rsp
Time (mean ± σ): 34.192 s ± 0.478 s [User: 263.620 s, System: 40.991 s]
Range (min … max): 33.550 s … 34.916 s 10 runs
Summary
with_pr\lld-link.exe @Game.exe.rsp ran
1.41 ± 0.06 times faster than before\lld-link.exe @Game.exe.rsp
```
This is similar to the work done in MachO in
llvm#157917
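For reference, a minimal sketch of the Linux-side mechanism described above (illustrative only; not the exact LLD change):
```cpp
#include <cstddef>
#include <sys/mman.h>

// After mmap'ing an input file, hint the kernel to start paging it in
// asynchronously so later reads hit warm pages instead of faulting cold.
inline void prefetchMapping(void *Base, std::size_t Length) {
  (void)::madvise(Base, Length, MADV_WILLNEED);
}
```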
…17543) For some targets, it is required to identify the COPY instructions that correspond to RA-inserted live range splits. This adds the new flag `MachineInstr::LRSplit` to serve that purpose.
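A sketch of how a backend could consume the flag (assuming the enumerator lands as `MachineInstr::LRSplit`, as this patch proposes):
```cpp
#include "llvm/CodeGen/MachineInstr.h"

// Recognize a COPY that regalloc inserted for a live range split.
bool isRASplitCopy(const llvm::MachineInstr &MI) {
  return MI.isCopy() && MI.getFlag(llvm::MachineInstr::LRSplit);
}
```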
The COPY inserted for a live range split during the sgpr-regalloc pipeline currently breaks the BB prolog during the subsequent vgpr-regalloc phase while spilling and/or splitting the vector live ranges. This patch fixes it by correctly including the LR split instructions during the sgpr-regalloc and wwm-regalloc pipelines in the BB prolog.
These tests redirected stderr to stdout, but never actually checked for any errors.
Summary: This is only really meaningful for the NVPTX target. Not all build environments support host LTO, and these are redundant tests; just clean this up and make it run faster.
Changes this to getSigned() to match the signedness of the calculation. However, we still need to allow truncation because the addition result may overflow, and the operation is specified to truncate in that case. Fixes llvm#175159.
This patch teaches clang-tblgen to start emitting ABI lowering pattern declarations.
…ic is the only user (llvm#172723) Closes llvm#172176. Previously, `FoldOpIntoSelect` wouldn't fold multi-use selects if `MultiUse` wasn't explicitly true. This prevented useful folding when the select is used multiple times in the same intrinsic call. Similar to what is done in `foldOpIntoPhi`, we'll now check that all of the uses come from a single user, rather than checking that there is only one use.
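A sketch of the relaxed check (helper name is illustrative; InstCombine's own code differs in detail):
```cpp
#include "llvm/IR/Instructions.h"

// A select with several uses is still safe to fold when every use belongs
// to one and the same user instruction, e.g. an intrinsic call that takes
// the select in multiple operands.
bool hasSingleUser(const llvm::SelectInst *SI) {
  const llvm::User *OnlyUser = nullptr;
  for (const llvm::User *U : SI->users()) {
    if (OnlyUser && OnlyUser != U)
      return false;
    OnlyUser = U;
  }
  return OnlyUser != nullptr;
}
```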
The current Zilsd optimizer only supports a base operand that is in a register; however, many use cases are essentially stack loads/stores.
Signed-off-by: Philip Wilkinson <philip.wilkinson@arm.com>
dpalermo (Collaborator) approved these changes on Jan 9, 2026