@valassi
Member

@valassi valassi commented Nov 24, 2025

WIP patch for issue #1072

I moved deltaMEs to fptype2 from fptype, but I do not think this will improve the situation - probably fpvmerge is the culprit

@valassi valassi self-assigned this Nov 24, 2025
@valassi valassi marked this pull request as draft November 24, 2025 21:20
@valassi
Member Author

valassi commented Nov 24, 2025

I committed a new implementation of fpvmerge based on intrinsics (plus one based on experimental::simd, which however none of the gcc11/14 compilers accepts).

I am testing the functionality here in the CI, then I will test the performance elsewhere.

…ng color sums

Apply these as follows
  cd gg_ttggg.mad/SubProcesses
  patch -i ../../patchS.patch
  cd P1_gg_ttxggg/
  patch -i ../../../patchP.patch
  cd ../../..
…e icol loop) and add back colorsum timer

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…/nosimd and for double/float

This is clearly faster for mixed/nosimd, but it is slower for float/double
…md but not for double/float

Now this is better for mixed/cppnone but brings everything else back to the previous good performance
…md and for doublefloat/nosimd

This is 10% faster for doublefloat/nosimd while keeping doublefloat/simd unchanged
This is also more robust in cppnone if autovectorization is disabled
… autovectorization for all build modes

For cppnone the color sum is now slower than sse4 by the expected factors 4/8/8
For cppavx2/cpp512y/cpp512z however this is ~20% slower than with autovectorization
Revert "[csm] gg_ttggg.mad colorsum TEST1 code/results (will revert): disable autovectorization for all build modes"
This reverts commit 17c72bd.
… autovectorization only for cppnone

For cppnone the color sum is now slower than sse4 by the expected factors 4/8/8
(and d/m/f all give the same performance in cppnone)

All other build modes and especially cppavx2/cpp512y/cpp512z are unchanged
…de to disable autovectorization for cppnone)
…dd back colorsum timer

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…rsSplitMerge.h header (minimise dependencies)
… back colorsum timer

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…sing intrinsics on __x86_64__

Clean up the code to also allow scalar and (default) initializer list implementations
…insics and experimentalSIMD

Keep the original initializer list implementation as the default
…nd add back colorsum timer

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
Revert "[csm] TMP ggttggg code/results using fpvmerge/initlist but without autovectorization in cppnone"
This reverts commit 240b5f5.
…vectorization in cppnone

Essentially I disabled autovectorization in cppnone when starting from this codebase:
  git checkout a9b52d2 gg_ttggg.mad
Revert "[csm] TMP ggttggg code/results using upstream/master but without autovectorization in cppnone"
This reverts commit 738f362.
With respect to the last LUMI logs for upstream/master (commit 6baae79 in hack_ihel3p1):
- Performance seems unchanged everywhere

STARTED  AT Sun 07 Dec 2025 04:13:17 PM EET
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Sun 07 Dec 2025 06:26:14 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling  -nocuda
ENDED(1-scaling) AT Sun 07 Dec 2025 06:32:30 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn  -nocuda
ENDED(2) AT Sun 07 Dec 2025 06:35:42 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling  -nocuda
ENDED(2-scaling) AT Sun 07 Dec 2025 06:53:14 PM EET [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(3) AT Sun 07 Dec 2025 07:33:16 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean  -nocuda
ENDED(4) AT Sun 07 Dec 2025 07:43:00 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst  -nocuda
ENDED(5) AT Sun 07 Dec 2025 07:45:04 PM EET [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda'
ENDED(6) AT Sun 07 Dec 2025 07:45:04 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda
ENDED(7) AT Sun 07 Dec 2025 07:47:00 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean  -nocuda
ENDED(8) AT Sun 07 Dec 2025 07:57:43 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(9) AT Sun 07 Dec 2025 09:23:43 PM EET [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs
With respect to the last LUMI logs for upstream/master (commit c593242 in hack_ihel3p1):
- Performance seems unchanged everywhere
Revert "[csm] rerun 30 tmad tests on LUMI - all ok"
This reverts commit 8eaabcb.

Revert "[csm] rerun 138 tput tests on LUMI - all ok"
This reverts commit ca36ab7.
With respect to the last rd90 logs for upstream/master (commit 4178974 in hack_ihel3p1):
- Performance is around 5% better on CPU (mainly cppnone) and essentially the same everywhere else
With respect to the last rd90 logs for upstream/master (commit 5fce1aa in hack_ihel3p1):
- Performance is around 5% better on CPU (mainly m/cppnone) and essentially the same everywhere else

STARTED  AT Sun Dec  7 07:53:39 PM CET 2025
(SM tests)
ENDED(1) AT Sun Dec  7 08:44:21 PM CET 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sun Dec  7 08:48:17 PM CET 2025 [Status=0]
@valassi valassi linked an issue Dec 11, 2025 that may be closed by this pull request
@valassi valassi changed the title WIP: performance fixes for FPTYPE=m color sums on CPU Tests and minor performance improvements for FPTYPE=m color sums on CPU Dec 11, 2025
@valassi
Member Author

valassi commented Dec 11, 2025

The issues described in #1072 are clarified in https://arxiv.org/abs/2510.05392v3, which should appear tomorrow. Essentially there is no suboptimal scaling. The problem is that "cppnone" has some SSE/SSE2 color-level autovectorization in the color sum, so the speedup from cppnone to cppsse4 (which uses event-level explicit vectorization) is very low. Essentially SSE/SSE2 are not disabled because they are part of the System V ABI and simply cannot be disabled. Notably, -march=x86-64 does not disable them.

What I did to compare to a real "no SIMD" scenario was to disable autovectorization in the color sum. In that case the SIMD speedups are what one expects.
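For reference, a minimal sketch of how autovectorization can be switched off for a single function with GCC. This is an illustration under assumptions, not necessarily the mechanism used in the csm branch: the macro `NO_AUTOVEC`, the function `colorSumNoSimd`, the toy size `ncolor = 4` and the real-valued jamps are all hypothetical. Note that the GCC documentation describes the `optimize` attribute as a debugging aid, which fits a "what if there were no SIMD" comparison like the one above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy number of colors (the real value depends on the physics process)
constexpr std::size_t ncolor = 4;

// GCC-specific: ask the compiler not to autovectorize the annotated function.
// Clang does not support the 'optimize' attribute, so the macro is empty there.
#if defined( __GNUC__ ) && !defined( __clang__ )
#define NO_AUTOVEC __attribute__( ( optimize( "no-tree-vectorize" ) ) )
#else
#define NO_AUTOVEC
#endif

// Toy color sum for one event: ME = sum_ij jamp[i] * cf[i][j] * jamp[j]
// (real-valued jamps and a row-major color matrix, for illustration only)
NO_AUTOVEC double colorSumNoSimd( const double* jamp, const double* cf )
{
  assert( jamp != nullptr && cf != nullptr );
  double me = 0;
  for( std::size_t i = 0; i < ncolor; i++ )
  {
    double row = 0; // (cf * jamp)[i]
    for( std::size_t j = 0; j < ncolor; j++ )
      row += cf[i * ncolor + j] * jamp[j];
    me += jamp[i] * row;
  }
  return me;
}
```

Building the same file with and without the attribute and comparing the generated assembly (e.g. looking for packed `mulpd`/`addpd` instructions) is one way to check whether the loop was autovectorized.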

The real good news is that SIMD color sums in mixed precision are optimized. I did a few minor extra improvements in #1073 but there was not much room for improvements.

This table is from arxiv v3:

[Image: table from arXiv:2510.05392v3]

There are three tests/changes in the csm branch of #1073.

    1. A general streamlining of the SIMD color sum. This slightly improves the sse4-512z modes, and also improves the cppnone mode by a factor 2.
    2. I disabled autovectorization and checked that one finds the SIMD speedups one expects.
    3. Since I was concerned about the overheads of fpvmerge, I reimplemented it with intrinsics and also with experimental SIMD. It turns out that my original implementation with initializer lists was already optimal (intrinsics is just a tiny bit faster), so I keep that. The experimental SIMD version that I implemented is slower. No need to do anything better. Even the scalar version is as fast as initlist/intrinsics because it gets autovectorized.
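To make the three fpvmerge flavours in the last point concrete, here is a minimal sketch. It assumes that fpvmerge merges two double-precision SIMD vectors into one single-precision vector with twice as many lanes, and it uses 128-bit compiler vector extensions. The type names `double_v`/`float_v`, the function names and the lane ordering are hypothetical and are not taken from the actual code in this PR.

```cpp
#include <cassert>
#if defined( __x86_64__ )
#include <emmintrin.h> // SSE2 intrinsics (always available on x86-64)
#endif

// Hypothetical 128-bit vector types (gcc/clang vector extensions)
typedef double double_v __attribute__( ( vector_size( 16 ) ) ); // 2 doubles
typedef float float_v __attribute__( ( vector_size( 16 ) ) );   // 4 floats

// Initializer-list flavour: build the float vector lane by lane
inline float_v fpvmerge_initlist( const double_v& v1, const double_v& v2 )
{
  float_v out = { (float)v1[0], (float)v1[1], (float)v2[0], (float)v2[1] };
  return out;
}

// Scalar flavour: a plain loop over the lanes (the compiler may autovectorize it)
inline float_v fpvmerge_scalar( const double_v& v1, const double_v& v2 )
{
  float_v out = {};
  for( int i = 0; i < 2; i++ )
  {
    out[i] = (float)v1[i];
    out[i + 2] = (float)v2[i];
  }
  return out;
}

#if defined( __x86_64__ )
// Intrinsics flavour: convert each pair of doubles to floats, then join the two halves
inline float_v fpvmerge_intrinsics( const double_v& v1, const double_v& v2 )
{
  // _mm_cvtpd_ps puts the two converted floats in the low half (high half is zero)
  __m128 lo = _mm_cvtpd_ps( _mm_loadu_pd( reinterpret_cast<const double*>( &v1 ) ) );
  __m128 hi = _mm_cvtpd_ps( _mm_loadu_pd( reinterpret_cast<const double*>( &v2 ) ) );
  float_v out;
  // _mm_movelh_ps yields { lo[0], lo[1], hi[0], hi[1] }
  _mm_storeu_ps( reinterpret_cast<float*>( &out ), _mm_movelh_ps( lo, hi ) );
  return out;
}
#endif
```

All three flavours return the same lanes for the same inputs, so in a sketch like this the choice between them is purely a performance question, which is consistent with the timing comparison described above.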

I closed #1072 as understood.

And I am marking this #1073 as ready to be merged. You get the minor improvements in the table, plus also the intrinsics and experimental simd fpvmerge that can be useful as future reference.

This completes the madgraph4gpu work that I was doing on kernel splitting.

@valassi valassi marked this pull request as ready for review December 11, 2025 10:13
@valassi
Member Author

valassi commented Dec 11, 2025

Hi Olivier, Daniele, I mark you as reviewers. Let me know if you want to discuss this. Thanks
Andrea

Development

Successfully merging this pull request may close these issues.

Understood: apparent suboptimal SIMD scaling of FPTYPE=m color sum