Tests and minor performance improvements for FPTYPE=m color sums on CPU #1073

valassi · 2025-11-24T21:20:05Z

WIP patch for issue #1072

I moved deltaMEs to fptype2 from fptype, but I do not think this will improve the situation - probably fpvmerge is the culprit
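For readers unfamiliar with the pattern, here is a minimal scalar sketch of a mixed-precision color sum in which the accumulator (`deltaMEs`) and the precomputed real/imaginary parts (`jampR`, `jampI`) live in `fptype2`. It is not the generated `color_sum.cc` code: the type aliases, the tiny `ncolor`, the function name `colorSumMixed` and the assumption of a real symmetric color matrix are all illustrative choices made here.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

// Assumed aliases for FPTYPE=m: amplitudes in double, color algebra in float
using fptype = double;
using fptype2 = float;

constexpr std::size_t ncolor = 2; // illustrative size only (hypothetical)

using JampArray = std::array<std::complex<fptype>, ncolor>;
using ColorMatrix = std::array<std::array<fptype2, ncolor>, ncolor>;

// Computes sum_ij conj(jamp_i) * cf_ij * jamp_j for a real symmetric cf,
// i.e. sum_ij cf_ij * ( Re_i * Re_j + Im_i * Im_j ), entirely in fptype2
inline fptype colorSumMixed( const JampArray& jamp, const ColorMatrix& cf )
{
  // Convert real and imaginary parts to fptype2 once, before the icol loop
  std::array<fptype2, ncolor> jampR{}, jampI{};
  for( std::size_t icol = 0; icol < ncolor; icol++ )
  {
    jampR[icol] = static_cast<fptype2>( jamp[icol].real() );
    jampI[icol] = static_cast<fptype2>( jamp[icol].imag() );
  }
  fptype2 deltaMEs = 0; // accumulator kept in fptype2 inside the icol loop
  for( std::size_t icol = 0; icol < ncolor; icol++ )
  {
    fptype2 ztempR = 0;
    fptype2 ztempI = 0;
    for( std::size_t jcol = 0; jcol < ncolor; jcol++ )
    {
      ztempR += cf[icol][jcol] * jampR[jcol];
      ztempI += cf[icol][jcol] * jampI[jcol];
    }
    deltaMEs += ztempR * jampR[icol] + ztempI * jampI[icol];
  }
  return static_cast<fptype>( deltaMEs ); // widen only once at the end
}
```

The point of the sketch is only where the precision changes happen: one narrowing per color amplitude before the double loop and one widening at the end, so that the inner loop never mixes `fptype` and `fptype2`.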

valassi · 2025-11-24T22:54:30Z

I committed a new implementation of fpvmerge based on intrinsics (plus one based on experimental::simd, which however neither gcc11 nor gcc14 accepts).

I am testing the functionality here in the CI, then I will test the performance elsewhere.

… autovectorization only for cppnone
For cppnone the color sum is now slower than sse4 by the expected factors 4/8/8 (and d/m/f all give the same performance in cppnone)
All other build modes and especially cppavx2/cpp512y/cpp512z are unchanged

With respect to the last LUMI logs for upstream/master (commit 6baae79 in hack_ihel3p1):
- Performance seems unchanged everywhere

STARTED AT Sun 07 Dec 2025 04:13:17 PM EET
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -nocuda
ENDED(1) AT Sun 07 Dec 2025 06:26:14 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling -nocuda
ENDED(1-scaling) AT Sun 07 Dec 2025 06:32:30 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -nocuda
ENDED(2) AT Sun 07 Dec 2025 06:35:42 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -nocuda
ENDED(2-scaling) AT Sun 07 Dec 2025 06:53:14 PM EET [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(3) AT Sun 07 Dec 2025 07:33:16 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean -nocuda
ENDED(4) AT Sun 07 Dec 2025 07:43:00 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -nocuda
ENDED(5) AT Sun 07 Dec 2025 07:45:04 PM EET [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -nocuda'
ENDED(6) AT Sun 07 Dec 2025 07:45:04 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -nocuda
ENDED(7) AT Sun 07 Dec 2025 07:47:00 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean -nocuda
ENDED(8) AT Sun 07 Dec 2025 07:57:43 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(9) AT Sun 07 Dec 2025 09:23:43 PM EET [Status=0]

No errors found in logs
No FPEs or '{ }' found in logs
No aborts found in logs

With respect to the last rd90 logs for upstream/master (commit 5fce1aa in hack_ihel3p1):
- Performance is around 5% better on CPU (mainly m/cppnone) and essentially the same everywhere else

STARTED AT Sun Dec 7 07:53:39 PM CET 2025
(SM tests)
ENDED(1) AT Sun Dec 7 08:44:21 PM CET 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sun Dec 7 08:48:17 PM CET 2025 [Status=0]

valassi · 2025-12-11T10:13:00Z

The issues described in #1072 are clarified in https://arxiv.org/abs/2510.05392v3, which should appear tomorrow. Essentially, there is no suboptimal scaling. The problem is that "cppnone" has some SSE/SSE2 color-level autovectorization in the color sum, so the speedup from cppnone to cppsse4 (which uses event-level explicit vectorization) is very low. SSE/SSE2 are not disabled because they are part of the System V ABI and simply cannot be disabled. Notably, -march=x86-64 does not disable them.

What I did to compare to a real "no SIMD" scenario was to disable autovectorization in the color sum. In that case the SIMD speedups are what one expects.

The real good news is that SIMD color sums in mixed precision are optimized. I made a few minor extra improvements in #1073, but there was not much room for improvement.
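As an illustration of what "disabling autovectorization" for a single kernel can look like, here is a minimal sketch. It is not taken from the PR (the mechanism used there is not described in this thread): the macros `NO_AUTOVEC_FN` and `NO_AUTOVEC_LOOP`, the function names and the toy quadratic form are all hypothetical. It relies on the GCC `optimize` function attribute and on the clang loop pragma, which are compiler-specific and best treated as test-only tools.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper macros (not from the PR): per-function or per-loop
// switches that ask the compiler not to autovectorize
#if defined( __clang__ )
#define NO_AUTOVEC_FN
#define NO_AUTOVEC_LOOP _Pragma( "clang loop vectorize(disable) interleave(disable)" )
#elif defined( __GNUC__ )
#define NO_AUTOVEC_FN __attribute__( ( optimize( "no-tree-vectorize", "no-tree-slp-vectorize" ) ) )
#define NO_AUTOVEC_LOOP
#else
#define NO_AUTOVEC_FN
#define NO_AUTOVEC_LOOP
#endif

constexpr std::size_t ncolor = 4; // illustrative size only

// Toy color-sum-like quadratic form sum_ij x_i * cf_ij * x_j:
// here the compiler is free to autovectorize (e.g. with SSE2 on x86-64)
inline float quadFormAuto( const float ( &cf )[ncolor][ncolor], const float ( &x )[ncolor] )
{
  float sum = 0;
  for( std::size_t i = 0; i < ncolor; i++ )
  {
    float row = 0;
    for( std::size_t j = 0; j < ncolor; j++ ) row += cf[i][j] * x[j];
    sum += row * x[i];
  }
  return sum;
}

// Same arithmetic, but with autovectorization switched off for this function
NO_AUTOVEC_FN float quadFormScalar( const float ( &cf )[ncolor][ncolor], const float ( &x )[ncolor] )
{
  float sum = 0;
  NO_AUTOVEC_LOOP
  for( std::size_t i = 0; i < ncolor; i++ )
  {
    float row = 0;
    NO_AUTOVEC_LOOP
    for( std::size_t j = 0; j < ncolor; j++ ) row += cf[i][j] * x[j];
    sum += row * x[i];
  }
  return sum;
}
```

Both functions compute the same value; only the generated instructions may differ. Timing the second against an explicitly vectorized build is the kind of "real no SIMD" baseline described above.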

This table is from arxiv v3:

There are three tests/changes in the csm branch of #1073.

1. A general streamlining of the SIMD color sum. This slightly improves the sse4 to 512z modes, and also improves the cppnone mode by a factor 2.
1. I disabled autovectorization and I checked that one finds the SIMD speedups one expects.
1. Since I was concerned about the overheads of fpvmerge, I reimplemented it with intrinsics and also with experimental SIMD. It turns out that my original implementation with initializer lists was already optimal (intrinsics is just a tiny bit faster), so I keep that. The experimental SIMD version that I implemented is slower. No need to do anything better. Even the scalar version is as fast as initlist/intrinsics because it gets autovectorized.
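To make the three fpvmerge flavours concrete, here is a small sketch using 128-bit compiler vector extensions (two doubles merged pairwise into four floats). The name `fpvmerge` comes from this PR, but everything else is an assumption made for illustration: the vector widths, the type aliases and the function suffixes are hypothetical, and the actual code in mgOnGpuVectorsSplitMerge.h may look different. The experimental::simd variant is not shown.

```cpp
#include <cassert>
#if defined( __x86_64__ )
#include <emmintrin.h> // SSE2 intrinsics
#endif

// Assumed vector types (gcc/clang vector extensions), 16 bytes each
typedef double fptype_v __attribute__( ( vector_size( 16 ) ) ); // 2 doubles
typedef float fptype2_v __attribute__( ( vector_size( 16 ) ) ); // 4 floats

// Variant 1: initializer list (the default kept in this PR, per the text above)
inline fptype2_v fpvmerge_initlist( const fptype_v& v1, const fptype_v& v2 )
{
  fptype2_v out = { (float)v1[0], (float)v1[1], (float)v2[0], (float)v2[1] };
  return out;
}

// Variant 2: plain scalar loop (the compiler may autovectorize it)
inline fptype2_v fpvmerge_scalar( const fptype_v& v1, const fptype_v& v2 )
{
  fptype2_v out = { 0, 0, 0, 0 };
  for( int i = 0; i < 2; i++ )
  {
    out[i] = (float)v1[i];
    out[i + 2] = (float)v2[i];
  }
  return out;
}

#if defined( __x86_64__ )
// Variant 3: explicit SSE2 intrinsics (x86-64 only)
inline fptype2_v fpvmerge_intrinsics( const fptype_v& v1, const fptype_v& v2 )
{
  const __m128 lo = _mm_cvtpd_ps( (__m128d)v1 ); // { v1[0], v1[1], 0, 0 }
  const __m128 hi = _mm_cvtpd_ps( (__m128d)v2 ); // { v2[0], v2[1], 0, 0 }
  return (fptype2_v)_mm_movelh_ps( lo, hi );     // { v1[0], v1[1], v2[0], v2[1] }
}
#endif
```

All variants return the same four floats; which one is fastest depends on the compiler and build flags, which is why the comparison above was done by measurement rather than by inspection.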

I closed #1072 as understood.

And I am marking this #1073 as ready to be merged. You get the minor improvements in the table, plus the intrinsics and experimental SIMD fpvmerge implementations, which can be useful as a future reference.

This completes the madgraph4gpu work that I was doing on kernel splitting.

valassi · 2025-12-11T10:14:07Z

Hi Olivier, Daniele, I mark you as reviewers. Let me know if you want to discuss this. Thanks
Andrea

valassi self-assigned this Nov 24, 2025

valassi marked this pull request as draft November 24, 2025 21:20

valassi added 27 commits December 7, 2025 08:26

[hack_ihel4p2] ignore perf.data* in epochX/cudacpp/.gitignore

9f0a76e

[csm] add two patches (derived from branch paper25v2) for instrumenting color sums

f67f27e

Apply these as follows
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..

[csm] gg_ttggg.mad: instrument color sums with timers using patchS and patchP

e56ef48

[csm] add PAPER25/colortimer.sh from branch paper25v2 (commit cd5d62860)

5167d5a

[csm] PAPER25/colortimer.sh: add ggttggg SIMD scans with skipCuda option to produce a raw output

c795cd5

[csm] PAPER25/colortimer.sh: run PAPER25/simdparser.py to produce a summary table

e9d80be

[csm] add raw and summary results from gg_ttggg on gold91 (using upstream/master)

a9b52d2

[csm] CODEGEN color_sum.cc patch1 (for colorsum mixed SIMD madgraph5#1072): use fptype2 deltaMEs inside icol loop

ccc12b1

[csm] regenerate gg_ttggg.mad with patch1 (use fptype2 deltaMEs inside icol loop) and add back colorsum timer

f8a9c96

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..

[csm] rerun ggttggg SIMD tests with patch1 (use fptype2 deltaMEs inside icol loop): essentially no change

00bcbb0

[csm] gg_ttggg.mad color_sum.cc patch2a: precompute jampR_sv also for mixed/nosimd and for double/float

2743b00

[csm] retest ggttggg with patch2a: precompute jampR_sv also for mixed/nosimd and for double/float

d268a7a

This is clearly faster for mixed/nosimd, but it is slower for float/double

[csm] gg_ttggg.mad color_sum.cc patch2b: precompute jampR_sv for mixed/nosimd but not for double/float

6d59bbd

[csm] retest ggttggg with patch2b: precompute jampR_sv for mixed/nosimd but not for double/float

06a832a

Now this is better for mixed/cppnone but brings everything else back to the previous good performance

[csm] gg_ttggg.mad color_sum.cc patch2c: precompute jampR_sv for mixed/nosimd and for doublefloat/nosimd

d7bdf46

[csm] retest ggttggg with patch2c: precompute jampR_sv for mixed/nosimd and for doublefloat/nosimd

804bb62

This is 10% faster for doublefloat/nosimd while keeping doublefloat/simd unchanged
This is also more robust in cppnone if autovectorization is disabled

[csm] gg_ttggg.mad colorsum TEST1 code/results (will revert): disable autovectorization for all build modes

17c72bd

For cppnone the color sum is now slower than sse4 by the expected factors 4/8/8
For cppavx2/cpp512y/cpp512z however this is ~20% slower than with autovectorization

[csm] gg_ttggg.mad colorsum revert TEST1 code/results

f74d34b

Revert "[csm] gg_ttggg.mad colorsum TEST1 code/results (will revert): disable autovectorization for all build modes"
This reverts commit 17c72bd.

[csm] gg_ttggg.mad color_sum.cc complete patch2 (comment out TEST2 code to disable autovectorization for cppnone)

0099352

[csm] retest ggttggg with patch2: precompute jampR_sv for m/nosimd and df/nosimd, keep autovectorization

96995b5

[csm] CODEGEN color_sum.cc patch2 (for colorsum mixed SIMD madgraph5#1072): precompute jampR_sv for dmf/nosimd

d195f21

[csm] regenerate gg_ttggg.mad with patch2 (precompute jampR_sv) and add back colorsum timer

9c20f96

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..

[csm] gg_ttggg.mad: move fpvsplit/fpvmerge to a separate mgOnGpuVectorsSplitMerge.h header (minimise dependencies)

f54ec8f

[csm] CODEGEN: move fpvsplit/fpvmerge to a separate mgOnGpuVectorsSplitMerge.h header (minimise dependencies)

c27d724

[csm] regenerate gg_ttggg.mad with mgOnGpuVectorsSplitMerge.h and add back colorsum timer

7ad51cc

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..

[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: reimplement fpvmerge using intrinsics on __x86_64__

8bf25fd

Clean up the code to also allow scalar and (default) initializer list implementations

valassi added 5 commits December 7, 2025 13:56

[csm] ggttggg results using defaults again (fpvmerge/initlist with autovectorization)

cdc75a7

[csm] CODEGEN mgOnGpuVectorsSplitMerge.h: clean up fpvmerge, add intrinsics and experimentalSIMD

570c54d

Keep the original initializer list implementation as the default

[csm] CODEGEN mgOnGpuVectorsSplitMerge.h: fix clang-format for unions

2a4ff80

[csm] gg_ttggg.mad mgOnGpuVectorsSplitMerge.h: fix clang-format for unions

8b32815

[csm] regenerate gg_ttggg.mad with final mgOnGpuVectorsSplitMerge.h and add back colorsum timer

2ee77e5

./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..

valassi force-pushed the csm branch from 07a750e to ce05312 Compare December 7, 2025 14:09

valassi added 6 commits December 7, 2025 20:41

[csm] TMP ggttggg code/results using fpvmerge/initlist but without autovectorization in cppnone

240b5f5

[csm] back to ggttggg code/results using defaults

59db527

Revert "[csm] TMP ggttggg code/results using fpvmerge/initlist but without autovectorization in cppnone"
This reverts commit 240b5f5.

[csm] TMP ggttggg code/results using upstream/master but without autovectorization in cppnone

738f362

Essentially I disabled autovectorization in cppnone when starting from this codebase:
git checkout a9b52d2 gg_ttggg.mad

[csm] back to ggttggg code/results using defaults

ef121ba

Revert "[csm] TMP ggttggg code/results using upstream/master but without autovectorization in cppnone"
This reverts commit 738f362.

[csm] CLEANUP: move to PAPER25 the two patches for instrumenting color sums

db5093f

[csm] CLEANUP: remove the PAPER25 directory

0d942d4

valassi force-pushed the csm branch from ce05312 to 328f39d Compare December 7, 2025 21:44

valassi added 6 commits December 11, 2025 11:43

[csm] regenerate all processes with colorsum/simd patches and a separate mgOnGpuVectorsSplitMerge.h

e6a139e

[csm] rerun 30 tmad tests on LUMI - all ok

8eaabcb

With respect to the last LUMI logs for upstream/master (commit c593242 in hack_ihel3p1):
- Performance seems unchanged everywhere

[csm] go back from csm/LUMI to hack_ihel3p1/itscrd90 logs

1ba0e92

Revert "[csm] rerun 30 tmad tests on LUMI - all ok"
This reverts commit 8eaabcb.
Revert "[csm] rerun 138 tput tests on LUMI - all ok"
This reverts commit ca36ab7.

[csm] rerun 144 tput tests on itscrd90 - all ok

967e077

With respect to the last rd90 logs for upstream/master (commit 4178974 in hack_ihel3p1):
- Performance is around 5% better on CPU (mainly cppnone) and essentially the same everywhere else

valassi force-pushed the csm branch from 328f39d to d3ee3cb Compare December 11, 2025 09:50

valassi linked an issue Dec 11, 2025 that may be closed by this pull request

Understood: apparent suboptimal SIMD scaling of FPTYPE=m color sum #1072

Closed

valassi mentioned this pull request Dec 11, 2025

Kernel splitting ihel4-ihel6: Feynman diagram groups #1066

Open

valassi changed the title ~~WIP: performance fixes for FPTYPE=m color sums on CPU~~ Tests and minor performance improvements for FPTYPE=m color sums on CPU Dec 11, 2025

valassi mentioned this pull request Dec 11, 2025

Understood: apparent suboptimal SIMD scaling of FPTYPE=m color sum #1072

Closed

valassi marked this pull request as ready for review December 11, 2025 10:13

valassi requested review from Qubitol and oliviermattelaer December 11, 2025 10:13

valassi mentioned this pull request Dec 11, 2025

cppauto and AVX/AVX2 support + question #1068

Closed
