Tests and minor performance improvements for FPTYPE=m color sums on CPU #1073
I committed a new implementation of fpvmerge based on intrinsics (plus one based on experimental::simd, which, however, neither gcc11 nor gcc14 accepts). I am testing the functionality here in the CI; I will then test the performance elsewhere.
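As an illustration of what an intrinsics-based merge can look like, here is a minimal sketch; it is not the code committed in this PR. It assumes that the merge narrows two pairs of doubles into four floats (the SSE-width case). The function name `mergeDoublesToFloats` and its array-based signature are hypothetical, and a scalar fallback is used when SSE2 is not available.

```cpp
#if defined( __SSE2__ )
#include <emmintrin.h> // SSE2 intrinsics (also pulls in the SSE ones)
#endif

// Hypothetical helper for this sketch (NOT the repository's fpvmerge signature):
// narrow two pairs of doubles into four floats, laid out as [lo0, lo1, hi0, hi1].
inline void mergeDoublesToFloats( const double lo[2], const double hi[2], float out[4] )
{
#if defined( __SSE2__ )
  // _mm_cvtpd_ps converts 2 doubles to 2 floats in the lower half, zeroing the upper half
  const __m128 flo = _mm_cvtpd_ps( _mm_loadu_pd( lo ) ); // [lo0, lo1, 0, 0]
  const __m128 fhi = _mm_cvtpd_ps( _mm_loadu_pd( hi ) ); // [hi0, hi1, 0, 0]
  // _mm_movelh_ps keeps the lower half of flo and appends the lower half of fhi
  _mm_storeu_ps( out, _mm_movelh_ps( flo, fhi ) ); // [lo0, lo1, hi0, hi1]
#else
  // Scalar fallback: element-by-element narrowing
  out[0] = static_cast<float>( lo[0] );
  out[1] = static_cast<float>( lo[1] );
  out[2] = static_cast<float>( hi[0] );
  out[3] = static_cast<float>( hi[1] );
#endif
}
```

Calling `mergeDoublesToFloats( lo, hi, out )` with `lo = { 1.5, -2.25 }` and `hi = { 3.0, 0.125 }` fills `out` with the same four values in single precision. Wider vectors (AVX2, AVX512) would need the corresponding 256-bit or 512-bit conversion intrinsics, which this sketch does not cover.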
…ng color sums
Apply these as follows
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…ion to produce a raw output
…1072): use fptype2 deltaMEs inside icol loop
…e icol loop) and add back colorsum timer
./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…de icol loop): essentially no change
… mixed/nosimd and for double/float
…/nosimd and for double/float
This is clearly faster for mixed/nosimd, but it is slower for float/double
…d/nosimd but not for double/float
…md but not for double/float
Now this is better for mixed/cppnone but brings everything else back to the previous good performance
…d/nosimd and for doublefloat/nosimd
…md and for doublefloat/nosimd
This is 10% faster for doublefloat/nosimd while keeping doublefloat/simd unchanged
This is also more robust in cppnone if autovectorization is disabled
… autovectorization for all build modes
For cppnone the color sum is now slower than sse4 by the expected factors 4/8/8
For cppavx2/cpp512y/cpp512z however this is ~20% slower than with autovectorization
Revert "[csm] gg_ttggg.mad colorsum TEST1 code/results (will revert): disable autovectorization for all build modes" This reverts commit 17c72bd.
… autovectorization only for cppnone
For cppnone the color sum is now slower than sse4 by the expected factors 4/8/8 (and d/m/f all give the same performance in cppnone)
All other build modes and especially cppavx2/cpp512y/cpp512z are unchanged
…de to disable autovectorization for cppnone)
…d df/nosimd, keep autovectorization
…1072): precompute jampR_sv for dmf/nosimd
…dd back colorsum timer
./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…rsSplitMerge.h header (minimise dependencies)
…itMerge.h header (minimise dependencies)
… back colorsum timer
./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…sing intrinsics on __x86_64__
Clean up the code to also allow scalar and (default) initializer list implementations
…insics and experimentalSIMD
Keep the original initializer list implementation as the default
…nd add back colorsum timer
./CODEGEN/generateAndCompare.sh gg_ttggg --mad
cd gg_ttggg.mad/SubProcesses
patch -i ../../patchS.patch
cd P1_gg_ttxggg/
patch -i ../../../patchP.patch
cd ../../..
…tovectorization in cppnone
Revert "[csm] TMP ggttggg code/results using fpvmerge/initlist but without autovectorization in cppnone" This reverts commit 240b5f5.
…vectorization in cppnone
Essentially I disabled autovectorization in cppnone when starting from this codebase:
git checkout a9b52d2 gg_ttggg.mad
Revert "[csm] TMP ggttggg code/results using upstream/master but without autovectorization in cppnone" This reverts commit 738f362.
…ate mgOnGpuVectorsSplitMerge.h
With respect to the last LUMI logs for upstream/master (commit 6baae79 in hack_ihel3p1):
- Performance seems unchanged everywhere
STARTED AT Sun 07 Dec 2025 04:13:17 PM EET
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -nocuda
ENDED(1) AT Sun 07 Dec 2025 06:26:14 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -scaling -nocuda
ENDED(1-scaling) AT Sun 07 Dec 2025 06:32:30 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -nocuda
ENDED(2) AT Sun 07 Dec 2025 06:35:42 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -nocuda
ENDED(2-scaling) AT Sun 07 Dec 2025 06:53:14 PM EET [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(3) AT Sun 07 Dec 2025 07:33:16 PM EET [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean -nocuda
ENDED(4) AT Sun 07 Dec 2025 07:43:00 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -nocuda
ENDED(5) AT Sun 07 Dec 2025 07:45:04 PM EET [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -nocuda'
ENDED(6) AT Sun 07 Dec 2025 07:45:04 PM EET [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -nocuda
ENDED(7) AT Sun 07 Dec 2025 07:47:00 PM EET [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean -nocuda
ENDED(8) AT Sun 07 Dec 2025 07:57:43 PM EET [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(9) AT Sun 07 Dec 2025 09:23:43 PM EET [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
No aborts found in logs
With respect to the last LUMI logs for upstream/master (commit c593242 in hack_ihel3p1):
- Performance seems unchanged everywhere
With respect to the last rd90 logs for upstream/master (commit 4178974 in hack_ihel3p1):
- Performance is around 5% better on CPU (mainly cppnone) and essentially the same everywhere else
With respect to the last rd90 logs for upstream/master (commit 5fce1aa in hack_ihel3p1):
- Performance is around 5% better on CPU (mainly m/cppnone) and essentially the same everywhere else
STARTED AT Sun Dec 7 07:53:39 PM CET 2025
(SM tests)
ENDED(1) AT Sun Dec 7 08:44:21 PM CET 2025 [Status=0]
(BSM tests)
ENDED(1) AT Sun Dec 7 08:48:17 PM CET 2025 [Status=0]
The issues described in #1072 are clarified in https://arxiv.org/abs/2510.05392v3, which should appear tomorrow. Essentially, there is no suboptimal scaling. The problem is that "cppnone" has some SSE/SSE2 color-level autovectorization in the color sum, so the speedup from cppnone to cppsse4 (which uses explicit event-level vectorization) is very low. SSE/SSE2 are not disabled because they are part of the System V ABI and simply cannot be disabled; notably, -march=x86-64 does not disable them. To compare against a real "no SIMD" scenario, I disabled autovectorization in the color sum. In that case the SIMD speedups are what one expects. The really good news is that SIMD color sums in mixed precision are optimized. I made a few minor extra improvements in #1073, but there was not much room for improvement. This table is from arXiv v3:
There are three tests/changes in the csm branch of #1073.
I closed #1072 as understood, and I am marking this #1073 as ready to be merged. You get the minor improvements in the table, plus the intrinsics and experimental::simd fpvmerge implementations, which can be useful as a future reference. This completes the madgraph4gpu work that I was doing on kernel splitting.
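For readers who want to try the "real no SIMD" comparison described above, the following is a minimal sketch of one possible way to switch off autovectorization for a single color-sum function. It is not the mechanism used in this PR, which is not shown in this conversation. The `NO_AUTOVEC` macro, the function `colorSum`, the 2x2 color matrix and the denominator are all hypothetical names and values chosen for illustration; the attribute is GCC-specific and expands to nothing on other compilers. The sum is accumulated in float from double inputs to mimic the mixed-precision case.

```cpp
#include <cstddef>

// GCC-only: disable the tree vectorizer for one function (assumption: other
// compilers need their own mechanism, so the macro expands to nothing there).
#if defined( __GNUC__ ) && !defined( __clang__ )
#define NO_AUTOVEC __attribute__( ( optimize( "no-tree-vectorize" ) ) )
#else
#define NO_AUTOVEC
#endif

constexpr std::size_t ncolor = 2; // illustrative number of color flows

// Illustrative color matrix and denominator (hypothetical values for this sketch)
constexpr float colorMatrix[ncolor][ncolor] = { { 16, -2 }, { -2, 16 } };
constexpr float colorDenom = 3;

// Color sum for one event: |M|^2 = sum_ij conj(jamp_i) * cf_ij * jamp_j / denom.
// The amplitudes arrive in double and are narrowed to float ("mixed" precision).
NO_AUTOVEC float colorSum( const double jampR[ncolor], const double jampI[ncolor] )
{
  float me = 0;
  for( std::size_t icol = 0; icol < ncolor; icol++ )
  {
    float ztempR = 0;
    float ztempI = 0;
    for( std::size_t jcol = 0; jcol < ncolor; jcol++ )
    {
      ztempR += colorMatrix[icol][jcol] * static_cast<float>( jampR[jcol] );
      ztempI += colorMatrix[icol][jcol] * static_cast<float>( jampI[jcol] );
    }
    // The color matrix is real, so real and imaginary parts contribute separately
    me += ztempR * static_cast<float>( jampR[icol] ) + ztempI * static_cast<float>( jampI[icol] );
  }
  return me / colorDenom;
}
```

Comparing the timing of such a function with and without the attribute, in an otherwise identical build, is one way to separate the effect of color-level autovectorization from that of explicit event-level vectorization.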
Hi Olivier, Daniele, I have marked you as reviewers. Let me know if you want to discuss this. Thanks

WIP patch for issue #1072
I moved deltaMEs to fptype2 from fptype, but I do not think this will improve the situation - probably fpvmerge is the culprit