@valassi valassi commented Nov 23, 2025

This is a WIP PR, for documentation only (not to be merged), replacing PR #601 (which I will close).

It includes the 2->6 process gg->ttgggg in various diagram splitting scenarios, including many that execute correctly on CPU and GPU.

The same techniques also make it possible to execute the 2->7 process gg->ttggggg on CPU (not on GPU), but I will not create a PR for that, as the source code is almost 1GB.

Full documentation is in https://arxiv.org/abs/2510.05392v2, which should appear tomorrow.

(Without these changes, the ggttg/ggttggg bridge tests and tmad tests failed, most likely due to a missing 'make clean'.)

In detail:
- in tput/throughputX.sh, use 'make -f cudacpp.mk' instead of 'make' (this enables faster rebuilds from ccache)
- in tput/throughputX.sh, profile diagramgroup1 instead of diagram1
- in tput/allTees.sh, always run 'make clean' (unless -nomakeclean is specified)
- in tput/allTees.sh, drop the -short option (always run ggttggg)
- in tmad/allTees.sh, always run 'make cleanall' (unless -nomakeclean is specified)
- in tmad/allTees.sh, improve debug printouts
…ew wf layout, optional CUDA Graphs) - all ok

With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Split processes (ggttggg, ggttgg) are much worse
  > CUDA (no blas, no graphs) is a factor ~10 slower at small grids and ~2.5 slower at large grids for ggttggg (~4 and ~2 for ggttgg)
  > C++ is 10-15% slower for ggttggg (up to 5% for ggttgg)
- Single-kernel processes are only moderately impacted
  > CUDA (no blas, no graphs) is ~20% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
=> Should try to keep the code but increase to 2000 diagrams per kernel?

With respect to the previous rd90 scaling logs for the 'hack_ihel4p2' codebase (commit 2893531):
- CUDA peak throughputs in ggttggg (with and without graphs) are 5% faster
- The only difference here is the improved memory layout: so it does help, but not much

STARTED  AT Sun Oct 19 06:01:47 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Sun Oct 19 06:33:21 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Sun Oct 19 06:47:17 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Sun Oct 19 06:53:26 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Sun Oct 19 07:10:41 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Sun Oct 19 07:20:00 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Sun Oct 19 07:38:16 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Sun Oct 19 07:48:41 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Sun Oct 19 07:59:00 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Sun Oct 19 08:03:49 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Sun Oct 19 08:08:36 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Sun Oct 19 08:13:20 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Sun Oct 19 08:18:18 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Sun Oct 19 08:30:31 PM CEST 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs
…w wf layout, optional CUDA Graphs) - all ok

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b), the picture is similar to the tput tests:
- Split processes (ggttggg, ggttgg) are much worse
  > CUDA (no blas, no graphs) is a factor ~2.5 slower for ggttggg (and ~1.5 for ggttgg)
  > C++ is ~15% slower for ggttggg (up to 5% for ggttgg)
- Single-kernel processes are only moderately impacted
  > CUDA (no blas, no graphs) is ~15% slower for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)

STARTED  AT Sun Oct 19 08:30:32 PM CEST 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean
(SM tests)
ENDED(1) AT Sun Oct 19 09:26:52 PM CEST 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean
(BSM tests)
ENDED(1) AT Sun Oct 19 09:32:22 PM CEST 2025 [Status=0]
… - they are all single-kernel again including ggttggg
…rnel) - failures in ggttggg/f

STARTED  AT Mon Oct 20 02:21:34 PM CEST 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean
(SM tests)
ENDED(1) AT Mon Oct 20 03:16:47 PM CEST 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean
(BSM tests)
ENDED(1) AT Mon Oct 20 03:22:17 PM CEST 2025 [Status=0]

tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt:ERROR! assertGpu: 'an illegal memory access was encountered' (700) in CPPProcess.cc:915
…ernel) - failures in ggttggg/f

With respect to the previous rd90 scaling logs for 'hack_ihel4p2' with 100 diagrams/kernel (commit d6144e4):
- CUDA/m for ggttggg/ggttgg is much better at small grids and 15% better peak at large grids
  > HOWEVER, CUDA/f for ggttggg fails
- C++ is 2% better for ggttggg/ggttgg

HOWEVER, with respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is still a factor ~2 slower (i.e. ~50% of the throughput) both at small and large grids
  > C++ is still 10% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt) are more moderately impacted
  > CUDA (no blas, no graphs) is ~20% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
=> There is still something to fix in both CUDA and C++

STARTED  AT Mon Oct 20 08:41:50 AM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Mon Oct 20 12:15:09 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Mon Oct 20 12:28:47 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Mon Oct 20 12:34:33 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Mon Oct 20 12:50:48 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Mon Oct 20 12:57:59 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Mon Oct 20 01:18:13 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Mon Oct 20 01:35:34 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Mon Oct 20 01:44:13 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Mon Oct 20 01:48:48 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Mon Oct 20 01:53:16 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Mon Oct 20 01:57:55 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Mon Oct 20 02:07:50 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Mon Oct 20 02:21:34 PM CEST 2025 [Status=0]

./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_bridge.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_graphs.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd1.txt:ERROR! C++ calculation (C++/GPU) failed
…erence files for gg_ttgggg

CUDACPP_RUNTEST_DUMPEVENTS=1 ./build.512z_m_inl0_hrd0/runTest_cpp.exe
\cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/
\cp ../../test/ref/dump* ../../../gg_ttgggg.sa/test/ref/

This comes from code that had been generated in hack_ihel4p2:
- setup with 2000 diagrams per group (15495 diagrams in 8 diagram groups)
- still with diagrams.h in CPPProcess.cc, not yet with one or more separate diagrams.cc
- still with direct writing of jamps to global memory, not yet going back to a local jamp_sv (see the sketch below)
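
For orientation, this layout can be pictured roughly as follows. The sketch uses hypothetical names and signatures (diagramgroup1/2, globalJamps), not the actual generated code, and reflects the two points above: all helpers still live in a single diagrams.h included from CPPProcess.cc, and each helper writes its jamp contributions straight to global memory.

  // hypothetical sketch of the hack_ihel4p2 layout described above (not the generated code)
  // diagrams.h, still #included from CPPProcess.cc in this snapshot:
  __device__ void diagramgroup1( const fptype* momenta, cxtype* globalJamps ); // diagrams 1..2000
  __device__ void diagramgroup2( const fptype* momenta, cxtype* globalJamps ); // diagrams 2001..4000
  // ... 8 groups in total for the 15495 diagrams ...

  // in CPPProcess.cc the diagram loop becomes a loop over groups, each writing or
  // updating the colour amplitudes (jamps) directly in global memory:
  diagramgroup1( momenta, globalJamps );
  diagramgroup2( momenta, globalJamps );
  // ... remaining groups, then the colour matrix is applied to the accumulated jamps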

Notes about gg_ttgggg code generation
- codegen of gg_ttgggg.sa took 11 minutes on itscrd90
  > total size 43MB (diagrams.h 37MB)
- codegen of gg_ttgggg.mad took 16 minutes on itgold91
  > total size 200MB (coloramps.inc 66MB, coloramps.h 53MB, diagrams.h 39MB, matrix1.f 11MB)
  > diagrams.h is larger than in gg_ttgggg.sa because it includes multichannel code
  > previous attempts at code generation using older code had failed many months ago on itscrd90

Notes about gg_ttgggg code build and execution (C++)
- the code builds relatively fast in C++ even with a single large CPPProcess.o
- on the first execution that created these logs, runTest.exe took 200s/512z
- on the next executions, runTest.exe takes 120s/512z, 160s/512y, 160s/avx2, 440s/sse4, 940s/none
- check.exe takes 1.6s/512z for 16 events (plus 1.6s helicities)

Notes about gg_ttgggg code build and execution (CUDA)
- the code build took 23 hours (on an A100 node)
- attempted executions of runTest/check.exe were interrupted after ~5min with 100% CPU and >10GB RAM
- (in later hack_ihel5 tests with diagrams.cc splitting, CUDA check.exe takes 15min in the goodHel filtering; a sketch of what this filtering does is given below)
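
For readers unfamiliar with it: the good-helicity filtering evaluates every helicity combination once and records which ones actually contribute, so that later iterations only loop over those. The snippet below is an illustrative, self-contained sketch of that idea, not the actual CPPProcess/sigmaKin code; calculate_me is a dummy stand-in for the per-helicity matrix-element evaluation.

  // illustrative sketch of the good-helicity filter idea (not the actual CPPProcess/sigmaKin code)
  #include <cstdio>
  constexpr int ncomb = 256; // 2^8 helicity combinations for the 8 external legs of gg_ttgggg
  double calculate_me( int ihel ) { return ihel % 2 ? 0. : 1.; } // dummy stand-in
  int main()
  {
    bool isGoodHel[ncomb] = {};
    for( int ihel = 0; ihel < ncomb; ihel++ )
      if( calculate_me( ihel ) != 0 ) isGoodHel[ihel] = true; // keep only contributing helicities
    // later iterations sum the ME only over helicities with isGoodHel[ihel] == true:
    // this one-off filtering is the step that takes ~15 minutes for gg_ttgggg in CUDA
    int nGood = 0;
    for( int ihel = 0; ihel < ncomb; ihel++ ) nGood += isGoodHel[ihel];
    printf( "good helicities: %d/%d\n", nGood, ncomb );
    return 0;
  }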
…tAccessJamp

Also formatting changes for CODEGEN
…nel (local for CUDA, output array for C++)

In CUDA, store to or update global jamps only at the end
(Note: this also includes a fix tested on ggttg: store on diagramgroup1 and update on the following diagramgroups)

In C++, simplify the code and remove HostAccessJamp
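
A rough sketch of the CUDA flow described above: each diagram-group kernel accumulates into a local jamp array and touches global memory only once at the end, with the first group storing and the following groups updating. The names and indexing below are illustrative (fptype/cxtype/ncolor are the usual codebase typedefs; the kernel signature, the index() layout helper and the isFirstGroup flag are hypothetical), not the actual generated code.

  // hypothetical sketch of one diagram-group kernel after this change (not the generated code)
  __global__ void diagramgroupN( const fptype* momenta, cxtype* globalJamps, bool isFirstGroup )
  {
    const int ievt = blockDim.x * blockIdx.x + threadIdx.x; // event handled by this thread
    cxtype jamp[ncolor] = {}; // local colour amplitudes, kept out of global memory
    // ... evaluate the wavefunctions and amplitudes of this group's diagrams,
    //     accumulating into the local jamp[] array ...
    for( int icol = 0; icol < ncolor; icol++ )
    {
      if( isFirstGroup )
        globalJamps[index( icol, ievt )] = jamp[icol];  // store on diagramgroup1
      else
        globalJamps[index( icol, ievt )] += jamp[icol]; // update on the following diagramgroups
    }
  }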
…ag/kernel) - all ok again

With respect to the previous rd90 scaling logs for 'hack_ihel4p2' without local jamp_sv (commit 48fed45):
- CUDA/m is a factor 2 better for ggttggg (and generally much better in other processes)
- CUDA/f for ggttggg succeeds again
- C++ is ~1-2% better for ggttggg

With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is now THE SAME SPEED AS IHEL3 both at small and large grids
  > C++ is still 5%-15% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt)
  > CUDA (no blas, no graphs) is up to ~10% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)

=> In summary, CUDA looks good, but there may still be something to fix for C++?
…g/kernel) - all ok again

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b), the picture is similar to the tput tests:
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is now THE SAME SPEED AS IHEL3 for ggttggg and ggttgg
  > C++ is still 5%-10% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt)
  > CUDA (no blas, no graphs) is up to ~5% slower for ggtt and ggttg
  > C++ is the same speed for ggtt and ~5% faster for ggttg

=> In summary, CUDA looks good, but there may still be something to fix for C++?
…tern __device__ __constant__' - build warnings and runtime assert

diagrams.cc(40): warning #20044-D: extern declaration of the entity mg5amcGpu::cIPC is treated as a static definition
diagrams.cc(41): warning #20044-D: extern declaration of the entity mg5amcGpu::cIPD is treated as a static definition
diagrams.cc(42): warning #20044-D: extern declaration of the entity mg5amcGpu::cHel is treated as a static definition

ERROR! assertGpu: 'an illegal memory access was encountered' (700) in CPPProcess.cc:794
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
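
For context, warning #20044-D is the standard nvcc behaviour when a __constant__ variable is declared 'extern' in a translation unit that is not built with relocatable device code: each .cu file then silently gets its own static copy, so the copy seen by diagrams.cc is never the one the host initialized. A minimal illustration of this class of problem follows (not the actual madgraph sources; depending on what the constant holds, the wrong copy can give zero values or downstream memory errors like the assert above).

  // constants.cu: defines the constant-memory symbol and fills it from the host
  __device__ __constant__ double cIPD[2];
  // host side: cudaMemcpyToSymbol( cIPD, hostIPD, 2 * sizeof( double ) );

  // diagrams.cu: without -rdc=true this 'extern' is treated as a static definition
  // (nvcc warning #20044-D), i.e. a second cIPD local to this file that is never filled
  extern __device__ __constant__ double cIPD[2];
  __global__ void useIPD( double* out ) { out[0] = cIPD[0]; } // reads the wrong copy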
…olAddress - cuda and C++ build/run for hrdcod=0
… different dpgs
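# The aliases below scan the MECalcOnly throughput of the dcd0 and dcd1 CUDA builds over
# grid sizes from 1x32 to 128x32 events (-p blocks threads iterations), with
# CUDACPP_RUNTIME_GOODHELICITIES=ALL presumably forcing all helicity combinations to be kept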

alias mscalingtest0='for b in 1 2 4 8 16 32 64 128; \
  do ( CUDACPP_RUNTIME_GOODHELICITIES=ALL ./build.cuda_m_inl0_hrd0_dcd0/check_cuda.exe -p $b 32 1 \
  | \grep "EvtsPerSec\[MECalcOnly\]" | awk -vb=$b "{printf \"%s %4d %3d\n\", \$5, b, 32}" ) \
  |& sed "s/Gpu.*Assert/Assert/"; done'

alias mscalingtest1='for b in 1 2 4 8 16 32 64 128; \
  do ( CUDACPP_RUNTIME_GOODHELICITIES=ALL ./build.cuda_m_inl0_hrd0_dcd1/check_cuda.exe -p $b 32 1 \
  | \grep "EvtsPerSec\[MECalcOnly\]" | awk -vb=$b "{printf \"%s %4d %3d\n\", \$5, b, 32}" ) \
  |& sed "s/Gpu.*Assert/Assert/"; done'

Results are only given for dpg1, dpg10, dpg100

The dpg1000 build is still running in non-parallel mode
(with DCDIAG=1 the build of diagrams1.cc easily grows to 60GB+ RSS i.e. >50% of the node RAM)

The dpg10000 builds have not been attempted yet.

---

BUILD TIMES on
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp

  make cleanall; \
  CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldavxs; \
  CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldcuda DCDIAG=0; \
  CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldcuda DCDIAG=1

(1)

dpg1dpf100 (155 diagram files)
- avxs:  4m
- dcd0: 24m
- dcd1:  8m

gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  30685088 Nov  2 13:24 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  32864032 Nov  2 13:24 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  31299488 Nov  2 13:24 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  34296680 Nov  2 13:24 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  19213480 Nov  2 13:24 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:24 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 459395256 Nov  2 13:48 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:48 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 717176192 Nov  2 13:56 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*

(2)

dpg10dpf100 (155 diagram files)
- avxs:  3m
- dcd0:  4m
- dcd1:  3m

gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  19795496 Nov  2 14:17 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  10548840 Nov  2 14:17 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  19962480 Nov  2 14:17 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  18136744 Nov  2 14:18 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  17133224 Nov  2 14:18 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:18 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 644387016 Nov  2 14:22 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:22 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 388979648 Nov  2 14:25 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*

(3)

dpg100dpf100 (155 diagram files)
- avxs:  4m
- dcd0:  7m
- dcd1:  6m

gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  12496440 Nov  2 15:00 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  15509608 Nov  2 15:00 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  13233312 Nov  2 15:00 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  15641632 Nov  2 15:00 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  14138528 Nov  2 15:01 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:01 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 481910800 Nov  2 15:08 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:08 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 324474280 Nov  2 15:14 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*

(4)

dpg1000dpf1000 (16 diagram files)
- avxs:    5m
- dcd0: 3h08m
- dcd1:   N/A (*) crashed, probably out of memory

gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  14519744 Nov  2 15:18 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  12209728 Nov  2 15:18 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  13061696 Nov  2 15:19 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  14187880 Nov  2 15:19 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  14519304 Nov  2 15:20 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:20 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 195724144 Nov  2 18:28 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 18:28 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas

(*)
Parallel build with DCDIAG=1 crashed:
the builds of five files were taking ~50GB RSS each (total RAM is 120 GB).
Non-parallel build completion was not attempted.

nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams1_cuda.o] Error 9
make[1]: *** Waiting nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams1_cuda.o] Error 9
make[1]: *** Waiting for unfinished jobs....
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams11_cuda.o] Error 9
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams16_cuda.o] Error 9
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams15_cuda.o] Error 9
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams13_cuda.o] Error 9
make[1]: Leaving directory '/data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg'

(5)

dpg10000dpf10000 (2 diagram files)
- avxs:    4m
- dcd0:   N/A (**) not attempted
- dcd1:   N/A (**) not attempted

gg_ttgggg.dpg10000dpf10000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg 14519608 Nov  4 07:21 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 13061992 Nov  4 07:21 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14524136 Nov  4 07:21 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14188176 Nov  4 07:22 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 12210024 Nov  4 07:22 build.avx2_m_inl0_hrd0/runTest_cpp.exe*

(**)
CUDA builds were not attempted for dpg10000.
With DCDIAG=1 these are likely to crash like those for dpg1000.
With DCDIAG=0 these are likely to take >24h, with suboptimal runtime performance.

(6)

dpg100000dpf100000 (1 diagram file)
- avxs:   N/A (***)  crashed, gcc segmentation fault
- dcd0:   N/A (****) stopped after >7 days
- dcd1:   N/A (****) not attempted

(***)
C++ build crashed (both parallel and non-parallel): gcc segmentation fault.

ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -DMGONGPU_HAS_NO_BLAS -fPIC -DMGONGPU_CHANNELID_DEBUG -c diagrams1.cc -o build.none_m_inl0_hrd0/diagrams1_cpp.o
g++: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugs.almalinux.org/> for instructions.
make[1]: *** [cudacpp.mk:836: build.none_m_inl0_hrd0/diagrams1_cpp.o] Error 4
make[1]: Leaving directory '/data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttgggg.dpg100000dpf100000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg'

(****)
CUDA build with a configuration similar to DCDIAG=0 had previously been stopped after >7 days.
No further CUDA builds have been attempted whether with DCDIAG=0 or DCDIAG=1.
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1000 --mindiagperfile 1000
Code generation and additional checks completed in 341 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100 --mindiagperfile 100
Code generation and additional checks completed in 348 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1 --mindiagperfile 100
Code generation and additional checks completed in 576 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10 --mindiagperfile 100
Code generation and additional checks completed in 461 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10000 --mindiagperfile 10000
Code generation and additional checks completed in 481 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100000 --mindiagperfile 100000
Code generation and additional checks completed in 394 seconds
…rd-a100 on hack_ihel6p1 codebase

There is no change: this is essentially the same code
…plates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1 --mindiagperfile 100
Code generation and additional checks completed in 344 seconds

Build times (C++/gold91): 2m20
gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:12:22 AM CET 2025
  Sat Nov 22 07:14:46 AM CET 2025

Build times (CUDA/a100): 48m (dcd0), 10m (dcd1)
gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 08:41:37 AM CET 2025
  Sat Nov 22 09:29:17 AM CET 2025
  Sat Nov 22 09:39:31 AM CET 2025
…mplates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10 --mindiagperfile 100
Code generation and additional checks completed in 488 seconds

Build times (C++/gold91): 3m10
gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:19:36 AM CET 2025
  Sat Nov 22 07:22:44 AM CET 2025

Build times (CUDA/a100): 4m30 (dcd0), 3m50 (dcd1)
gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 07:11:58 AM CET 2025
  Sat Nov 22 07:16:26 AM CET 2025
  Sat Nov 22 07:20:16 AM CET 2025
…emplates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100 --mindiagperfile 100
Code generation and additional checks completed in 356 seconds

Build times (C++/gold91): 3m
gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:26:57 AM CET 2025
  Sat Nov 22 07:29:50 AM CET 2025

Build times (CUDA/a100): 8m50 (dcd0), 6m20 (dcd1)
gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 07:24:15 AM CET 2025
  Sat Nov 22 07:33:07 AM CET 2025
  Sat Nov 22 07:39:25 AM CET 2025
… templates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1000 --mindiagperfile 1000
Code generation and additional checks completed in 372 seconds

Build times (C++/gold91): 3m40
gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:35:06 AM CET 2025
  Sat Nov 22 07:38:48 AM CET 2025

Build times (CUDA/a100): 2h57 (dcd0), FAILED (dcd1)
gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
...
(Node crashed)
[1123356.164224] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-14546.slice/session-7302.scope,task=cicc,pid=1243172,uid=14546
[1123356.164254] Out of memory: Killed process 1243172 (cicc) total-vm:157944388kB, anon-rss:70969088kB, file-rss:1152kB, shmem-rss:0kB, UID:14546 pgtables:299660kB oom_score_adj:0
[1123371.062904] oom_reaper: reaped process 1243172 (cicc), now anon-rss:1056kB, file-rss:1152kB, shmem-rss:0kB
...
ls -ltr build.*/.build* build.*/run*exe
-rw-r--r--. 1 avalassi zg         0 Nov 22 09:45 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 195705368 Nov 22 12:42 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov 22 12:42 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
…ut templates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10000 --mindiagperfile 10000
Code generation and additional checks completed in 358 seconds

Build times (C++/gold91): 28m
gg_ttgggg.dpg10000dpf10000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:59:12 AM CET 2025
  Sat Nov 22 08:27:22 AM CET 2025

Build times (CUDA/a100): N/A (dcd0), N/A (dcd1)
(DCDIAG=0 build not attempted as it would probably take too long)
(DCDIAG=1 build not attempted as the dpg1000 build failed)
…hout templates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100000 --mindiagperfile 100000
Code generation and additional checks completed in 489 seconds

Build times (C++/gold91): all five backends fail with "Segmentation fault"
gg_ttgggg.dpg100000dpf100000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  g++: internal compiler error: Segmentation fault signal terminated program cc1plus
  Please submit a full bug report, with preprocessed source if appropriate.
  make[1]: *** [cudacpp.mk:841: build.none_m_inl0_hrd0/diagrams1_cpp.o] Error 4
  Sat Nov 22 07:53:05 AM CET 2025
  Sat Nov 22 07:55:14 AM CET 2025

Build times (CUDA/a100): N/A (dcd0), N/A (dcd1)
(DCDIAG=0 build not attempted as it would probably take too long)
(DCDIAG=1 build not attempted as the dpg1000 build failed)
…plates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 200 --mindiagperfile 200
Code generation and additional checks completed in 525 seconds [in parallel to a software build]

Build times (C++/gold91): 3m10
gg_ttgggg.dpg200dpf200.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:41:52 AM CET 2025
  Sat Nov 22 07:44:04 AM CET 2025

Build times (CUDA/a100): 15m (dcd0), 17m (dcd1)
gg_ttgggg.dpg200dpf200.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 07:55:54 AM CET 2025
  Sat Nov 22 08:11:29 AM CET 2025
  Sat Nov 22 08:28:47 AM CET 2025
… dpg values

tput/logs_ggttgggg_sa_scan/scan.sh
  Sat Nov 22 09:13:07 AM CET 2025
  Sat Nov 22 09:17:23 AM CET 2025
…t different dpg values

dpg1dpf100
none 0.0 4.717894e-01 1.00x 33.913437
sse4 0.0 4.228031e-01 0.90x 37.842674
avx2 0.0 4.877365e-01 1.03x 32.804596
512y 0.0 5.069305e-01 1.07x 31.562513
512z 0.0 5.272687e-01 1.12x 30.345057

dpg10dpf100
none 0.0 1.672502e+00 1.00x 9.566506
sse4 0.0 1.556929e+00 0.93x 10.276643
avx2 0.0 3.484190e+00 2.08x 4.592172
512y 0.0 3.270659e+00 1.96x 4.891981
512z 0.0 4.175834e+00 2.50x 3.831570

dpg100dpf100
none 0.0 2.227907e+00 1.00x 7.181629
sse4 0.0 3.532345e+00 1.59x 4.529569
avx2 0.0 9.224621e+00 4.14x 1.734489
512y 0.0 1.044090e+01 4.69x 1.532434
512z 0.0 1.521281e+01 6.83x 1.051745

dpg200dpf200
none 0.0 2.540721e+00 1.00x 6.297424
sse4 0.0 4.628705e+00 1.82x 3.456690
avx2 0.0 1.100694e+01 4.33x 1.453629
512y 0.0 1.144459e+01 4.50x 1.398040
512z 0.0 1.798910e+01 7.08x 0.889428

dpg1000dpf1000
none 0.0 2.568091e+00 1.00x 6.230309
sse4 0.0 5.189057e+00 2.02x 3.083412
avx2 0.0 1.236506e+01 4.81x 1.293969
512y 0.0 1.311557e+01 5.11x 1.219924
512z 0.0 2.222294e+01 8.65x 0.719977

dpg10000dpf10000
none 0.0 2.644150e+00 1.00x 6.051095
sse4 0.0 5.453349e+00 2.06x 2.933977
avx2 0.0 1.314281e+01 4.97x 1.217396
512y 0.0 1.370831e+01 5.18x 1.167175
512z 0.0 2.289663e+01 8.66x 0.698793
… ggttgggg.sa at different dpg values

Sat Nov 22 04:34:39 PM CET 2025
Sat Nov 22 07:12:15 PM CET 2025
… for instrumenting color sums

Apply these as follows
  cd gg_ttgggg.<dpg>.sa/SubProcesses
  patch -i ../../patchS.patch
  cd P1_Sigma_sm_gg_ttxgggg/
  patch -i ../../../patchP.patch
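
The content of patchS.patch and patchP.patch is not reproduced here. For reference, instrumenting a device-side region such as the colour sum with CUDA event timers typically looks like the sketch below; this is illustrative only, not the actual patch content (colourSumKernel is a placeholder and error checking is omitted).

  // timer.cu: illustrative sketch of timing a device-side region with CUDA events;
  // this is NOT the content of patchS.patch/patchP.patch
  #include <cstdio>
  __global__ void colourSumKernel() {} // placeholder for the real colour-sum work
  int main()
  {
    cudaEvent_t t0, t1;
    cudaEventCreate( &t0 );
    cudaEventCreate( &t1 );
    cudaEventRecord( t0 );
    colourSumKernel<<<1, 32>>>(); // the region being instrumented
    cudaEventRecord( t1 );
    cudaEventSynchronize( t1 );
    float ms = 0;
    cudaEventElapsedTime( &ms, t0, t1 );
    printf( "colour sum: %f ms\n", ms );
    cudaEventDestroy( t0 );
    cudaEventDestroy( t1 );
    return 0;
  }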
…dpg1000dpf1000.sa

cd gg_ttgggg.dpg1000dpf1000.sa/SubProcesses
patch -i ../../patchS.patch
cd P1_Sigma_sm_gg_ttxgggg/
patch -i ../../../patchP.patch
…dpg100dpf100.sa

cd gg_ttgggg.dpg100dpf100.sa/SubProcesses
patch -i ../../patchS.patch
cd P1_Sigma_sm_gg_ttxgggg/
patch -i ../../../patchP.patch
…SIMD/gold

Also update CUDA/a100 script: use common random numbers to compare MEs to SIMD/gold (no curand on gold)
…ts for ggtt4g colortimer using CUDA/a100, now using common random numbers
@valassi valassi self-assigned this Nov 23, 2025
@valassi valassi marked this pull request as draft November 23, 2025 23:45