WIP (DOCUMENTATION ONLY): gg to ttgggg (2->6 process) with diagram splitting #1071
Draft
valassi wants to merge 398 commits into madgraph5:master from valassi:hack_ihel6p2_ggtt4g_pr
Conversation
…a new layout for wavefunctions
(Without these changes, the ggttg/ggttggg bridge tests and tmad tests failed - most likely due to a missing 'make clean')
In detail:
- in tput/throughputX.sh, use 'make -f cudacpp.mk' instead of 'make' (this enables faster rebuilds from ccache)
- in tput/throughputX.sh, profile diagramgroup1 instead of diagram1
- in tput/allTees.sh, always run 'make clean' (unless -nomakeclean is specified)
- in tput/allTees.sh, drop the -short option (always run ggttggg)
- in tmad/allTees.sh, always run 'make cleanall' (unless -nomakeclean is specified)
- in tmad/allTees.sh, improve debug printouts
…ew wf layout, optional CUDA Graphs) - all ok
With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Split processes (ggttggg, ggttgg) are much worse
  > CUDA (no blas, no graphs) is a factor ~10 slower for small grids and ~2.5 slower for ggttggg (~4 and ~2 for ggttgg)
  > C++ is 10-15% slower for ggttggg (up to 5% for ggttgg)
- Single-kernel processes are only moderately impacted
  > CUDA (no blas, no graphs) is ~20% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
=> Should try to keep the code but increase to 2000 diagrams per kernel?
With respect to the previous rd90 scaling logs for the 'hack_ihel4p2' codebase (commit 2893531):
- CUDA peak throughputs in ggttggg (with and without graphs) are 5% faster
- The only difference here is the improved memory layout: so it does help, but not much
STARTED AT Sun Oct 19 06:01:47 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Sun Oct 19 06:33:21 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Sun Oct 19 06:47:17 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Sun Oct 19 06:53:26 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Sun Oct 19 07:10:41 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Sun Oct 19 07:20:00 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Sun Oct 19 07:38:16 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Sun Oct 19 07:48:41 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Sun Oct 19 07:59:00 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Sun Oct 19 08:03:49 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Sun Oct 19 08:08:36 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Sun Oct 19 08:13:20 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Sun Oct 19 08:18:18 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Sun Oct 19 08:30:31 PM CEST 2025 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
No aborts found in logs
…w wf layout, optional CUDA Graphs) - all ok
With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b), this is like the tput tests:
- Split processes (ggttggg, ggttgg) are much worse
  > CUDA (no blas, no graphs) is a factor ~2.5 slower for ggttggg (and ~1.5 for ggttgg)
  > C++ is ~15% slower for ggttggg (up to 5% for ggttgg)
- Single-kernel processes are only moderately impacted
  > CUDA (no blas, no graphs) is ~15% slower for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
STARTED AT Sun Oct 19 08:30:32 PM CEST 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean (SM tests)
ENDED(1) AT Sun Oct 19 09:26:52 PM CEST 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean (BSM tests)
ENDED(1) AT Sun Oct 19 09:32:22 PM CEST 2025 [Status=0]
…l split ggttgggg but not ggttggg)
… - they are all single-kernel again including ggttggg
…rnel) - failures in ggttggg/f
STARTED AT Mon Oct 20 02:21:34 PM CEST 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean (SM tests)
ENDED(1) AT Mon Oct 20 03:16:47 PM CEST 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean (BSM tests)
ENDED(1) AT Mon Oct 20 03:22:17 PM CEST 2025 [Status=0]
tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt:ERROR! assertGpu: 'an illegal memory access was encountered' (700) in CPPProcess.cc:915
…ernel) - failures in ggttggg/f
With respect to the previous rd90 scaling logs for 'hack_ihel4p2' with 100 diagrams/kernel (commit d6144e4):
- CUDA/m for ggttggg/ggttgg is much better at small grids and 15% better peak at large grids
  > HOWEVER, CUDA/f for ggttggg fails
- C++ is 2% better for ggttggg/ggttgg
HOWEVER, with respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is still a factor ~2 (i.e. 50%) slower both at small and large grids
  > C++ is still 10% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt) are more moderately impacted
  > CUDA (no blas, no graphs) is ~20% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
=> there is still something to fix in both cuda and c++
STARTED AT Mon Oct 20 08:41:50 AM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Mon Oct 20 12:15:09 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Mon Oct 20 12:28:47 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Mon Oct 20 12:34:33 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Mon Oct 20 12:50:48 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Mon Oct 20 12:57:59 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Mon Oct 20 01:18:13 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Mon Oct 20 01:35:34 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Mon Oct 20 01:44:13 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Mon Oct 20 01:48:48 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Mon Oct 20 01:53:16 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Mon Oct 20 01:57:55 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Mon Oct 20 02:07:50 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Mon Oct 20 02:21:34 PM CEST 2025 [Status=0]
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_bridge.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_graphs.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd1.txt:ERROR! C++ calculation (C++/GPU) failed
…erence files for gg_ttgggg
CUDACPP_RUNTEST_DUMPEVENTS=1 ./build.512z_m_inl0_hrd0/runTest_cpp.exe
\cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/
\cp ../../test/ref/dump* ../../../gg_ttgggg.sa/test/ref/
This comes from code that had been generated in hack_ihel4p2:
- setup with 2000 diagrams per group (15495 diagrams in 8 diagram groups)
- still with diagrams.h in CPPProcess.cc, not yet with one or more separate diagrams.cc
- still with direct writing of jamps to global memory, not yet with going back to local jamp_sv
Notes about gg_ttgggg code generation
- codegen of gg_ttgggg.sa took 11 minutes on itscrd90
  > total size 43MB (diagrams.h 37MB)
- codegen of gg_ttgggg.mad took 16 minutes on itgold91
  > total size 200MB (coloramps.inc 66MB, coloramps.h 53MB, diagrams.h 39MB, matrix1.f 11MB)
  > diagrams.h is larger than in gg_ttgggg.sa because it includes multichannel code
  > previous attempts of code generation using older code had failed many months ago on itscrd90
Notes about gg_ttgggg code build and execution (C++)
- the code builds relatively fast in C++ even with a single large CPPProcess.o
- on the first execution that created these logs, runTest.exe took 200s/512z
- on the next executions, runTest.exe takes 120s/512z, 160s/512y, 160s/avx2, 440s/sse4, 940s/none
- check.exe takes 1.6s/512z for 16 events (plus 1.6s helicities)
Notes about gg_ttgggg code build and execution (CUDA)
- the code build took 23 hours (A100 node)
- attempted executions of runTest/check.exe were interrupted after ~5min with 100% CPU and >10GB RAM
- (in later hack_ihel5 tests with diagrams.cc splitting, CUDA check.exe takes 15min in the goodHel filtering)
… copy to global jamps only at the end
…tAccessJamp
Also formatting changes for CODEGEN
…nel (local for CUDA, output array for C++)
In CUDA, store to or update global jamps only at the end
(Note: this also includes a fix tested on ggttg, store on diagramgroup1 and update on the following diagramgroups)
In C++, simplify the code and remove HostAccessJamp
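As an illustration of the jamp handling described above, here is a minimal CUDA sketch: each diagram-group kernel accumulates color amplitudes in a local per-thread array and touches the global jamp buffer only once at the end, with group 1 storing and later groups updating. All names, the NCOLOR value and the [icolor][ievt] layout are assumptions of this sketch, not the actual generated code.

#include <thrust/complex.h>

typedef thrust::complex<double> cxtype; // illustrative complex type
constexpr int NCOLOR = 24;              // illustrative number of color flows

// One kernel per diagram group: jamps are accumulated locally and written to
// global memory only once, at the end of the kernel.
__global__ void diagramGroupKernel( cxtype* allJamps,       // global jamps, laid out as [icolor][ievt]
                                    const int nevt,         // number of events in the grid
                                    const bool isFirstGroup )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if( ievt >= nevt ) return;
  cxtype jamp_local[NCOLOR]; // local (per-thread) jamps for this event
  for( int ic = 0; ic < NCOLOR; ic++ ) jamp_local[ic] = cxtype( 0, 0 );
  // ... compute the Feynman diagrams of this group here, accumulating into jamp_local ...
  for( int ic = 0; ic < NCOLOR; ic++ )
  {
    if( isFirstGroup )
      allJamps[ic * nevt + ievt] = jamp_local[ic];  // diagram group 1: store
    else
      allJamps[ic * nevt + ievt] += jamp_local[ic]; // following diagram groups: update
  }
}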
…ag/kernel) - all ok again
With respect to the previous rd90 scaling logs for 'hack_ihel4p2' without local jamp_sv (commit 48fed45):
- CUDA/m is a factor 2 better for ggttggg (and generally much better in other processes)
- CUDA/f for ggttggg succeeds again
- C++ is ~1-2% better for ggttggg
With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is now THE SAME SPEED AS IHEL3 both at small and large grids
  > C++ is still 5%-15% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt)
  > CUDA (no blas, no graphs) is up to ~10% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
=> In summary, CUDA looks good, but there may be something still to fix for C++?
…g/kernel) - all ok again
With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b), this is like tput:
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is now THE SAME SPEED AS IHEL3 for ggttggg and ggttgg
  > C++ is still 5%-10% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt)
  > CUDA (no blas, no graphs) is up to ~5% slower for ggtt and ggttg
  > C++ is the same speed for ggtt and ~5% faster for ggttg
=> In summary, CUDA looks good, but there may be something still to fix for C++?
…ove diagram_headers.h
…h (and make it 'inline')
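(For context on the 'inline' change above, a generic C++ illustration, not the actual diagram_headers.h content: a function defined in a header that is included by several translation units, e.g. CPPProcess.cc and the split diagrams*.cc files, needs to be declared inline to avoid multiple-definition errors at link time.)

// illustrative_header.h (hypothetical example, not the real diagram_headers.h)
#ifndef ILLUSTRATIVE_HEADER_H
#define ILLUSTRATIVE_HEADER_H

// Without 'inline', every translation unit including this header would emit its
// own out-of-line definition and the final link would fail with multiple definitions.
inline double sumOfSquares( const double a, const double b )
{
  return a * a + b * b;
}

#endif // ILLUSTRATIVE_HEADER_H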
…tern __device__ __constant__' - build warnings and runtime assert
diagrams.cc(40): warning #20044-D: extern declaration of the entity mg5amcGpu::cIPC is treated as a static definition
diagrams.cc(41): warning #20044-D: extern declaration of the entity mg5amcGpu::cIPD is treated as a static definition
diagrams.cc(42): warning #20044-D: extern declaration of the entity mg5amcGpu::cHel is treated as a static definition
ERROR! assertGpu: 'an illegal memory access was encountered' (700) in CPPProcess.cc:794
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
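For reference, a minimal single-file CUDA sketch of the pattern behind warning #20044-D (the symbol name is taken from the log above, the kernel is hypothetical): without relocatable device code (nvcc -rdc=true), an 'extern' declaration of a __device__ __constant__ symbol cannot refer to a definition in another translation unit and is instead treated as a separate static definition, so kernels in this file would not see the values copied into the defining file's symbol. Possible ways out, as general CUDA options rather than the fix chosen in this PR, are building with -rdc=true plus device linking, keeping the __constant__ data and the kernels that use it in the same translation unit, or passing the parameters as kernel arguments.

// diagrams.cc-style translation unit (illustrative sketch)
#include <cuda_runtime.h>

// The real definition (filled via cudaMemcpyToSymbol) lives in another .cc file;
// this translation unit only sees the extern declaration below. Without -rdc=true
// nvcc warns (#20044-D) that the 'extern' is treated as a static definition, i.e.
// this file gets its own copy of cIPD instead of the one initialized elsewhere.
extern __device__ __constant__ double cIPD[2];

__global__ void useParameters( double* out )
{
  // Reads the local copy of cIPD, not the values copied into the defining
  // translation unit's symbol - consistent with the runtime assert reported above.
  out[0] = cIPD[0] + cIPD[1];
}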
…olAddress - cuda and C++ build/run for hrdcod=0
… different dpgs
alias mscalingtest0='for b in 1 2 4 8 16 32 64 128; \
do ( CUDACPP_RUNTIME_GOODHELICITIES=ALL ./build.cuda_m_inl0_hrd0_dcd0/check_cuda.exe -p $b 32 1 \
| \grep "EvtsPerSec\[MECalcOnly\]" | awk -vb=$b "{printf \"%s %4d %3d\n\", \$5, b, 32}" ) \
|& sed "s/Gpu.*Assert/Assert/"; done'
alias mscalingtest1='for b in 1 2 4 8 16 32 64 128; \
do ( CUDACPP_RUNTIME_GOODHELICITIES=ALL ./build.cuda_m_inl0_hrd0_dcd1/check_cuda.exe -p $b 32 1 \
| \grep "EvtsPerSec\[MECalcOnly\]" | awk -vb=$b "{printf \"%s %4d %3d\n\", \$5, b, 32}" ) \
|& sed "s/Gpu.*Assert/Assert/"; done'
Results are only given for dpg1, dpg10, dpg100
The dpg1000 build is still running in non-parallel mode
(with DCDIAG=1 the build of diagrams1.cc easily grows to 60GB+ RSS i.e. >50% of the node RAM)
The dpg10000
---
BUILD TIMES on itscrd-a100:
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp
make cleanall; \
CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldavxs; \
CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldcuda DCDIAG=0; \
CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldcuda DCDIAG=1
(1)
dpg1dpf100 (155 diagram files)
- avxs: 4m
- dcd0: 24m
- dcd1: 8m
gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe
-rw-r--r--. 1 avalassi zg 0 Nov 2 13:20 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 13:20 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 13:20 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 13:20 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 13:20 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg 30685088 Nov 2 13:24 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 32864032 Nov 2 13:24 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 31299488 Nov 2 13:24 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 34296680 Nov 2 13:24 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 19213480 Nov 2 13:24 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 13:24 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 459395256 Nov 2 13:48 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 13:48 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 717176192 Nov 2 13:56 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*
(2)
dpg10dpf100 (155 diagram files)
- avxs: 3m
- dcd0: 4m
- dcd1: 3m
gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:15 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:15 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:15 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:15 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:15 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg 19795496 Nov 2 14:17 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 10548840 Nov 2 14:17 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 19962480 Nov 2 14:17 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 18136744 Nov 2 14:18 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 17133224 Nov 2 14:18 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:18 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 644387016 Nov 2 14:22 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:22 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 388979648 Nov 2 14:25 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*
(3)
dpg100dpf100 (155 diagram files)
- avxs: 4m
- dcd0: 7m
- dcd1: 6m
gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:57 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:57 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:57 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:57 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 14:57 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg 12496440 Nov 2 15:00 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 15509608 Nov 2 15:00 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 13233312 Nov 2 15:00 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 15641632 Nov 2 15:00 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14138528 Nov 2 15:01 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:01 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 481910800 Nov 2 15:08 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:08 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 324474280 Nov 2 15:14 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*
(4)
dpg1000dpf1000 (16 diagram files)
- avxs: 5m
- dcd0: 3h08m
- dcd1: N/A (*) crashed, probably out of memory
gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:15 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:15 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:15 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:15 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:15 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg 14519744 Nov 2 15:18 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 12209728 Nov 2 15:18 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 13061696 Nov 2 15:19 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14187880 Nov 2 15:19 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14519304 Nov 2 15:20 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 15:20 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 195724144 Nov 2 18:28 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 2 18:28 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
(*)
Parallel build with DCDIAG=1 crashed:
the builds of five files were taking ~50GB RSS each (total RAM is 120 GB).
Non-parallel build completion was not attempted.
nvcc error : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams1_cuda.o] Error 9
make[1]: *** Waiting nvcc error : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams1_cuda.o] Error 9
make[1]: *** Waiting for unfinished jobs....
nvcc error : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams11_cuda.o] Error 9
nvcc error : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams16_cuda.o] Error 9
nvcc error : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams15_cuda.o] Error 9
nvcc error : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams13_cuda.o] Error 9
make[1]: Leaving directory '/data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg'
(5)
dpg10000dpf10000 (2 diagram files)
- avxs: 4m
- dcd0: N/A (**) not attempted
- dcd1: N/A (**) not attempted
gg_ttgggg.dpg10000dpf10000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe
-rw-r--r--. 1 avalassi zg 0 Nov 4 07:18 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 4 07:18 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 4 07:18 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 4 07:18 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg 0 Nov 4 07:18 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg 14519608 Nov 4 07:21 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 13061992 Nov 4 07:21 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14524136 Nov 4 07:21 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14188176 Nov 4 07:22 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 12210024 Nov 4 07:22 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
(**)
CUDA builds were not attempted for dpg10000.
With DCDIAG=1 these are likely to crash like those for dpg1000.
With DCDIAG=0 these are likely to take >24h, with suboptimal runtime performance.
(6)
dpg100000dpf100000 (1 diagram file)
- avxs: N/A (***) crashed, gcc segmentation fault
- dcd0: N/A (****) stopped after >7 days
- dcd1: N/A (****) not attempted
(***)
C++ build crashed (both parallel and non-parallel): gcc segmentation fault.
ccache g++ -I. -I../../src -O3 -std=c++17 -Wall -Wshadow -Wextra -ffast-math -march=x86-64 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -DMGONGPU_HAS_NO_BLAS -fPIC -DMGONGPU_CHANNELID_DEBUG -c diagrams1.cc -o build.none_m_inl0_hrd0/diagrams1_cpp.o
g++: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugs.almalinux.org/> for instructions.
make[1]: *** [cudacpp.mk:836: build.none_m_inl0_hrd0/diagrams1_cpp.o] Error 4
make[1]: Leaving directory '/data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttgggg.dpg100000dpf100000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg'
(****)
CUDA build with a configuration similar to DCDIAG=0 had previously been stopped after >7 days.
No further CUDA builds have been attempted whether with DCDIAG=0 or DCDIAG=1.
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1000 --mindiagperfile 1000
Code generation and additional checks completed in 341 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100 --mindiagperfile 100
Code generation and additional checks completed in 348 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1 --mindiagperfile 100
Code generation and additional checks completed in 576 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10 --mindiagperfile 100
Code generation and additional checks completed in 461 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10000 --mindiagperfile 10000
Code generation and additional checks completed in 481 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100000 --mindiagperfile 100000
Code generation and additional checks completed in 394 seconds
…rd-a100 on hack_ihel6p1 codebase
There is no change: this is essentially the same code
…plates)
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1 --mindiagperfile 100
Code generation and additional checks completed in 344 seconds
Build times (C++/gold91): 2m20
gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
Sat Nov 22 07:12:22 AM CET 2025
Sat Nov 22 07:14:46 AM CET 2025
Build times (CUDA/a100): 48m (dcd0), 10m (dcd1)
gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0; START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1; echo $START0; echo $START1; echo $(date)
Sat Nov 22 08:41:37 AM CET 2025
Sat Nov 22 09:29:17 AM CET 2025
Sat Nov 22 09:39:31 AM CET 2025
…mplates)
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10 --mindiagperfile 100
Code generation and additional checks completed in 488 seconds
Build times (C++/gold91): 3m10
gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
Sat Nov 22 07:19:36 AM CET 2025
Sat Nov 22 07:22:44 AM CET 2025
Build times (CUDA/a100): 4m30 (dcd0), 3m50 (dcd1)
gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0; START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1; echo $START0; echo $START1; echo $(date)
Sat Nov 22 07:11:58 AM CET 2025
Sat Nov 22 07:16:26 AM CET 2025
Sat Nov 22 07:20:16 AM CET 2025
…emplates)
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100 --mindiagperfile 100
Code generation and additional checks completed in 356 seconds
Build times (C++/gold91): 3m
gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
Sat Nov 22 07:26:57 AM CET 2025
Sat Nov 22 07:29:50 AM CET 2025
Build times (CUDA/a100): 8m50 (dcd0), 6m20 (dcd1)
gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0; START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1; echo $START0; echo $START1; echo $(date)
Sat Nov 22 07:24:15 AM CET 2025
Sat Nov 22 07:33:07 AM CET 2025
Sat Nov 22 07:39:25 AM CET 2025
… templates)
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1000 --mindiagperfile 1000
Code generation and additional checks completed in 372 seconds
Build times (C++/gold91): 3m40
gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
Sat Nov 22 07:35:06 AM CET 2025
Sat Nov 22 07:38:48 AM CET 2025
Build times (CUDA/a100): 2h57 (dcd0), FAILED (dcd1)
gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0; START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1; echo $START0; echo $START1; echo $(date)
...
(Node crashed)
[1123356.164224] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-14546.slice/session-7302.scope,task=cicc,pid=1243172,uid=14546
[1123356.164254] Out of memory: Killed process 1243172 (cicc) total-vm:157944388kB, anon-rss:70969088kB, file-rss:1152kB, shmem-rss:0kB, UID:14546 pgtables:299660kB oom_score_adj:0
[1123371.062904] oom_reaper: reaped process 1243172 (cicc), now anon-rss:1056kB, file-rss:1152kB, shmem-rss:0kB
...
ls -ltr build.*/.build* build.*/run*exe
-rw-r--r--. 1 avalassi zg 0 Nov 22 09:45 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 195705368 Nov 22 12:42 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg 0 Nov 22 12:42 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
…ut templates)
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10000 --mindiagperfile 10000
Code generation and additional checks completed in 358 seconds
Build times (C++/gold91): 28m
gg_ttgggg.dpg10000dpf10000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
Sat Nov 22 07:59:12 AM CET 2025
Sat Nov 22 08:27:22 AM CET 2025
Build times (CUDA/a100): N/A (dcd0), N/A (dcd1)
(DCDIAG=0 build not attempted as it would probably take too long)
(DCDIAG=1 build not attempted as the dpg1000 build failed)
…hout templates)
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100000 --mindiagperfile 100000
Code generation and additional checks completed in 489 seconds
Build times (C++/gold91): all five backends fail with "Segmentation fault"
gg_ttgggg.dpg100000dpf100000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
g++: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report, with preprocessed source if appropriate.
make[1]: *** [cudacpp.mk:841: build.none_m_inl0_hrd0/diagrams1_cpp.o] Error 4
Sat Nov 22 07:53:05 AM CET 2025
Sat Nov 22 07:55:14 AM CET 2025
Build times (CUDA/a100): N/A (dcd0), N/A (dcd1)
(DCDIAG=0 build not attempted as it would probably take too long)
(DCDIAG=1 build not attempted as the dpg1000 build failed)
…plates)
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp> ./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 200 --mindiagperfile 200
Code generation and additional checks completed in 525 seconds [in parallel to a software build]
Build times (C++/gold91): 3m10
gg_ttgggg.dpg200dpf200.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
Sat Nov 22 07:41:52 AM CET 2025
Sat Nov 22 07:44:04 AM CET 2025
Build times (CUDA/a100): 15m (dcd0), 17m (dcd1)
gg_ttgggg.dpg200dpf200.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg> make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0; START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1; echo $START0; echo $START1; echo $(date)
Sat Nov 22 07:55:54 AM CET 2025
Sat Nov 22 08:11:29 AM CET 2025
Sat Nov 22 08:28:47 AM CET 2025
…p1 codebase with 1000 dpg/dpf)
… dpg values
tput/logs_ggttgggg_sa_scan/scan.sh
Sat Nov 22 09:13:07 AM CET 2025
Sat Nov 22 09:17:23 AM CET 2025
…t different dpg values

dpg1dpf100
none 0.0 4.717894e-01 1.00x 33.913437
sse4 0.0 4.228031e-01 0.90x 37.842674
avx2 0.0 4.877365e-01 1.03x 32.804596
512y 0.0 5.069305e-01 1.07x 31.562513
512z 0.0 5.272687e-01 1.12x 30.345057

dpg10dpf100
none 0.0 1.672502e+00 1.00x 9.566506
sse4 0.0 1.556929e+00 0.93x 10.276643
avx2 0.0 3.484190e+00 2.08x 4.592172
512y 0.0 3.270659e+00 1.96x 4.891981
512z 0.0 4.175834e+00 2.50x 3.831570

dpg100dpf100
none 0.0 2.227907e+00 1.00x 7.181629
sse4 0.0 3.532345e+00 1.59x 4.529569
avx2 0.0 9.224621e+00 4.14x 1.734489
512y 0.0 1.044090e+01 4.69x 1.532434
512z 0.0 1.521281e+01 6.83x 1.051745

dpg200dpf200
none 0.0 2.540721e+00 1.00x 6.297424
sse4 0.0 4.628705e+00 1.82x 3.456690
avx2 0.0 1.100694e+01 4.33x 1.453629
512y 0.0 1.144459e+01 4.50x 1.398040
512z 0.0 1.798910e+01 7.08x 0.889428

dpg1000dpf1000
none 0.0 2.568091e+00 1.00x 6.230309
sse4 0.0 5.189057e+00 2.02x 3.083412
avx2 0.0 1.236506e+01 4.81x 1.293969
512y 0.0 1.311557e+01 5.11x 1.219924
512z 0.0 2.222294e+01 8.65x 0.719977

dpg10000dpf10000
none 0.0 2.644150e+00 1.00x 6.051095
sse4 0.0 5.453349e+00 2.06x 2.933977
avx2 0.0 1.314281e+01 4.97x 1.217396
512y 0.0 1.370831e+01 5.18x 1.167175
512z 0.0 2.289663e+01 8.66x 0.698793
… ggttgggg.sa at different dpg values
Sat Nov 22 04:34:39 PM CET 2025
Sat Nov 22 07:12:15 PM CET 2025
… for ggttgggg.sa at different dpg values
… for instrumenting color sums
Apply these as follows:
cd gg_ttgggg.<dpg>.sa/SubProcesses
patch -i ../../patchS.patch
cd P1_Sigma_sm_gg_ttxgggg/
patch -i ../../../patchP.patch
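As a rough illustration of what instrumenting the color sums can look like (the patch contents are not shown in this PR, so the kernel name and launch configuration below are hypothetical), a cudaEvent-based timer around a color-sum kernel launch:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the real color-sum kernel (hypothetical)
__global__ void colorSumKernel() {}

int main()
{
  cudaEvent_t start, stop;
  cudaEventCreate( &start );
  cudaEventCreate( &stop );
  cudaEventRecord( start );
  colorSumKernel<<<1024, 256>>>(); // placeholder launch configuration
  cudaEventRecord( stop );
  cudaEventSynchronize( stop );
  float ms = 0;
  cudaEventElapsedTime( &ms, start, stop ); // elapsed GPU time between the two events
  printf( "color sum: %f ms\n", ms );
  cudaEventDestroy( start );
  cudaEventDestroy( stop );
  return 0;
}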
…dpg1000dpf1000.sa
cd gg_ttgggg.dpg1000dpf1000.sa/SubProcesses
patch -i ../../patchS.patch
cd P1_Sigma_sm_gg_ttxgggg/
patch -i ../../../patchP.patch
…dpg100dpf100.sa
cd gg_ttgggg.dpg100dpf100.sa/SubProcesses
patch -i ../../patchS.patch
cd P1_Sigma_sm_gg_ttxgggg/
patch -i ../../../patchP.patch
…SIMD/gold
Also update CUDA/a100 script: use common random numbers to compare MEs to SIMD/gold (no curand on gold)
…ts for ggtt4g colortimer using CUDA/a100, now using common random numbers
This is a WIP PR for documentation only (not to be merged), replacing PR #601 (which I will close).
It includes the 2->6 process gg->ttgggg in various diagram-splitting scenarios, including many that execute correctly on CPU and GPU.
The same techniques also make it possible to execute the 2->7 process gg->ttggggg on CPU (not GPU), but I will not create a PR for that as the source code is almost 1GB.
Full documentation is in https://arxiv.org/abs/2510.05392v2, which should appear tomorrow.