@valassi valassi commented Nov 23, 2025

This is a WIP PR, for documentation only (not to be merged), replacing PR #601 (which I will close).

It includes the 2->6 process gg->ttgggg in various diagram splitting scenarios, including many that execute correctly on CPU and GPU.

The same techniques also make it possible to execute the 2->7 process gg->ttggggg on CPU (not on GPU), but I will not create a PR for that, as the source code is almost 1GB.

Full documentation is in https://arxiv.org/abs/2510.05392v2, which should appear tomorrow.

(Without these changes, the ggttg/ggttggg bridge tests and tmad tests failed, most likely due to a missing 'make clean'.)

In detail:
- in tput/throughputX.sh, use 'make -f cudacpp.mk' instead of 'make' (this enables faster rebuilds from ccache)
- in tput/throughputX.sh, profile diagramgroup1 instead of diagram1
- in tput/allTees.sh, always run 'make clean' (unless -nomakeclean is specified)
- in tput/allTees.sh, drop the -short option (always run ggttggg)
- in tmad/allTees.sh, always run 'make cleanall' (unless -nomakeclean is specified)
- in tmad/allTees.sh, improve debug printouts
…ew wf layout, optional CUDA Graphs) - all ok

With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Split processes (ggttggg, ggttgg) are much worse
  > CUDA (no blas, no graphs) is a factor ~10 slower at small grids and ~2.5 slower at large grids for ggttggg (~4 and ~2 for ggttgg)
  > C++ is 10-15% slower for ggttggg (up to 5% for ggttgg)
- Single-kernel processes are only moderately impacted
  > CUDA (no blas, no graphs) is ~20% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
=> Should try to keep the code but increase to 2000 diagrams per kernel?

With respect to the previous rd90 scaling logs for the 'hack_ihel4p2' codebase (commit 2893531):
- CUDA peak throughputs in ggttggg (with and without graphs) are 5% faster
- The only difference here is the improved memory layout: so it does help, but not much

STARTED  AT Sun Oct 19 06:01:47 PM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Sun Oct 19 06:33:21 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Sun Oct 19 06:47:17 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Sun Oct 19 06:53:26 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Sun Oct 19 07:10:41 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Sun Oct 19 07:20:00 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Sun Oct 19 07:38:16 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Sun Oct 19 07:48:41 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Sun Oct 19 07:59:00 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Sun Oct 19 08:03:49 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Sun Oct 19 08:08:36 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Sun Oct 19 08:13:20 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Sun Oct 19 08:18:18 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Sun Oct 19 08:30:31 PM CEST 2025 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs

No aborts found in logs
…w wf layout, optional CUDA Graphs) - all ok

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b), the picture is similar to the tput tests:
- Split processes (ggttggg, ggttgg) are much worse
  > CUDA (no blas, no graphs) is a factor ~2.5 slower for ggttggg (and ~1.5 for ggttgg)
  > C++ is ~15% slower for ggttggg (up to 5% for ggttgg)
- Single-kernel processes are only moderately impacted
  > CUDA (no blas, no graphs) is ~15% slower for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)

STARTED  AT Sun Oct 19 08:30:32 PM CEST 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean
(SM tests)
ENDED(1) AT Sun Oct 19 09:26:52 PM CEST 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean
(BSM tests)
ENDED(1) AT Sun Oct 19 09:32:22 PM CEST 2025 [Status=0]
… - they are all single-kernel again including ggttggg
…rnel) - failures in ggttggg/f

STARTED  AT Mon Oct 20 02:21:34 PM CEST 2025
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -dmf -makeclean
(SM tests)
ENDED(1) AT Mon Oct 20 03:16:47 PM CEST 2025 [Status=0]
/data/avalassi/GPU2025/test-madgraph4gpu/epochX/cudacpp/tmad/teeMadX.sh -heftggbb -susyggtt -susyggt1t1 -smeftggtttt -dmf -makeclean
(BSM tests)
ENDED(1) AT Mon Oct 20 03:22:17 PM CEST 2025 [Status=0]

tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt:ERROR! assertGpu: 'an illegal memory access was encountered' (700) in CPPProcess.cc:915
…ernel) - failures in ggttggg/f

With respect to the previous rd90 scaling logs for 'hack_ihel4p2' with 100 diagrams/kernel (commit d6144e4):
- CUDA/m for ggttggg/ggttgg is much better at small grids and 15% better peak at large grids
  > HOWEVER, CUDA/f for ggttggg fails
- C++ is 2% better for ggttggg/ggttgg

HOWEVER, with respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is still a factor ~2 slower (i.e. ~50% of the throughput) both at small and large grids
  > C++ is still 10% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt) are more moderately impacted
  > CUDA (no blas, no graphs) is ~20% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)
=> There is still something to fix in both CUDA and C++

STARTED  AT Mon Oct 20 08:41:50 AM CEST 2025
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -makeclean
ENDED(1) AT Mon Oct 20 12:15:09 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -dmf -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -scaling -makeclean
ENDED(1-scaling) AT Mon Oct 20 12:28:47 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -blasOn -makeclean
ENDED(2) AT Mon Oct 20 12:34:33 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -blasOn -scaling -makeclean
ENDED(2-scaling) AT Mon Oct 20 12:50:48 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggttgg -ggttggg -dmf -useGraphs -makeclean
ENDED(3) AT Mon Oct 20 12:57:59 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -ggtt -ggttg -ggttgg -ggttggg -dmf -useGraphs -scaling -makeclean
ENDED(3-scaling) AT Mon Oct 20 01:18:13 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(4) AT Mon Oct 20 01:35:34 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg -gqttq -d_f -bridge -makeclean
ENDED(5) AT Mon Oct 20 01:44:13 PM CEST 2025 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -makeclean
ENDED(6) AT Mon Oct 20 01:48:48 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst -makeclean
ENDED(7) AT Mon Oct 20 01:53:16 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -makeclean
ENDED(8) AT Mon Oct 20 01:57:55 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -ggtt -ggttgg -dmf -noBlas -makeclean
ENDED(9) AT Mon Oct 20 02:07:50 PM CEST 2025 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(10) AT Mon Oct 20 02:21:34 PM CEST 2025 [Status=0]

./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_bridge.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0_graphs.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt:ERROR! C++ calculation (C++/GPU) failed
./tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd1.txt:ERROR! C++ calculation (C++/GPU) failed
…erence files for gg_ttgggg

CUDACPP_RUNTEST_DUMPEVENTS=1 ./build.512z_m_inl0_hrd0/runTest_cpp.exe
\cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/
\cp ../../test/ref/dump* ../../../gg_ttgggg.sa/test/ref/

This comes from code that had been generated in hack_ihel4p2:
- setup with 2000 diagrams per group (15495 diagrams in 8 diagram groups)
- still with diagrams.h in CPPProcess.cc, not yet with one or more separate diagrams.cc
- still with direct writing of jamps to global memory, not yet going back to a local jamp_sv (see the sketch below)
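
For orientation, this layout can be pictured roughly as follows. The sketch uses hypothetical names and signatures (diagramgroup1/2, globalJamps), not the actual generated code, and reflects the two points above: all helpers still live in a single diagrams.h included from CPPProcess.cc, and each helper writes its jamp contributions straight to global memory.

  // hypothetical sketch of the hack_ihel4p2 layout described above (not the generated code)
  // diagrams.h, still #included from CPPProcess.cc in this snapshot:
  __device__ void diagramgroup1( const fptype* momenta, cxtype* globalJamps ); // diagrams 1..2000
  __device__ void diagramgroup2( const fptype* momenta, cxtype* globalJamps ); // diagrams 2001..4000
  // ... 8 groups in total for the 15495 diagrams ...

  // in CPPProcess.cc the diagram loop becomes a loop over groups, each writing or
  // updating the colour amplitudes (jamps) directly in global memory:
  diagramgroup1( momenta, globalJamps );
  diagramgroup2( momenta, globalJamps );
  // ... remaining groups, then the colour matrix is applied to the accumulated jamps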

Notes about gg_ttgggg code generation
- codegen of gg_ttgggg.sa took 11 minutes on itscrd90
  > total size 43MB (diagrams.h 37MB)
- codegen of gg_ttgggg.mad took 16 minutes on itgold91
  > total size 200MB (coloramps.inc 66MB, coloramps.h 53MB, diagrams.h 39MB, matrix1.f 11MB)
  > diagrams.h is larger than in gg_ttgggg.sa because it includes multichannel code
  > previous attempts at code generation using older code had failed many months ago on itscrd90

Notes about gg_ttgggg code build and execution (C++)
- the code builds relatively fast in C++ even with a single large CPPProcess.o
- on the first execution that created these logs, runTest.exe took 200s/512z
- on the next executions, runTest.exe takes 120s/512z, 160s/512y, 160s/avx2, 440s/sse4, 940s/none
- check.exe takes 1.6s/512z for 16 events (plus 1.6s helicities)

Notes about gg_ttgggg code build and execution (CUDA)
- the code build took 23 hours (on an A100 node)
- attempted executions of runTest/check.exe were interrupted after ~5min with 100% CPU and >10GB RAM
- (in later hack_ihel5 tests with diagrams.cc splitting, CUDA check.exe takes 15min in the goodHel filtering; a sketch of what this filtering does is given below)
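
For readers unfamiliar with it: the good-helicity filtering evaluates every helicity combination once and records which ones actually contribute, so that later iterations only loop over those. The snippet below is an illustrative, self-contained sketch of that idea, not the actual CPPProcess/sigmaKin code; calculate_me is a dummy stand-in for the per-helicity matrix-element evaluation.

  // illustrative sketch of the good-helicity filter idea (not the actual CPPProcess/sigmaKin code)
  #include <cstdio>
  constexpr int ncomb = 256; // 2^8 helicity combinations for the 8 external legs of gg_ttgggg
  double calculate_me( int ihel ) { return ihel % 2 ? 0. : 1.; } // dummy stand-in
  int main()
  {
    bool isGoodHel[ncomb] = {};
    for( int ihel = 0; ihel < ncomb; ihel++ )
      if( calculate_me( ihel ) != 0 ) isGoodHel[ihel] = true; // keep only contributing helicities
    // later iterations sum the ME only over helicities with isGoodHel[ihel] == true:
    // this one-off filtering is the step that takes ~15 minutes for gg_ttgggg in CUDA
    int nGood = 0;
    for( int ihel = 0; ihel < ncomb; ihel++ ) nGood += isGoodHel[ihel];
    printf( "good helicities: %d/%d\n", nGood, ncomb );
    return 0;
  }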
…tAccessJamp

Also formatting changes for CODEGEN
…nel (local for CUDA, output array for C++)

In CUDA, store to or update global jamps only at the end
(Note: this also includes a fix tested on ggttg: store on diagramgroup1 and update on the following diagramgroups)

In C++, simplify the code and remove HostAccessJamp
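
A rough sketch of the CUDA flow described above: each diagram-group kernel accumulates into a local jamp array and touches global memory only once at the end, with the first group storing and the following groups updating. The names and indexing below are illustrative (fptype/cxtype/ncolor are the usual codebase typedefs; the kernel signature, the index() layout helper and the isFirstGroup flag are hypothetical), not the actual generated code.

  // hypothetical sketch of one diagram-group kernel after this change (not the generated code)
  __global__ void diagramgroupN( const fptype* momenta, cxtype* globalJamps, bool isFirstGroup )
  {
    const int ievt = blockDim.x * blockIdx.x + threadIdx.x; // event handled by this thread
    cxtype jamp[ncolor] = {}; // local colour amplitudes, kept out of global memory
    // ... evaluate the wavefunctions and amplitudes of this group's diagrams,
    //     accumulating into the local jamp[] array ...
    for( int icol = 0; icol < ncolor; icol++ )
    {
      if( isFirstGroup )
        globalJamps[index( icol, ievt )] = jamp[icol];  // store on diagramgroup1
      else
        globalJamps[index( icol, ievt )] += jamp[icol]; // update on the following diagramgroups
    }
  }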
…ag/kernel) - all ok again

With respect to the previous rd90 scaling logs for 'hack_ihel4p2' without local jamp_sv (commit 48fed45):
- CUDA/m is a factor 2 better for ggttggg (and generally much better in other processes)
- CUDA/f for ggttggg succeeds again
- C++ is ~1-2% better for ggttggg

With respect to the last rd90 scaling logs for the 'hack_ihel3_sep25' codebase (commit 6e5d26a):
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is now THE SAME SPEED AS IHEL3 both at small and large grids
  > C++ is still 5%-15% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt)
  > CUDA (no blas, no graphs) is up to ~10% slower for both small and large grids for ggtt and ggttg
  > C++ is the same speed for ggtt (and possibly faster for ggttg?)

=> In summary, CUDA looks good, but there may still be something to fix for C++?
…g/kernel) - all ok again

With respect to the last itscrd90 logs for the 'hack_ihel3_sep25' codebase (commit 10c3e3b), the picture is similar to the tput tests:
- Complex processes (ggttggg, ggttgg)
  > CUDA (no blas, no graphs) is now THE SAME SPEED AS IHEL3 for ggttggg and ggttgg
  > C++ is still 5%-10% slower for ggttggg (up to 5% for ggttgg)
- Simpler processes (ggttg, ggtt)
  > CUDA (no blas, no graphs) is up to ~5% slower for ggtt and ggttg
  > C++ is the same speed for ggtt and ~5% faster for ggttg

=> In summary, CUDA looks good, but there may still be something to fix for C++?
…tern __device__ __constant__' - build warnings and runtime assert

diagrams.cc(40): warning #20044-D: extern declaration of the entity mg5amcGpu::cIPC is treated as a static definition
diagrams.cc(41): warning #20044-D: extern declaration of the entity mg5amcGpu::cIPD is treated as a static definition
diagrams.cc(42): warning #20044-D: extern declaration of the entity mg5amcGpu::cHel is treated as a static definition

ERROR! assertGpu: 'an illegal memory access was encountered' (700) in CPPProcess.cc:794
runTest_cuda.exe: GpuRuntime.h:26: void assertGpu(cudaError_t, const char*, int, bool): Assertion `code == gpuSuccess' failed.
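
For context, warning #20044-D is the standard nvcc behaviour when a __constant__ variable is declared 'extern' in a translation unit that is not built with relocatable device code: each .cu file then silently gets its own static copy, so the copy seen by diagrams.cc is never the one the host initialized. A minimal illustration of this class of problem follows (not the actual madgraph sources; depending on what the constant holds, the wrong copy can give zero values or downstream memory errors like the assert above).

  // constants.cu: defines the constant-memory symbol and fills it from the host
  __device__ __constant__ double cIPD[2];
  // host side: cudaMemcpyToSymbol( cIPD, hostIPD, 2 * sizeof( double ) );

  // diagrams.cu: without -rdc=true this 'extern' is treated as a static definition
  // (nvcc warning #20044-D), i.e. a second cIPD local to this file that is never filled
  extern __device__ __constant__ double cIPD[2];
  __global__ void useIPD( double* out ) { out[0] = cIPD[0]; } // reads the wrong copy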
…olAddress - cuda and C++ build/run for hrdcod=0
… different dpgs
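# The aliases below scan the MECalcOnly throughput of the dcd0 and dcd1 CUDA builds over
# grid sizes from 1x32 to 128x32 events (-p blocks threads iterations), with
# CUDACPP_RUNTIME_GOODHELICITIES=ALL presumably forcing all helicity combinations to be kept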

alias mscalingtest0='for b in 1 2 4 8 16 32 64 128; \
  do ( CUDACPP_RUNTIME_GOODHELICITIES=ALL ./build.cuda_m_inl0_hrd0_dcd0/check_cuda.exe -p $b 32 1 \
  | \grep "EvtsPerSec\[MECalcOnly\]" | awk -vb=$b "{printf \"%s %4d %3d\n\", \$5, b, 32}" ) \
  |& sed "s/Gpu.*Assert/Assert/"; done'

alias mscalingtest1='for b in 1 2 4 8 16 32 64 128; \
  do ( CUDACPP_RUNTIME_GOODHELICITIES=ALL ./build.cuda_m_inl0_hrd0_dcd1/check_cuda.exe -p $b 32 1 \
  | \grep "EvtsPerSec\[MECalcOnly\]" | awk -vb=$b "{printf \"%s %4d %3d\n\", \$5, b, 32}" ) \
  |& sed "s/Gpu.*Assert/Assert/"; done'

Results are only given for dpg1, dpg10, dpg100

The dpg1000 build is still running in non-parallel mode
(with DCDIAG=1 the build of diagrams1.cc easily grows to 60GB+ RSS i.e. >50% of the node RAM)

The dpg10000 builds have not been attempted yet.

---

BUILD TIMES on
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp

  make cleanall; \
  CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldavxs; \
  CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldcuda DCDIAG=0; \
  CCACHE_RECACHE=1 time make -j15 -f cudacpp.mk bldcuda DCDIAG=1

(1)

dpg1dpf100 (155 diagram files)
- avxs:  4m
- dcd0: 24m
- dcd1:  8m

gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:20 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  30685088 Nov  2 13:24 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  32864032 Nov  2 13:24 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  31299488 Nov  2 13:24 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  34296680 Nov  2 13:24 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  19213480 Nov  2 13:24 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:24 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 459395256 Nov  2 13:48 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 13:48 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 717176192 Nov  2 13:56 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*

(2)

dpg10dpf100 (155 diagram files)
- avxs:  3m
- dcd0:  4m
- dcd1:  3m

gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:15 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  19795496 Nov  2 14:17 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  10548840 Nov  2 14:17 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  19962480 Nov  2 14:17 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  18136744 Nov  2 14:18 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  17133224 Nov  2 14:18 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:18 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 644387016 Nov  2 14:22 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:22 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 388979648 Nov  2 14:25 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*

(3)

dpg100dpf100 (155 diagram files)
- avxs:  4m
- dcd0:  7m
- dcd1:  6m

gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 14:57 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  12496440 Nov  2 15:00 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  15509608 Nov  2 15:00 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  13233312 Nov  2 15:00 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  15641632 Nov  2 15:00 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  14138528 Nov  2 15:01 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:01 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 481910800 Nov  2 15:08 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:08 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 324474280 Nov  2 15:14 build.cuda_m_inl0_hrd0_dcd1/runTest_cuda.exe*

(4)

dpg1000dpf1000 (16 diagram files)
- avxs:    5m
- dcd0: 3h08m
- dcd1:   N/A (*) crashed, probably out of memory

gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:15 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg  14519744 Nov  2 15:18 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  12209728 Nov  2 15:18 build.avx2_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  13061696 Nov  2 15:19 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  14187880 Nov  2 15:19 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg  14519304 Nov  2 15:20 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 15:20 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 195724144 Nov  2 18:28 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov  2 18:28 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas

(*)
Parallel build with DCDIAG=1 crashed:
the builds of five files were taking ~50GB RSS each (total RAM is 120 GB).
Non-parallel build completion was not attempted.

nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams1_cuda.o] Error 9
make[1]: *** Waiting nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams1_cuda.o] Error 9
make[1]: *** Waiting for unfinished jobs....
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams11_cuda.o] Error 9
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams16_cuda.o] Error 9
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams15_cuda.o] Error 9
nvcc error   : 'cicc' died due to signal 9 (Kill signal)
make[1]: *** [cudacpp.mk:841: build.cuda_m_inl0_hrd0_dcd1/diagrams13_cuda.o] Error 9
make[1]: Leaving directory '/data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg'

(5)

dpg10000dpf10000 (2 diagram files)
- avxs:    4m
- dcd0:   N/A (**) not attempted
- dcd1:   N/A (**) not attempted

gg_ttgggg.dpg10000dpf10000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
ls -ltr build.*/.build* build.*/runTest*exe

-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.none_m_inl0_hrd0/.build.none_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.sse4_m_inl0_hrd0/.build.sse4_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.avx2_m_inl0_hrd0/.build.avx2_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.512y_m_inl0_hrd0/.build.512y_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rw-r--r--. 1 avalassi zg        0 Nov  4 07:18 build.512z_m_inl0_hrd0/.build.512z_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasNoBlas
-rwxr-xr-x. 1 avalassi zg 14519608 Nov  4 07:21 build.none_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 13061992 Nov  4 07:21 build.512y_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14524136 Nov  4 07:21 build.sse4_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 14188176 Nov  4 07:22 build.512z_m_inl0_hrd0/runTest_cpp.exe*
-rwxr-xr-x. 1 avalassi zg 12210024 Nov  4 07:22 build.avx2_m_inl0_hrd0/runTest_cpp.exe*

(**)
CUDA builds were not attempted for dpg10000.
With DCDIAG=1 these are likely to crash like those for dpg1000.
With DCDIAG=0 these are likely to take >24h, with suboptimal runtime performance.

(6)

dpg100000dpf100000 (1 diagram file)
- avxs:   N/A (***)  crashed, gcc segmentation fault
- dcd0:   N/A (****) stopped after >7 days
- dcd1:   N/A (****) not attempted

(***)
C++ build crashed (both parallel and non-parallel): gcc segmentation fault.

ccache g++  -I. -I../../src -O3  -std=c++17 -Wall -Wshadow -Wextra -ffast-math   -march=x86-64  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_FLOAT -DMGONGPU_HAS_NO_BLAS -fPIC -DMGONGPU_CHANNELID_DEBUG -c diagrams1.cc -o build.none_m_inl0_hrd0/diagrams1_cpp.o
g++: internal compiler error: Segmentation fault signal terminated program cc1plus
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://bugs.almalinux.org/> for instructions.
make[1]: *** [cudacpp.mk:836: build.none_m_inl0_hrd0/diagrams1_cpp.o] Error 4
make[1]: Leaving directory '/data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp/gg_ttgggg.dpg100000dpf100000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg'

(****)
CUDA build with a configuration similar to DCDIAG=0 had previously been stopped after >7 days.
No further CUDA builds have been attempted whether with DCDIAG=0 or DCDIAG=1.
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1000 --mindiagperfile 1000
Code generation and additional checks completed in 341 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100 --mindiagperfile 100
Code generation and additional checks completed in 348 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1 --mindiagperfile 100
Code generation and additional checks completed in 576 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10 --mindiagperfile 100
Code generation and additional checks completed in 461 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10000 --mindiagperfile 10000
Code generation and additional checks completed in 481 seconds
[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu2/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100000 --mindiagperfile 100000
Code generation and additional checks completed in 394 seconds
…rd-a100 on hack_ihel6p1 codebase

There is no change: this is essentially the same code
…plates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1 --mindiagperfile 100
Code generation and additional checks completed in 344 seconds

Build times (C++/gold91): 2m20
gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:12:22 AM CET 2025
  Sat Nov 22 07:14:46 AM CET 2025

Build times (CUDA/a100): 48m (dcd0), 10m (dcd1)
gg_ttgggg.dpg1dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 08:41:37 AM CET 2025
  Sat Nov 22 09:29:17 AM CET 2025
  Sat Nov 22 09:39:31 AM CET 2025
…mplates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10 --mindiagperfile 100
Code generation and additional checks completed in 488 seconds

Build times (C++/gold91): 3m10
gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:19:36 AM CET 2025
  Sat Nov 22 07:22:44 AM CET 2025

Build times (CUDA/a100): 4m30 (dcd0), 3m50 (dcd1)
gg_ttgggg.dpg10dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 07:11:58 AM CET 2025
  Sat Nov 22 07:16:26 AM CET 2025
  Sat Nov 22 07:20:16 AM CET 2025
…emplates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100 --mindiagperfile 100
Code generation and additional checks completed in 356 seconds

Build times (C++/gold91): 3m
gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:26:57 AM CET 2025
  Sat Nov 22 07:29:50 AM CET 2025

Build times (CUDA/a100): 8m50 (dcd0), 6m20 (dcd1)
gg_ttgggg.dpg100dpf100.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 07:24:15 AM CET 2025
  Sat Nov 22 07:33:07 AM CET 2025
  Sat Nov 22 07:39:25 AM CET 2025
… templates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 1000 --mindiagperfile 1000
Code generation and additional checks completed in 372 seconds

Build times (C++/gold91): 3m40
gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:35:06 AM CET 2025
  Sat Nov 22 07:38:48 AM CET 2025

Build times (CUDA/a100): 2h57 (dcd0), FAILED (dcd1)
gg_ttgggg.dpg1000dpf1000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
...
(Node crashed)
[1123356.164224] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-14546.slice/session-7302.scope,task=cicc,pid=1243172,uid=14546
[1123356.164254] Out of memory: Killed process 1243172 (cicc) total-vm:157944388kB, anon-rss:70969088kB, file-rss:1152kB, shmem-rss:0kB, UID:14546 pgtables:299660kB oom_score_adj:0
[1123371.062904] oom_reaper: reaped process 1243172 (cicc), now anon-rss:1056kB, file-rss:1152kB, shmem-rss:0kB
...
ls -ltr build.*/.build* build.*/run*exe
-rw-r--r--. 1 avalassi zg         0 Nov 22 09:45 build.cuda_m_inl0_hrd0_dcd0/.build.cuda_m_inl0_hrd0_dcd0_hasCurand_hasNoHiprand_hasBlas
-rwxr-xr-x. 1 avalassi zg 195705368 Nov 22 12:42 build.cuda_m_inl0_hrd0_dcd0/runTest_cuda.exe*
-rw-r--r--. 1 avalassi zg         0 Nov 22 12:42 build.cuda_m_inl0_hrd0_dcd1/.build.cuda_m_inl0_hrd0_dcd1_hasCurand_hasNoHiprand_hasBlas
…ut templates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 10000 --mindiagperfile 10000
Code generation and additional checks completed in 358 seconds

Build times (C++/gold91): 28m
gg_ttgggg.dpg10000dpf10000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:59:12 AM CET 2025
  Sat Nov 22 08:27:22 AM CET 2025

Build times (CUDA/a100): N/A (dcd0), N/A (dcd1)
(DCDIAG=0 build not attempted as it would probably take too long)
(DCDIAG=1 build not attempted as the dpg1000 build failed)
…hout templates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 100000 --mindiagperfile 100000
Code generation and additional checks completed in 489 seconds

Build times (C++/gold91): all five backends fail with "Segmentation fault"
gg_ttgggg.dpg100000dpf100000.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  g++: internal compiler error: Segmentation fault signal terminated program cc1plus
  Please submit a full bug report, with preprocessed source if appropriate.
  make[1]: *** [cudacpp.mk:841: build.none_m_inl0_hrd0/diagrams1_cpp.o] Error 4
  Sat Nov 22 07:53:05 AM CET 2025
  Sat Nov 22 07:55:14 AM CET 2025

Build times (CUDA/a100): N/A (dcd0), N/A (dcd1)
(DCDIAG=0 build not attempted as it would probably take too long)
(DCDIAG=1 build not attempted as the dpg1000 build failed)
…plates)

[avalassi@itscrd-a100 gcc11/usr] /data/avalassi/GPU2023/test-madgraph4gpu/epochX/cudacpp>
./CODEGEN/generateAndCompare.sh gg_ttgggg --maxdiagpergroup 200 --mindiagperfile 200
Code generation and additional checks completed in 525 seconds [in parallel to a software build]

Build times (C++/gold91): 3m10
gg_ttgggg.dpg200dpf200.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldall; echo $START; echo $(date)
  Sat Nov 22 07:41:52 AM CET 2025
  Sat Nov 22 07:44:04 AM CET 2025

Build times (CUDA/a100): 15m (dcd0), 17m (dcd1)
gg_ttgggg.dpg200dpf200.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg>
make cleanall; START0=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=0;
START1=$(date); CCACHE_RECACHE=1 make -j -f cudacpp.mk bldcuda DCDIAG=1;
echo $START0; echo $START1; echo $(date)
  Sat Nov 22 07:55:54 AM CET 2025
  Sat Nov 22 08:11:29 AM CET 2025
  Sat Nov 22 08:28:47 AM CET 2025
… dpg values

tput/logs_ggttgggg_sa_scan/scan.sh
  Sat Nov 22 09:13:07 AM CET 2025
  Sat Nov 22 09:17:23 AM CET 2025
…t different dpg values

dpg1dpf100
none 0.0 4.717894e-01 1.00x 33.913437
sse4 0.0 4.228031e-01 0.90x 37.842674
avx2 0.0 4.877365e-01 1.03x 32.804596
512y 0.0 5.069305e-01 1.07x 31.562513
512z 0.0 5.272687e-01 1.12x 30.345057

dpg10dpf100
none 0.0 1.672502e+00 1.00x 9.566506
sse4 0.0 1.556929e+00 0.93x 10.276643
avx2 0.0 3.484190e+00 2.08x 4.592172
512y 0.0 3.270659e+00 1.96x 4.891981
512z 0.0 4.175834e+00 2.50x 3.831570

dpg100dpf100
none 0.0 2.227907e+00 1.00x 7.181629
sse4 0.0 3.532345e+00 1.59x 4.529569
avx2 0.0 9.224621e+00 4.14x 1.734489
512y 0.0 1.044090e+01 4.69x 1.532434
512z 0.0 1.521281e+01 6.83x 1.051745

dpg200dpf200
none 0.0 2.540721e+00 1.00x 6.297424
sse4 0.0 4.628705e+00 1.82x 3.456690
avx2 0.0 1.100694e+01 4.33x 1.453629
512y 0.0 1.144459e+01 4.50x 1.398040
512z 0.0 1.798910e+01 7.08x 0.889428

dpg1000dpf1000
none 0.0 2.568091e+00 1.00x 6.230309
sse4 0.0 5.189057e+00 2.02x 3.083412
avx2 0.0 1.236506e+01 4.81x 1.293969
512y 0.0 1.311557e+01 5.11x 1.219924
512z 0.0 2.222294e+01 8.65x 0.719977

dpg10000dpf10000
none 0.0 2.644150e+00 1.00x 6.051095
sse4 0.0 5.453349e+00 2.06x 2.933977
avx2 0.0 1.314281e+01 4.97x 1.217396
512y 0.0 1.370831e+01 5.18x 1.167175
512z 0.0 2.289663e+01 8.66x 0.698793
… ggttgggg.sa at different dpg values

Sat Nov 22 04:34:39 PM CET 2025
Sat Nov 22 07:12:15 PM CET 2025
… for instrumenting color sums

Apply these as follows
  cd gg_ttgggg.<dpg>.sa/SubProcesses
  patch -i ../../patchS.patch
  cd P1_Sigma_sm_gg_ttxgggg/
  patch -i ../../../patchP.patch
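
The content of patchS.patch and patchP.patch is not reproduced here. For reference, instrumenting a device-side region such as the colour sum with CUDA event timers typically looks like the sketch below; this is illustrative only, not the actual patch content (colourSumKernel is a placeholder and error checking is omitted).

  // timer.cu: illustrative sketch of timing a device-side region with CUDA events;
  // this is NOT the content of patchS.patch/patchP.patch
  #include <cstdio>
  __global__ void colourSumKernel() {} // placeholder for the real colour-sum work
  int main()
  {
    cudaEvent_t t0, t1;
    cudaEventCreate( &t0 );
    cudaEventCreate( &t1 );
    cudaEventRecord( t0 );
    colourSumKernel<<<1, 32>>>(); // the region being instrumented
    cudaEventRecord( t1 );
    cudaEventSynchronize( t1 );
    float ms = 0;
    cudaEventElapsedTime( &ms, t0, t1 );
    printf( "colour sum: %f ms\n", ms );
    cudaEventDestroy( t0 );
    cudaEventDestroy( t1 );
    return 0;
  }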
…dpg1000dpf1000.sa

cd gg_ttgggg.dpg1000dpf1000.sa/SubProcesses
patch -i ../../patchS.patch
cd P1_Sigma_sm_gg_ttxgggg/
patch -i ../../../patchP.patch
…dpg100dpf100.sa

cd gg_ttgggg.dpg100dpf100.sa/SubProcesses
patch -i ../../patchS.patch
cd P1_Sigma_sm_gg_ttxgggg/
patch -i ../../../patchP.patch
…SIMD/gold

Also update CUDA/a100 script: use common random numbers to compare MEs to SIMD/gold (no curand on gold)
…ts for ggtt4g colortimer using CUDA/a100, now using common random numbers
@valassi valassi self-assigned this Nov 23, 2025
@valassi valassi marked this pull request as draft November 23, 2025 23:45