Collaborator

@thowell thowell commented Jan 7, 2026

I'm not finding that setting wp.config.load_module_max_workers = 4 leads to a meaningful improvement in compilation time for testspeed.

  1. clear caches to ensure cold start
rm -rf ~/.cache/warp
rm -rf ~/.nv/ComputeCache
  2. run testspeed (note: --clear_kernel_cache seems to be insufficient for a true cold start? see @c0d1f1ed's comment here)
mjwarp-testspeed benchmark/aloha_pot/scene.xml --nconmax=24 --njmax=64 --verbosity=1
Warp 1.12.0.dev20260106 initialized:
   CUDA Toolkit 12.9, Driver 12.4
   Devices:
     "cpu"      : "CPU"
     "cuda:0"   : "NVIDIA RTX 6000 Ada Generation" (48 GiB, sm_89, mempool enabled) 
 ...
 Module mujoco_warp._src.smooth e142b0f load on device 'cuda:0' took 4301.18 ms  (compiled)
Module mujoco_warp._src.collision_driver de6dc98 load on device 'cuda:0' took 108.22 ms  (compiled)
Module _nxn_broadphase__locals__kernel_8c837afb 8c837af load on device 'cuda:0' took 380.59 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_bee57cf4 bee57cf load on device 'cuda:0' took 8752.90 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_cb21f988 cb21f98 load on device 'cuda:0' took 8202.03 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_2472f6ac 2472f6a load on device 'cuda:0' took 9070.16 ms  (compiled)
Module _primitive_narrowphase__locals__primitive_narrowphase_3a95495c 40a6592 load on device 'cuda:0' took 1054.58 ms  (compiled)
Module mujoco_warp._src.constraint 51cd7b3 load on device 'cuda:0' took 2189.22 ms  (compiled)
Module _actuator_velocity__locals__actuator_velocity_90e58dd6 ad563b2 load on device 'cuda:0' took 415.52 ms  (compiled)
Module mujoco_warp._src.passive 4bb84dd load on device 'cuda:0' took 1020.30 ms  (compiled)
Module mujoco_warp._src.forward 3875c24 load on device 'cuda:0' took 767.75 ms  (compiled)
Module mujoco_warp._src.support b7a7a89 load on device 'cuda:0' took 227.39 ms  (compiled)
Module _tile_cholesky_factorize_solve__locals__cholesky_factorize_solve_054bffd7 1b42599 load on device 'cuda:0' took 3145.28 ms  (compiled)
Module _tile_cholesky_factorize_solve__locals__cholesky_factorize_solve_e683b68a dd87c24 load on device 'cuda:0' took 2636.69 ms  (compiled)
Module mujoco_warp._src.solver e5aab2a load on device 'cuda:0' took 1663.45 ms  (compiled)
Module solve_init_jaref__locals__kernel_a59ba19b a59ba19 load on device 'cuda:0' took 120.45 ms  (compiled)
Module mul_m_dense__locals___mul_m_dense_0f38f179 e4856ff load on device 'cuda:0' took 13386.23 ms  (compiled)
Module mul_m_dense__locals___mul_m_dense_1c87edf7 95bf16a load on device 'cuda:0' took 13288.49 ms  (compiled)
Module update_constraint_gauss_cost__locals__kernel_9cb48b61 9cb48b6 load on device 'cuda:0' took 144.89 ms  (compiled)
Module update_gradient_JTDAJ_dense_tiled__locals__kernel_fee7e1f5 31f8346 load on device 'cuda:0' took 16712.15 ms  (compiled)
Module update_gradient_cholesky__locals__kernel_4c46c88d 93f2b07 load on device 'cuda:0' took 3145.85 ms  (compiled)
Module linesearch_jv_fused__locals__kernel_fa5b55ea fa5b55e load on device 'cuda:0' took 123.97 ms  (compiled)
Module linesearch_prepare_gauss__locals__kernel_cdec59d7 cdec59d load on device 'cuda:0' took 140.94 ms  (compiled)
Module mujoco_warp._src.solver e4e7906 load on device 'cuda:0' took 1894.41 ms  (compiled)
Module _tile_euler_dense__locals__euler_dense_bf64cd72 547ddee load on device 'cuda:0' took 1284.59 ms  (compiled)
Module _tile_euler_dense__locals__euler_dense_397ba6c9 a497b05 load on device 'cuda:0' took 1335.14 ms  (compiled)
Module mujoco_warp._src.benchmark 376ea83 load on device 'cuda:0' took 190.87 ms  (compiled)

Summary for 8192 parallel rollouts

Total JIT time: 95.91 s
Total simulation time: 4.12 s
Total steps per second: 1,986,919
Total realtime factor: 3,973.84 x
Total time per step: 503.29 ns
Total converged worlds: 8192 / 8192
  3. clear caches again
rm -rf ~/.cache/warp
rm -rf ~/.nv/ComputeCache
  4. run testspeed with --load_module_max_workers=4
mjwarp-testspeed benchmark/aloha_pot/scene.xml --nconmax=24 --njmax=64 --verbosity=1 --load_module_max_workers=4
Module mujoco_warp._src.smooth e142b0f load on device 'cuda:0' took 4298.49 ms  (compiled)
Module mujoco_warp._src.collision_driver de6dc98 load on device 'cuda:0' took 109.36 ms  (compiled)
Module _nxn_broadphase__locals__kernel_8c837afb 8c837af load on device 'cuda:0' took 382.20 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_bee57cf4 bee57cf load on device 'cuda:0' took 8685.90 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_cb21f988 cb21f98 load on device 'cuda:0' took 8174.77 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_2472f6ac 2472f6a load on device 'cuda:0' took 8996.53 ms  (compiled)
Module _primitive_narrowphase__locals__primitive_narrowphase_3a95495c 40a6592 load on device 'cuda:0' took 1073.99 ms  (compiled)
Module mujoco_warp._src.constraint 51cd7b3 load on device 'cuda:0' took 2189.90 ms  (compiled)
Module _actuator_velocity__locals__actuator_velocity_90e58dd6 ad563b2 load on device 'cuda:0' took 414.26 ms  (compiled)
Module mujoco_warp._src.passive 4bb84dd load on device 'cuda:0' took 1027.73 ms  (compiled)
Module mujoco_warp._src.forward 3875c24 load on device 'cuda:0' took 768.47 ms  (compiled)
Module mujoco_warp._src.support b7a7a89 load on device 'cuda:0' took 235.01 ms  (compiled)
Module _tile_cholesky_factorize_solve__locals__cholesky_factorize_solve_054bffd7 1b42599 load on device 'cuda:0' took 3130.19 ms  (compiled)
Module _tile_cholesky_factorize_solve__locals__cholesky_factorize_solve_e683b68a dd87c24 load on device 'cuda:0' took 2672.58 ms  (compiled)
Module mujoco_warp._src.solver e5aab2a load on device 'cuda:0' took 1639.78 ms  (compiled)
Module solve_init_jaref__locals__kernel_a59ba19b a59ba19 load on device 'cuda:0' took 120.57 ms  (compiled)
Module mul_m_dense__locals___mul_m_dense_0f38f179 e4856ff load on device 'cuda:0' took 13475.88 ms  (compiled)
Module mul_m_dense__locals___mul_m_dense_1c87edf7 95bf16a load on device 'cuda:0' took 13359.95 ms  (compiled)
Module update_constraint_gauss_cost__locals__kernel_9cb48b61 9cb48b6 load on device 'cuda:0' took 136.87 ms  (compiled)
Module update_gradient_JTDAJ_dense_tiled__locals__kernel_fee7e1f5 31f8346 load on device 'cuda:0' took 16746.00 ms  (compiled)
Module update_gradient_cholesky__locals__kernel_4c46c88d 93f2b07 load on device 'cuda:0' took 3117.09 ms  (compiled)
Module linesearch_jv_fused__locals__kernel_fa5b55ea fa5b55e load on device 'cuda:0' took 128.33 ms  (compiled)
Module linesearch_prepare_gauss__locals__kernel_cdec59d7 cdec59d load on device 'cuda:0' took 140.32 ms  (compiled)
Module mujoco_warp._src.solver e4e7906 load on device 'cuda:0' took 1867.08 ms  (compiled)
Module _tile_euler_dense__locals__euler_dense_bf64cd72 547ddee load on device 'cuda:0' took 1268.23 ms  (compiled)
Module _tile_euler_dense__locals__euler_dense_397ba6c9 a497b05 load on device 'cuda:0' took 1310.92 ms  (compiled)
Module mujoco_warp._src.benchmark 376ea83 load on device 'cuda:0' took 189.79 ms  (compiled)

Summary for 8192 parallel rollouts

Total JIT time: 95.87 s
Total simulation time: 4.12 s
Total steps per second: 1,986,515
Total realtime factor: 3,973.03 x
Total time per step: 503.39 ns
Total converged worlds: 8192 / 8192
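
To compare the two runs module by module rather than by eye, the load lines can be parsed and diffed. The following is a minimal sketch, not part of mujoco_warp or Warp: the helper names `parse_load_times` and `compare_runs` are hypothetical, and only three lines from each log above are embedded to keep it short.

```python
import re

# Pattern for the module-load lines printed in the logs above.
LOAD_RE = re.compile(
    r"Module (\S+) (\w+) load on device '[^']+' took ([\d.]+) ms"
)


def parse_load_times(log_text):
    """Map (module_name, module_hash) -> load time in ms (hypothetical helper)."""
    times = {}
    for name, mod_hash, ms in LOAD_RE.findall(log_text):
        times[(name, mod_hash)] = float(ms)
    return times


def compare_runs(baseline_log, candidate_log):
    """Return per-module rows (key, baseline_ms, candidate_ms) and both totals."""
    base = parse_load_times(baseline_log)
    cand = parse_load_times(candidate_log)
    rows = [(k, base[k], cand[k]) for k in base if k in cand]
    return rows, sum(r[1] for r in rows), sum(r[2] for r in rows)


# Three lines excerpted from the default run.
BASELINE = """
Module mujoco_warp._src.smooth e142b0f load on device 'cuda:0' took 4301.18 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_bee57cf4 bee57cf load on device 'cuda:0' took 8752.90 ms  (compiled)
Module update_gradient_JTDAJ_dense_tiled__locals__kernel_fee7e1f5 31f8346 load on device 'cuda:0' took 16712.15 ms  (compiled)
"""

# The same three modules from the --load_module_max_workers=4 run.
WITH_WORKERS = """
Module mujoco_warp._src.smooth e142b0f load on device 'cuda:0' took 4298.49 ms  (compiled)
Module ccd_kernel_builder__locals__ccd_kernel_bee57cf4 bee57cf load on device 'cuda:0' took 8685.90 ms  (compiled)
Module update_gradient_JTDAJ_dense_tiled__locals__kernel_fee7e1f5 31f8346 load on device 'cuda:0' took 16746.00 ms  (compiled)
"""

rows, base_total, cand_total = compare_runs(BASELINE, WITH_WORKERS)
for (name, _), base_ms, cand_ms in rows:
    print(f"{name}: {base_ms:.2f} ms -> {cand_ms:.2f} ms ({cand_ms - base_ms:+.2f} ms)")
print(f"total: {base_total:.2f} ms -> {cand_total:.2f} ms")
```

Keying on the (name, hash) pair matters here because mujoco_warp._src.solver appears twice in each log with different hashes. Pasting the full logs into the two strings gives the complete per-module comparison.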

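The expectation is up to a theoretical 4x improvement in JIT time with 4 workers, but the total JIT time is essentially unchanged between the two runs (95.91 s vs. 95.87 s), and the per-module load times are also nearly identical. Is the implementation in this PR incorrect? Thanks!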
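
As a rough sanity check on what a 4-worker run could achieve at best, the per-module times from the first log can be fed to a simple scheduling estimate. This is a sketch under strong assumptions: every module compiles independently, workers do not contend for CPU or the compiler, and modules are assigned greedily, longest first, to the least-loaded worker. It says nothing about how Warp actually schedules module loads, and `lpt_makespan` is a hypothetical helper name.

```python
import heapq

# Per-module load times (ms) from the default run above, in log order.
LOAD_MS = [
    4301.18, 108.22, 380.59, 8752.90, 8202.03, 9070.16, 1054.58, 2189.22,
    415.52, 1020.30, 767.75, 227.39, 3145.28, 2636.69, 1663.45, 120.45,
    13386.23, 13288.49, 144.89, 16712.15, 3145.85, 123.97, 140.94, 1894.41,
    1284.59, 1335.14, 190.87,
]


def lpt_makespan(times_ms, workers):
    """Greedy longest-processing-time estimate of wall time with `workers` workers."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for t in sorted(times_ms, reverse=True):
        lightest = heapq.heappop(loads)      # least-loaded worker so far
        heapq.heappush(loads, lightest + t)  # give it the next-longest module
    return max(loads)


serial_ms = sum(LOAD_MS)
# No schedule can beat either the longest single module or a perfect even split.
lower_bound_ms = max(max(LOAD_MS), serial_ms / 4)
estimate_ms = lpt_makespan(LOAD_MS, workers=4)

print(f"serial:      {serial_ms / 1000:.2f} s")
print(f"lower bound: {lower_bound_ms / 1000:.2f} s")
print(f"LPT, 4 wkrs: {estimate_ms / 1000:.2f} s")
```

Under these assumptions the best case is bounded below by the larger of the longest single module (16712.15 ms) and a quarter of the serial sum (roughly 24 s), so a result near the serial total suggests the modules are still being compiled one at a time.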

@c0d1f1ed @adenzler-nvidia

@thowell thowell linked an issue Jan 7, 2026 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Optimize GJK JIT time
