Collaborator

@thowell thowell commented Dec 18, 2025

this pr depends on #931, #934, and #935 and adds a parameter (exposed in mjwarp-testspeed as --nefcdof) that specifies the maximum number of dofs per constraint. capping this number can significantly reduce memory utilization by shrinking efc.J and efc.J_colind.
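to illustrate why capping the dofs per constraint shrinks both arrays, here is a minimal sketch of a fixed-width sparse row: each constraint stores at most nefcdof values plus one dof (column) index per value. the layout, the zero padding, and the helper names (pack_row, row_dot) are assumptions for illustration only, not the data structure implemented in this pr.

```python
def pack_row(dense_row, nefcdof):
    """Pack a dense Jacobian row into fixed-width (values, colind) lists.

    Hypothetical helper: keeps only nonzero entries and pads up to nefcdof.
    """
    nonzero = [(j, v) for j, v in enumerate(dense_row) if v != 0.0]
    if len(nonzero) > nefcdof:
        raise ValueError(f"row needs {len(nonzero)} dofs, but nefcdof={nefcdof}")
    pad = nefcdof - len(nonzero)
    values = [v for _, v in nonzero] + [0.0] * pad
    # Padded slots point at column 0 but carry a 0.0 value, so they add nothing.
    colind = [j for j, _ in nonzero] + [0] * pad
    return values, colind


def row_dot(values, colind, vec):
    """Sparse row times dense vector, touching only the stored columns."""
    return sum(v * vec[j] for v, j in zip(values, colind))


nv = 12
dense = [0.0] * nv
dense[3], dense[4], dense[9] = 1.0, -2.0, 0.5

values, colind = pack_row(dense, nefcdof=4)
qvel = [float(i) for i in range(nv)]

# The packed row gives the same product as the dense row.
assert row_dot(values, colind, qvel) == sum(a * b for a, b in zip(dense, qvel))
print(values, colind)
```

in this sketch a row costs 2 * nefcdof entries (values plus indices) instead of nv entries, so the cap only pays off when nefcdof is well below nv / 2.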

benchmark/cloth/scene.xml

Loading model from: benchmark/cloth/scene.xml...
  nbody: 918 nv: 2706 ngeom: 921 nu: 0 is_sparse: True
  broadphase: SAP_TILE broadphase_filter: PLANE|SPHERE|OBB
  solver: CG cone: PYRAMIDAL iterations: 100 iterative linesearch iterations: 50
  integrator: EULER graph_conditional: True
Data
  nworld: 256 naconmax: 128000 njmax: 4000

compare integration of #931, #934, and #935 with main (f2f7957)

mjwarp-testspeed benchmark/cloth/scene.xml --nworld=256 --nconmax=500 --njmax=4000 --nstep=10 --event_trace=True --memory=True

this pr:

Rolling out 10 steps at dt = 0.005...

Summary for 256 parallel rollouts

Total JIT time: 0.31 s
Total simulation time: 0.62 s
Total steps per second: 4,115
Total realtime factor: 20.57 x
Total time per step: 243032.03 ns
Total converged worlds: 256 / 256

Event trace:

step: 240856.00
  forward: 240550.40
    fwd_position: 101425.19
      kinematics: 615.04
      com_pos: 336.00
      camlight: 42.40
      flex: 203.20
      crb: 246.40
      tendon_armature: 6.00
      collision: 876.00
        sap_broadphase: 778.70
        convex_narrowphase: 5.20
        primitive_narrowphase: 62.80
      make_constraint: 99036.00
      transmission: 5.60
    sensor_pos: 5.20
    fwd_velocity: 2840.00
      com_vel: 353.20
      passive: 1814.40
      rne: 640.00
      tendon_bias: 5.20
    sensor_vel: 5.20
    fwd_actuation: 18.40
    fwd_acceleration: 1906.40
      xfrc_accumulate: 1669.85
    solve: 134287.60
      mul_m: 69.60
      solve_m: 91.20
    sensor_acc: 6.00
  euler: 289.60

Total memory: 21526.32 MB / 48640.12 MB (44.26%)
Model memory (0.20%):
 (no field >= 1% of utilized memory)
Data memory (99.80%):
 efc.J_colind: 10570.31 MB (49.10%)
 efc.J: 10570.31 MB (49.10%)

main (f2f7957):

Rolling out 10 steps at dt = 0.005...

Summary for 256 parallel rollouts

Total JIT time: 6.78 s
Total simulation time: 3.07 s
Total steps per second: 833
Total realtime factor: 4.16 x
Total time per step: 1201065.04 ns
Total converged worlds: 256 / 256

Event trace:

step: 1198886.00
  forward: 1198597.20
    fwd_position: 633470.62
      kinematics: 615.47
      com_pos: 339.60
      camlight: 42.40
      flex: 252525.60
      crb: 278.80
      tendon_armature: 4.40
      collision: 883.20
        sap_broadphase: 786.00
        convex_narrowphase: 6.40
        primitive_narrowphase: 64.00
      make_constraint: 378716.00
      transmission: 5.20
    sensor_pos: 5.20
    fwd_velocity: 2846.00
      com_vel: 345.20
      passive: 1836.80
      rne: 629.60
      tendon_bias: 4.80
    sensor_vel: 5.20
    fwd_actuation: 15.60
    fwd_acceleration: 1953.60
      xfrc_accumulate: 1713.60
    solve: 560238.00
      mul_m: 92.00
      solve_m: 94.40
    sensor_acc: 5.60
  euler: 271.60

Total memory: 17788.09 MB / 48640.12 MB (36.57%)
Model memory (0.24%):
 (no field >= 1% of utilized memory)
Data memory (99.76%):
 flexedge_J: 6820.49 MB (38.34%)
 efc.J: 10625.00 MB (59.73%)

summary

  • SPS: 833 -> 4,115 ✅
  • total memory: 17788.09 MB -> 21526.32 MB ❌

throughput improves ~4.9x (833 -> 4,115 SPS), but total memory regresses by 3738.23 MB (~21%) due to the introduction of efc.J_colind, which is the same size as efc.J (10570.31 MB each)
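the reported array sizes are consistent with a simple estimate: one 4-byte entry per (world, constraint row, dof column), with 1 MB = 2^20 bytes. the element size and the nworld x njmax x nv shape are assumptions inferred from the numbers above, not taken from the code, and array_mb is a hypothetical helper.

```python
def array_mb(nworld, njmax, ncol, bytes_per_entry=4):
    """Size in MB (2**20 bytes) of an nworld x njmax x ncol array."""
    return nworld * njmax * ncol * bytes_per_entry / 2**20


# Values from the benchmark header: nworld=256, njmax=4000, nv=2706.
efc_j = array_mb(nworld=256, njmax=4000, ncol=2706)
both = 2 * efc_j  # efc.J plus efc.J_colind, assumed to share one shape

print(f"efc.J: {efc_j:.2f} MB")  # efc.J: 10570.31 MB
print(f"share of 21526.32 MB total: {100 * both / 21526.32:.1f}%")
```

under these assumptions the two arrays together account for roughly 98% of the total, which matches the two 49.10% entries in the memory report.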


compare sparse constraint rows with main

with the changes introduced in this pr, we can significantly reduce memory utilization by capping the maximum number of dofs per constraint with --nefcdof=64.

mjwarp-testspeed benchmark/cloth/scene.xml --nworld=256 --nconmax=500 --njmax=4000 --nstep=10 --event_trace=True --memory=True --nefcdof=64
Rolling out 10 steps at dt = 0.005...

Summary for 256 parallel rollouts

Total JIT time: 0.29 s
Total simulation time: 0.38 s
Total steps per second: 6,662
Total realtime factor: 33.31 x
Total time per step: 150097.21 ns
Total converged worlds: 256 / 256

Event trace:

step: 148139.20
  forward: 147822.80
    fwd_position: 6694.40
      kinematics: 638.80
      com_pos: 350.40
      camlight: 42.80
      flex: 206.80
      crb: 243.60
      tendon_armature: 6.00
      collision: 910.00
        sap_broadphase: 809.60
        convex_narrowphase: 4.80
        primitive_narrowphase: 65.60
      make_constraint: 4229.60
      transmission: 6.00
    sensor_pos: 5.20
    fwd_velocity: 2866.00
      com_vel: 362.40
      passive: 1836.40
      rne: 633.60
      tendon_bias: 5.20
    sensor_vel: 5.20
    fwd_actuation: 19.20
    fwd_acceleration: 1983.20
      xfrc_accumulate: 1736.58
    solve: 136190.80
      mul_m: 69.60
      solve_m: 94.00
    sensor_acc: 6.40
  euler: 299.20

Total memory: 885.69 MB / 48640.12 MB (1.82%)
Model memory (4.76%):
 dof_tri_row: 13.97 MB (1.58%)
 dof_tri_col: 13.97 MB (1.58%)
Data memory (95.24%):
 cdof: 15.86 MB (1.79%)
 cinert: 8.96 MB (1.01%)
 flexedge_J_colind: 15.12 MB (1.71%)
 flexedge_J: 15.12 MB (1.71%)
 crb: 8.96 MB (1.01%)
 cdof_dot: 15.86 MB (1.79%)
 efc.J_colind: 250.00 MB (28.23%)
 efc.J: 250.00 MB (28.23%)
 efc.quad: 11.72 MB (1.32%)

summary

  • SPS: 833 -> 6,662 (~8.0x) ✅
  • total memory: 17788.09 MB -> 885.69 MB (saves 16902.40 MB, ~16.5 GB) ✅
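a quick arithmetic check of these summary figures, using only numbers reported in the logs above. the GB conversion assumes 1 GB = 1024 MB, and the column counts assume efc.J rows are nv wide without the cap and nefcdof wide with it.

```python
sps_main, sps_capped = 833, 6662
mem_main_mb, mem_capped_mb = 17788.09, 885.69
nv, nefcdof = 2706, 64

speedup = sps_capped / sps_main
saved_mb = mem_main_mb - mem_capped_mb
saved_pct = 100 * saved_mb / mem_main_mb
row_shrink = nv / nefcdof  # assumed width ratio of an efc.J row

print(f"speedup: {speedup:.2f}x")
print(f"memory saved: {saved_mb:.2f} MB ({saved_mb / 1024:.1f} GB, {saved_pct:.1f}%)")
print(f"columns per constraint row: {nv} -> {nefcdof} ({row_shrink:.1f}x fewer)")
```

the row width ratio (about 42x) agrees with the drop in efc.J from 10570.31 MB to 250.00 MB.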

increase throughput by increasing the number of worlds

with the reduced memory footprint, we can also increase the number of worlds to --nworld=4096 (keeping --nefcdof=64)

Rolling out 10 steps at dt = 0.005...

Summary for 4096 parallel rollouts

Total JIT time: 0.82 s
Total simulation time: 4.19 s
Total steps per second: 9,777
Total realtime factor: 48.88 x
Total time per step: 102283.59 ns
Total converged worlds: 4096 / 4096

Event trace:

step: 102088.00
  forward: 101824.05
    fwd_position: 6266.05
      kinematics: 413.12
      com_pos: 413.95
      camlight: 3.43
      flex: 344.03
      crb: 297.60
      tendon_armature: 0.32
      collision: 760.82
        sap_broadphase: 751.27
        convex_narrowphase: 0.35
        primitive_narrowphase: 7.38
      make_constraint: 4028.85
      transmission: 0.37
    sensor_pos: 0.40
    fwd_velocity: 2404.78
      com_vel: 242.83
      passive: 1668.13
      rne: 491.75
      tendon_bias: 0.37
    sensor_vel: 0.40
    fwd_actuation: 4.40
    fwd_acceleration: 1499.00
      xfrc_accumulate: 1342.71
    solve: 91645.10
      mul_m: 109.47
      solve_m: 35.28
    sensor_acc: 0.32
  euler: 262.90

Total memory: 13538.90 MB / 48640.12 MB (27.83%)
Model memory (0.31%):
 (no field >= 1% of utilized memory)
Data memory (99.69%):
 cdof: 253.69 MB (1.87%)
 cinert: 143.44 MB (1.06%)
 flexedge_J_colind: 241.97 MB (1.79%)
 flexedge_J: 241.97 MB (1.79%)
 crb: 143.44 MB (1.06%)
 cdof_dot: 253.69 MB (1.87%)
 efc.J_colind: 4000.00 MB (29.54%)
 efc.J: 4000.00 MB (29.54%)
 efc.quad: 187.50 MB (1.38%)

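SPS: 6,662 -> 9,777 (~1.5x). even with 16x more worlds, total memory (13538.90 MB) stays below main at --nworld=256 (17788.09 MB).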
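the summary lines appear to be related as follows: steps per second is the inverse of the time per step, and the realtime factor is steps per second times dt. these relations are inferred from the reported numbers rather than from the tool's source, and the function names are hypothetical.

```python
def steps_per_second(time_per_step_ns):
    """Steps per second implied by a per-step time in nanoseconds."""
    return 1e9 / time_per_step_ns


def realtime_factor(time_per_step_ns, dt):
    """Simulated seconds per wall-clock second."""
    return steps_per_second(time_per_step_ns) * dt


dt = 0.005
for label, ns in [("nworld=256", 150097.21), ("nworld=4096", 102283.59)]:
    print(label, round(steps_per_second(ns)), round(realtime_factor(ns, dt), 2))
# nworld=256 6662 33.31
# nworld=4096 9777 48.88
```

the time per step itself looks like the total simulation time divided by nstep * nworld (for example 4.19 s / (10 * 4096) is about 102 µs), so SPS here counts world-steps across all parallel rollouts.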


tl;dr

comparing performance of benchmark/cloth/scene.xml to main (f2f7957) with --nefcdof=64:

  • ~8x throughput improvement
  • ~16.5 GB reduction in device memory


todo:

  • print a warning if the number of dofs required for a constraint exceeds the maximum dofs setting (nefcdof)
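one way the warning could look is sketched below as a host-side check. check_nefcdof and its row_ndof argument (the number of dofs each constraint row needs) are hypothetical names, and how such counts would actually be collected from the device is not covered by this pr description.

```python
import warnings


def check_nefcdof(row_ndof, nefcdof):
    """Warn if any constraint row needs more dofs than the cap allows.

    Hypothetical helper: row_ndof is an iterable of per-row dof counts.
    Returns True when every row fits within nefcdof.
    """
    worst = max(row_ndof, default=0)
    if worst > nefcdof:
        warnings.warn(
            f"a constraint requires {worst} dofs but nefcdof={nefcdof}; "
            "increase nefcdof to avoid dropping Jacobian entries",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True


# Example: one row needs 70 dofs, which exceeds a cap of 64.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = check_nefcdof([3, 12, 70], nefcdof=64)

print(ok, len(caught))
```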

@thowell thowell linked an issue Dec 18, 2025 that may be closed by this pull request
4 tasks
@thowell thowell mentioned this pull request Dec 19, 2025
3 tasks
@thowell thowell mentioned this pull request Jan 8, 2026

Development

Successfully merging this pull request may close these issues.

JacobianType.SPARSE
