set maximum number of dofs for `efc_J` #936

thowell · 2025-12-18T18:08:00Z

this pr depends on #931, #934, and #935 and adds a parameter (--nefcdof in mjwarp-testspeed) that specifies the maximum number of dofs per constraint. this parameter can significantly improve memory utilization by reducing the size of efc.J and efc.J_colind.

benchmark/cloth/scene.xml

Loading model from: benchmark/cloth/scene.xml...
  nbody: 918 nv: 2706 ngeom: 921 nu: 0 is_sparse: True
  broadphase: SAP_TILE broadphase_filter: PLANE|SPHERE|OBB
  solver: CG cone: PYRAMIDAL iterations: 100 iterative linesearch iterations: 50
  integrator: EULER graph_conditional: True
Data
  nworld: 256 naconmax: 128000 njmax: 4000

compare integration of #931, #934, and #935 with main (f2f7957)

mjwarp-testspeed benchmark/cloth/scene.xml --nworld=256 --nconmax=500 --njmax=4000 --nstep=10 --event_trace=True --memory=True

this pr:

Rolling out 10 steps at dt = 0.005...

Summary for 256 parallel rollouts

Total JIT time: 0.31 s
Total simulation time: 0.62 s
Total steps per second: 4,115
Total realtime factor: 20.57 x
Total time per step: 243032.03 ns
Total converged worlds: 256 / 256

Event trace:

step: 240856.00
  forward: 240550.40
    fwd_position: 101425.19
      kinematics: 615.04
      com_pos: 336.00
      camlight: 42.40
      flex: 203.20
      crb: 246.40
      tendon_armature: 6.00
      collision: 876.00
        sap_broadphase: 778.70
        convex_narrowphase: 5.20
        primitive_narrowphase: 62.80
      make_constraint: 99036.00
      transmission: 5.60
    sensor_pos: 5.20
    fwd_velocity: 2840.00
      com_vel: 353.20
      passive: 1814.40
      rne: 640.00
      tendon_bias: 5.20
    sensor_vel: 5.20
    fwd_actuation: 18.40
    fwd_acceleration: 1906.40
      xfrc_accumulate: 1669.85
    solve: 134287.60
      mul_m: 69.60
      solve_m: 91.20
    sensor_acc: 6.00
  euler: 289.60

Total memory: 21526.32 MB / 48640.12 MB (44.26%)
Model memory (0.20%):
 (no field >= 1% of utilized memory)
Data memory (99.80%):
 efc.J_colind: 10570.31 MB (49.10%)
 efc.J: 10570.31 MB (49.10%)

main (f2f7957):

Rolling out 10 steps at dt = 0.005...

Summary for 256 parallel rollouts

Total JIT time: 6.78 s
Total simulation time: 3.07 s
Total steps per second: 833
Total realtime factor: 4.16 x
Total time per step: 1201065.04 ns
Total converged worlds: 256 / 256

Event trace:

step: 1198886.00
  forward: 1198597.20
    fwd_position: 633470.62
      kinematics: 615.47
      com_pos: 339.60
      camlight: 42.40
      flex: 252525.60
      crb: 278.80
      tendon_armature: 4.40
      collision: 883.20
        sap_broadphase: 786.00
        convex_narrowphase: 6.40
        primitive_narrowphase: 64.00
      make_constraint: 378716.00
      transmission: 5.20
    sensor_pos: 5.20
    fwd_velocity: 2846.00
      com_vel: 345.20
      passive: 1836.80
      rne: 629.60
      tendon_bias: 4.80
    sensor_vel: 5.20
    fwd_actuation: 15.60
    fwd_acceleration: 1953.60
      xfrc_accumulate: 1713.60
    solve: 560238.00
      mul_m: 92.00
      solve_m: 94.40
    sensor_acc: 5.60
  euler: 271.60

Total memory: 17788.09 MB / 48640.12 MB (36.57%)
Model memory (0.24%):
 (no field >= 1% of utilized memory)
Data memory (99.76%):
 flexedge_J: 6820.49 MB (38.34%)
 efc.J: 10625.00 MB (59.73%)

summary

SPS: 833 -> 4,115 (4.9x) ✅
total memory: 17788.09 MB -> 21526.32 MB (+3738.23 MB) ❌

throughput is significantly improved, but there is a regression in memory utilization due to the introduction of J_colind
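the reported sizes are consistent with efc.J and efc.J_colind each being allocated as nworld x njmax x nv with 4-byte elements, and with "MB" meaning 2^20 bytes. this layout is inferred from the numbers above, not taken from the source, and `array_mib` is a hypothetical helper used only for this check:

```python
def array_mib(nworld: int, njmax: int, ncol: int, itemsize: int = 4) -> float:
    """Size in MiB of an (nworld, njmax, ncol) array with `itemsize`-byte elements."""
    return nworld * njmax * ncol * itemsize / 2**20


# values from the benchmark header: nworld=256, njmax=4000, nv=2706
efc_J = array_mib(256, 4000, 2706)         # assumed float32 values
efc_J_colind = array_mib(256, 4000, 2706)  # assumed int32 column indices

print(f"efc.J: {efc_J:.2f} MB")
print(f"efc.J_colind: {efc_J_colind:.2f} MB")
```

both lines print 10570.31 MB, matching the memory report for this pr: the index array doubles the cost of storing full-width rows.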

compare sparse constraint rows with main

with the changes introduced in this pr, we can significantly improve memory utilization by capping the number of dofs per constraint with --nefcdof=64.

mjwarp-testspeed benchmark/cloth/scene.xml --nworld=256 --nconmax=500 --njmax=4000 --nstep=10 --event_trace=True --memory=True --nefcdof=64

Rolling out 10 steps at dt = 0.005...

Summary for 256 parallel rollouts

Total JIT time: 0.29 s
Total simulation time: 0.38 s
Total steps per second: 6,662
Total realtime factor: 33.31 x
Total time per step: 150097.21 ns
Total converged worlds: 256 / 256

Event trace:

step: 148139.20
  forward: 147822.80
    fwd_position: 6694.40
      kinematics: 638.80
      com_pos: 350.40
      camlight: 42.80
      flex: 206.80
      crb: 243.60
      tendon_armature: 6.00
      collision: 910.00
        sap_broadphase: 809.60
        convex_narrowphase: 4.80
        primitive_narrowphase: 65.60
      make_constraint: 4229.60
      transmission: 6.00
    sensor_pos: 5.20
    fwd_velocity: 2866.00
      com_vel: 362.40
      passive: 1836.40
      rne: 633.60
      tendon_bias: 5.20
    sensor_vel: 5.20
    fwd_actuation: 19.20
    fwd_acceleration: 1983.20
      xfrc_accumulate: 1736.58
    solve: 136190.80
      mul_m: 69.60
      solve_m: 94.00
    sensor_acc: 6.40
  euler: 299.20

Total memory: 885.69 MB / 48640.12 MB (1.82%)
Model memory (4.76%):
 dof_tri_row: 13.97 MB (1.58%)
 dof_tri_col: 13.97 MB (1.58%)
Data memory (95.24%):
 cdof: 15.86 MB (1.79%)
 cinert: 8.96 MB (1.01%)
 flexedge_J_colind: 15.12 MB (1.71%)
 flexedge_J: 15.12 MB (1.71%)
 crb: 8.96 MB (1.01%)
 cdof_dot: 15.86 MB (1.79%)
 efc.J_colind: 250.00 MB (28.23%)
 efc.J: 250.00 MB (28.23%)
 efc.quad: 11.72 MB (1.32%)

summary

SPS: 833 -> 6,662 (~8.0x) ✅
total memory: 17788.09 MB -> 885.69 MB (saves 16902.40 MB, more than 15GB) ✅
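as an illustration of why the cap helps, here is a minimal pure-python sketch of a fixed-width sparse row: each constraint row stores at most nefcdof values plus their column indices instead of nv entries. the function names and the zero-padding scheme are assumptions for this sketch, not the kernels in this pr:

```python
def compress_row(dense_row, nefcdof):
    """Pack the nonzeros of a dense row into fixed-width (values, colind) lists."""
    nonzeros = [(j, v) for j, v in enumerate(dense_row) if v != 0.0]
    if len(nonzeros) > nefcdof:
        raise ValueError(f"row needs {len(nonzeros)} dofs, cap is {nefcdof}")
    pad = nefcdof - len(nonzeros)
    values = [v for _, v in nonzeros] + [0.0] * pad
    # padded slots point at column 0 with value 0.0, so they contribute nothing
    colind = [j for j, _ in nonzeros] + [0] * pad
    return values, colind


def row_dot(values, colind, x):
    """Dot product of one compressed row with a dense vector x."""
    return sum(v * x[j] for v, j in zip(values, colind))


dense = [0.0, 2.0, 0.0, 0.0, -1.0, 0.0, 0.0, 3.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

values, colind = compress_row(dense, nefcdof=4)
dense_result = sum(a * b for a, b in zip(dense, x))
sparse_result = row_dot(values, colind, x)
print(values, colind, sparse_result)
```

with 64 entries per row instead of nv=2706, 256 x 4000 x 64 x 4 bytes is 250 MB per array, which agrees with the efc.J and efc.J_colind sizes reported above.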

increase throughput by increasing the number of worlds

with --nefcdof=64 the memory footprint is small enough that we can also increase the number of worlds to --nworld=4096

Rolling out 10 steps at dt = 0.005...

Summary for 4096 parallel rollouts

Total JIT time: 0.82 s
Total simulation time: 4.19 s
Total steps per second: 9,777
Total realtime factor: 48.88 x
Total time per step: 102283.59 ns
Total converged worlds: 4096 / 4096

Event trace:

step: 102088.00
  forward: 101824.05
    fwd_position: 6266.05
      kinematics: 413.12
      com_pos: 413.95
      camlight: 3.43
      flex: 344.03
      crb: 297.60
      tendon_armature: 0.32
      collision: 760.82
        sap_broadphase: 751.27
        convex_narrowphase: 0.35
        primitive_narrowphase: 7.38
      make_constraint: 4028.85
      transmission: 0.37
    sensor_pos: 0.40
    fwd_velocity: 2404.78
      com_vel: 242.83
      passive: 1668.13
      rne: 491.75
      tendon_bias: 0.37
    sensor_vel: 0.40
    fwd_actuation: 4.40
    fwd_acceleration: 1499.00
      xfrc_accumulate: 1342.71
    solve: 91645.10
      mul_m: 109.47
      solve_m: 35.28
    sensor_acc: 0.32
  euler: 262.90

Total memory: 13538.90 MB / 48640.12 MB (27.83%)
Model memory (0.31%):
 (no field >= 1% of utilized memory)
Data memory (99.69%):
 cdof: 253.69 MB (1.87%)
 cinert: 143.44 MB (1.06%)
 flexedge_J_colind: 241.97 MB (1.79%)
 flexedge_J: 241.97 MB (1.79%)
 crb: 143.44 MB (1.06%)
 cdof_dot: 253.69 MB (1.87%)
 efc.J_colind: 4000.00 MB (29.54%)
 efc.J: 4000.00 MB (29.54%)
 efc.quad: 187.50 MB (1.38%)

SPS: 6,662 -> 9,777

tl;dr

comparing performance of benchmark/cloth/scene.xml to main (f2f7957) with --nefcdof=64 and --nworld=256
~8x throughput improvement (833 -> 6,662 SPS)
more than 15GB reduction in device memory (17788.09 MB -> 885.69 MB)
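the ratios can be recomputed from the run summaries with a short script. the inputs are copied from the logs above; the relation realtime factor = steps per second x dt is inferred from those logs (it reproduces the reported factors to within rounding) rather than taken from the mjwarp-testspeed source:

```python
DT = 0.005  # timestep from the rollout logs

# total steps per second as reported in each run summary
sps = {
    "main": 833,
    "this pr": 4115,
    "this pr, nefcdof=64": 6662,
    "this pr, nefcdof=64, nworld=4096": 9777,
}

# total memory in MB as reported for the nworld=256 runs
memory_mb = {
    "main": 17788.09,
    "this pr": 21526.32,
    "this pr, nefcdof=64": 885.69,
}


def speedup(run, base="main"):
    """Throughput ratio of `run` relative to `base`."""
    return sps[run] / sps[base]


def realtime_factor(run):
    """Simulated seconds per wall-clock second, assuming SPS * dt."""
    return sps[run] * DT


def memory_saved_mb(run, base="main"):
    """Reduction in total memory relative to `base`, in MB."""
    return memory_mb[base] - memory_mb[run]


for run in sps:
    print(f"{run}: {speedup(run):.2f}x vs main, realtime factor {realtime_factor(run):.2f}")
print(f"memory saved with nefcdof=64: {memory_saved_mb('this pr, nefcdof=64'):.2f} MB")
```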

todo:

print a warning if the number of dofs required for a constraint exceeds the maximum dofs setting
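one possible shape for that check, sketched on the host side in plain python. `check_row_dofs` and its arguments are hypothetical and only illustrate the idea; the actual check would likely need to run alongside the constraint kernels:

```python
import warnings


def check_row_dofs(row_nnz, nefcdof):
    """Warn if any constraint row needs more dofs than the cap; return the max needed.

    row_nnz: iterable with the number of dofs each constraint row requires.
    nefcdof: maximum number of dofs stored per constraint row.
    """
    needed = max(row_nnz, default=0)
    if needed > nefcdof:
        warnings.warn(
            f"constraint requires {needed} dofs but nefcdof={nefcdof}; "
            "increase nefcdof to avoid dropping Jacobian entries",
            RuntimeWarning,
            stacklevel=2,
        )
    return needed


# example: the third row needs 70 dofs, which exceeds a cap of 64
max_needed = check_row_dofs([12, 48, 70], nefcdof=64)
```

returning the largest required count gives the user a concrete value to pass as the new cap.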

thowell added 3 commits December 18, 2025 13:01

sparse efc_J

fe92e1e

sparse contact and sparse flex

e9d9d37

maximum number of dofs per constraint

3ab3514

thowell linked an issue Dec 18, 2025 that may be closed by this pull request

JacobianType.SPARSE #88

Open

4 tasks

thowell mentioned this pull request Dec 19, 2025

update benchmarks containing flex #938

Open

3 tasks

thowell mentioned this pull request Jan 8, 2026

sparse flexedge_J #931

Open
