Improve pathways checkpoint load times #1345
base: main
Conversation
* Utilize shared memory between the JAX client and the Pathways proxy for data-heavy transfers, e.g. `device_put`s.
* Increase the `ThreadPoolExecutor` thread count from 32 (the Python default) to 192.
* Remove the memory limit from the Pathways head main container.

Callers should use a `concurrent_restore_gb` as large as possible short of OOM; otherwise the GCS read and `device_put` won't happen in parallel. The default of 32 GB is too low to achieve optimal performance with Pathways.
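The thread-count change in the second bullet can be sketched as below. This is an illustrative snippet, not the PR's actual code; the default-worker formula is the documented `concurrent.futures` behavior for Python 3.8+.

```python
from concurrent.futures import ThreadPoolExecutor

# ThreadPoolExecutor defaults max_workers to min(32, os.cpu_count() + 4),
# which caps concurrent GCS reads and device_puts at 32 on large hosts.
# Raising it to 192 (the PR's choice) lets many restore shards overlap.
pool = ThreadPoolExecutor(max_workers=192)
```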
- # This image version extends GRPC timeout for long context models, based on jax-0.5.3-patch060625
+ # This image extends GRPC timeout for long context models.
- _PATHWAYS_IMAGE_TAG = "disable_settings_20250701"
+ _PATHWAYS_IMAGE_TAG = "shm_proxy"
Could you double check with Shauray that this binary includes the patch extending the GRPC timeout? Or is it no longer needed?
+ # The flag below is needed for better H2D performance.
+ # Rule of thumb: 3x the shard size. So 128GB to be safe.
+ # Decrease if you start running out of host memory on TPU VMs.
+ "--tpu_premapped_buffer_size=137438953472",
Let's use 1/4 of the machine type's host memory, rounded up to a power of 2:
https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/system_characteristics.py#L494-L499
  self._loop_thread.start()
- self._single_thread_pool = ThreadPoolExecutor(1)
+ self._single_thread_pool = ThreadPoolExecutor(max_workers=1)
+ self._multi_thread_pool = ThreadPoolExecutor(max_workers=192)
Can we make this a config flag? It depends on how many CPUs we allocate to the head pod: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/pathways_utils.py#L317
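One way to do what the reviewer suggests is sketched below. The field names (`pathways_head_cpu`, `restore_threads`) and the 2x-CPU default are hypothetical; the actual config machinery lives in axlearn's `pathways_utils.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class HeadConfig:
    pathways_head_cpu: int = 96  # CPUs requested for the head pod
    restore_threads: int = 0     # 0 means "derive from the CPU request"

def make_restore_pool(cfg: HeadConfig) -> ThreadPoolExecutor:
    # Default to 2 threads per allocated CPU so I/O-bound GCS reads overlap.
    workers = cfg.restore_threads or 2 * cfg.pathways_head_cpu
    return ThreadPoolExecutor(max_workers=workers)
```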
axlearn/cloud/gcp/pathways_utils.py
Outdated
mem_req = f"{self.config.pathways_head_mem}Gi"
resources = {
    "requests": {"cpu": cpu_req, "memory": mem_req},
    "limits": {"cpu": cpu_req, "memory": mem_req},
For my education, what's the effect of having "request" and not "limit"?
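To illustrate the distinction the question is asking about, here is a requests-only resources dict (values are illustrative, not from this PR):

```python
# With "requests" but no "limits", Kubernetes reserves the requested amount
# for scheduling, but the container may burst above it when the node has
# spare capacity; memory use is then bounded only by node pressure, and the
# pod's QoS class becomes Burstable rather than Guaranteed.
resources = {
    "requests": {"cpu": "96", "memory": "128Gi"},
    # no "limits" key: the pod can use spare node capacity
}
```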
This pull request has been automatically marked as stale because it has been inactive for 60 days. It will be closed in 7 days if no further activity occurs. If you would like to continue working on this, please remove the
Callers of `deserialize` should use a `concurrent_restore_gb` as large as possible short of OOM; otherwise the GCS read and `device_put` won't happen in parallel. The default of 32 GB is too low to achieve optimal performance with Pathways.