Conversation

Contributor

@rohitc33 rohitc33 commented Nov 4, 2025

Colocated python is disabled by default. It can be enabled by passing --bundler_spec=sidecars=colocated-python when launching a Pathways job or an LWS job. Both the Artifact Registry and CloudBuild bundlers are supported.

This PR is a refactored version of #1350

@rohitc33 rohitc33 requested review from a team as code owners November 4, 2025 19:19
@rohitc33 rohitc33 force-pushed the rohit-colocated-2 branch 2 times, most recently from 2faf540 to 6eac867 Compare November 6, 2025 22:27
Contributor

@muyangyuapple muyangyuapple left a comment

Also, can you please add it to the LWS Pathways job? https://github.pie.apple.com/foundation-models/axlearn/blob/3103ca9e3967e73347d89325ab5887f0a0efb90a/axlearn/cloud/gcp/runners/__init__.py#L63

This is used in the long-running inference service.

]
env:
- "DOCKER_BUILDKIT=1"
- "DOCKER_BUILDKIT=1\""""
Contributor

Do we really need to branch the logic in the bundler.py file?

I think we are just adding some extra dependencies for colocated python. We can use an env var in the Dockerfile to control that, like this: https://github.com/apple/axlearn/blob/main/Dockerfile#L104

It can be passed in via something like --bundler_spec=INSTALL_PATHWAYS_JAXLIB=true

Contributor Author

@rohitc33 rohitc33 Nov 12, 2025

To enable colocated python, an extra image for the colocated python sidecar running on each worker pod needs to be built (so 2 images get built). This implementation works the same for the user - to enable it you just need to pass in --bundler_spec=enable_colocated_python=True.

Contributor

@muyangyuapple muyangyuapple Nov 12, 2025

Can the head pod and the sidecar on the worker share the same image?

If there are two images to build, we should define another bundler called "sidecar_bundler" here. Adding logic to the bundler that is used only by a subset of jobs will make the bundler hard to maintain in the long run.

Contributor Author

@rohitc33 rohitc33 Dec 8, 2025

Refactored the enable_colocated_python field into a generic sidecars field. The flag to enable colocated python is now:
--bundler_spec=sidecars=colocated-python
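
As a rough illustration of how a key=value spec such as sidecars=colocated-python could be turned into a list of sidecars, here is a minimal sketch. The helper names parse_bundler_spec and enabled_sidecars, and the idea that several sidecars would be comma-separated, are assumptions made for this example and are not taken from the axlearn code.

```python
def parse_bundler_spec(specs: list[str]) -> dict[str, str]:
    """Parses repeated key=value spec entries into a dict (hypothetical helper)."""
    parsed = {}
    for spec in specs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = spec.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected key=value, got {spec!r}")
        parsed[key] = value
    return parsed


def enabled_sidecars(parsed: dict[str, str]) -> list[str]:
    """Returns the requested sidecar names; an empty list means none (the default)."""
    # Assumption: multiple sidecars would be given as a comma-separated list.
    return [name for name in parsed.get("sidecars", "").split(",") if name]


spec = parse_bundler_spec(["sidecars=colocated-python"])
print(enabled_sidecars(spec))
```

With no sidecars entry the list is empty, which matches the "disabled by default" behavior described for this PR.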


def __init__(self, cfg: Config, *, bundler: Bundler):
    super().__init__(cfg)
    self._enable_colocated_python = getattr(bundler.config, "enable_colocated_python", False)
Contributor

enable_colocated_python should be an attribute of PathwaysReplicatedJob.

Also, logically, using PathwaysColocatedPythonPlugin implies that colocated python is already enabled, doesn't it? So this class should not have an attribute called self._enable_colocated_python.

Contributor Author

PathwaysColocatedPythonPlugin is for job types that support colocated python but don't require it to be enabled. Its pathways_server_image and pathways_proxy_image properties return different values depending on whether colocated python is enabled. Later on, it could be worth moving more Pathways-specific logic into PathwaysColocatedPythonPlugin and renaming it to just PathwaysPlugin, since PathwaysReplicatedJob and PathwaysLeaderWorkerTemplate currently have a lot of duplicated logic.

if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS:
    raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}")

self._colocated_python = cfg.colocated_python.instantiate(bundler=bundler)
Contributor

Shouldn't we instantiate it only when colocated python is enabled?

Contributor Author

It's still used when colocated python is disabled to get pathways_server_image and pathways_proxy_image.

@rohitc33
Contributor Author

Also, can you please add it to the LWS Pathways job? https://github.pie.apple.com/foundation-models/axlearn/blob/3103ca9e3967e73347d89325ab5887f0a0efb90a/axlearn/cloud/gcp/runners/__init__.py#L63

This is used in the long-running inference service.

It's already implemented in PathwaysLeaderWorkerTemplate.

@rohitc33 rohitc33 force-pushed the rohit-colocated-2 branch 2 times, most recently from f674e10 to 276cd87 Compare November 12, 2025 01:10
@rohitc33 rohitc33 force-pushed the rohit-colocated-2 branch 2 times, most recently from 8a84231 to 463c966 Compare December 8, 2025 20:58
Co-authored-by: lkolluru05 <lkolluru@google.com>
@changlan changlan added the ready-to-merge Ready to merge after clearing all the reviews. label Dec 8, 2025
@changlan changlan merged commit 2a32a62 into apple:main Dec 8, 2025
5 of 6 checks passed
haely pushed a commit that referenced this pull request Dec 11, 2025
Co-authored-by: lkolluru05 <lkolluru@google.com>
ORIGINAL_AUTHOR=rohitc33 <70339497+rohitc33@users.noreply.github.com>
COPYBARA_INTEGRATE_REVIEW=#1353 from rohitc33:rohit-colocated-2 e3ac888
5 participants