feat(trainer): run dataset and model initializers in parallel #292

Sayan4496 wants to merge 1 commit into kubeflow:main
Conversation
Pull request overview
Updates the container backend to reduce startup latency by running dataset and model initializer containers concurrently instead of sequentially.
Changes:
- Execute dataset and model initializers in parallel via ThreadPoolExecutor.
- Wait for initializer completion and propagate failures to the caller.
- Add a debug log after all initializers finish successfully.
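A minimal sketch of the concurrent pattern the review describes, with placeholder functions standing in for the actual initializer container runs (the real backend's function and variable names differ):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholders: in the real backend these start the dataset and
# model initializer containers and block until they exit.
def run_dataset_initializer() -> str:
    return "dataset-initializer done"

def run_model_initializer() -> str:
    return "model-initializer done"

initializers = [run_dataset_initializer, run_model_initializer]
results = []

with ThreadPoolExecutor(max_workers=len(initializers)) as executor:
    # Submit both initializers so they run concurrently.
    futures = [executor.submit(fn) for fn in initializers]
    # Wait for all to complete; result() re-raises any initializer failure.
    for future in as_completed(futures):
        results.append(future.result())
```

Because `as_completed` yields futures in completion order, `results` is unordered; the caller only cares that every initializer finished without raising.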
```python
# Wait for all initializers to complete and propagate errors
for future in as_completed(futures):
    future.result()
```
When iterating `as_completed(futures)`, the first failing initializer raises immediately, so exceptions from any other initializer futures are never consumed, and which failure surfaces depends on completion order. That makes debugging nondeterministic when multiple initializers fail. Consider capturing results for all futures instead: map future→name, collect the exception from every `future.result()` into a list, then raise a combined, deterministic error after all have finished.
```diff
-# Wait for all initializers to complete and propagate errors
-for future in as_completed(futures):
-    future.result()
+# Wait for all initializers to complete and collect errors deterministically
+exceptions: list[Exception] = []
+for future in as_completed(futures):
+    try:
+        future.result()
+    except Exception as exc:
+        exceptions.append(exc)
+if exceptions:
+    if len(exceptions) == 1:
+        raise RuntimeError("Initializer failed") from exceptions[0]
+    messages = "\n".join(
+        f"{idx + 1}) {type(exc).__name__}: {exc}"
+        for idx, exc in enumerate(exceptions)
+    )
+    raise RuntimeError(
+        f"Multiple initializers failed:\n{messages}"
+    )
```
Signed-off-by: Sayan Deyashi <deyashisayan2@gmail.com>
Summary
Run dataset and model initializer containers in parallel in the container backend instead of executing them sequentially.
Motivation
Previously, when both dataset and model initializers were configured, they executed sequentially, increasing total startup time.
Since Docker/Podman allow multiple containers to mount the same volume simultaneously, running them in parallel reduces initialization latency to approximately the maximum of the two durations rather than their sum.
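A back-of-the-envelope illustration of that claim, using hypothetical durations (real timings depend on dataset and model size):

```python
# Hypothetical initializer durations, in seconds.
dataset_init_s = 40.0
model_init_s = 25.0

# Sequential execution pays the sum of both durations.
sequential_s = dataset_init_s + model_init_s
# Parallel execution is bounded by the slower initializer.
parallel_s = max(dataset_init_s, model_init_s)

print(sequential_s, parallel_s)
```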
Changes
- Run dataset and model initializers in parallel via ThreadPoolExecutor.

Testing
184 passed

Fixes #290