This repository was archived by the owner on Apr 8, 2024. It is now read-only.

Conversation

@jfomhover
Contributor

Initializing MPI in Python might interfere with the initialization done by the frameworks themselves (e.g., LightGBM training). We might still need it in some scenarios (such as sharing distributed metrics, as in #185), so we're making it optional for now.

When mpi_init_node is left as None, we use the OMPI environment variables to discover how many nodes there are and create the right configuration for the distributed script to use.
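
As a rough illustration of the environment-based discovery described above, here is a minimal sketch. It is not the code from this PR: the `MpiConfig` dataclass and the `detect_mpi_config` function are hypothetical names. It assumes the script is launched by Open MPI's `mpirun`, which sets `OMPI_COMM_WORLD_SIZE` and `OMPI_COMM_WORLD_RANK` for each process. Note that these variables count processes, which matches the node count only when one process runs per node.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class MpiConfig:
    """Hypothetical container for the discovered configuration."""
    world_size: int      # total number of MPI processes
    world_rank: int      # rank of the current process
    mpi_available: bool  # True when more than one process is detected
    main_node: bool      # True for rank 0


def detect_mpi_config(env: Optional[Mapping[str, str]] = None) -> MpiConfig:
    """Build a configuration from Open MPI environment variables,
    without importing or initializing MPI in Python."""
    env = os.environ if env is None else env
    size = env.get("OMPI_COMM_WORLD_SIZE")
    rank = env.get("OMPI_COMM_WORLD_RANK")

    # Variables are absent when the script is not launched through mpirun:
    # fall back to a single-process configuration.
    if size is None or rank is None:
        return MpiConfig(world_size=1, world_rank=0,
                         mpi_available=False, main_node=True)

    world_size = int(size)
    world_rank = int(rank)
    return MpiConfig(
        world_size=world_size,
        world_rank=world_rank,
        mpi_available=world_size > 1,
        main_node=world_rank == 0,
    )
```

Passing the environment as an argument keeps the function easy to test without launching `mpirun`; calling `detect_mpi_config()` with no argument reads `os.environ`.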

We're also proposing to upgrade to openmpi4.

NOTE: this is part of an effort to try this distributed training in production scenarios where we're hitting internal MPI exceptions (MPI_ERR_TRUNCATE).

@jfomhover jfomhover temporarily deployed to mlops December 17, 2021 17:28 Inactive
@jfomhover jfomhover temporarily deployed to mlops December 17, 2021 17:56 Inactive
@github-actions

Unit Test Results for Build

 1 files ±0   1 suites ±0   33s ⏱️ +3s
91 tests +1  91 ✔️ +1  0 💤 ±0  0 ❌ ±0

Results for commit 22285b2. ± Comparison against base commit b1bea5e.

This pull request removes 1 and adds 2 tests. Note that renamed tests count towards both.

Removed:
tests.common.test_distributed ‑ test_mpi_handler

Added:
tests.common.test_distributed ‑ test_mpi_handler_mpi_init
tests.common.test_distributed ‑ test_mpi_handler_no_mpi_init

@github-actions

Code Coverage

| Package | Line Rate | Branch Rate | Complexity |
| --- | --- | --- | --- |
| common | 85% | 0% | 0 |
| scripts | 100% | 0% | 0 |
| scripts.data_processing.generate_data | 93% | 0% | 0 |
| scripts.data_processing.lightgbm_data2bin | 95% | 0% | 0 |
| scripts.data_processing.partition_data | 92% | 0% | 0 |
| scripts.inferencing.custom_win_cli | 94% | 0% | 0 |
| scripts.inferencing.lightgbm_c_api | 75% | 0% | 0 |
| scripts.inferencing.lightgbm_python | 95% | 0% | 0 |
| scripts.inferencing.treelite_python | 94% | 0% | 0 |
| scripts.model_transformation.treelite_compile | 92% | 0% | 0 |
| scripts.sample | 93% | 0% | 0 |
| scripts.training.lightgbm_python | 80% | 0% | 0 |
| **Summary** | 86% (1227 / 1426) | 0% (0 / 0) | 0 |

@jfomhover jfomhover merged commit 0411df3 into main Dec 21, 2021
@jfomhover jfomhover deleted the jfomhover/fixtraining branch December 21, 2021 07:09