CI tests on multi-GPU runners #1229

gagank1 · 2025-10-07T09:27:57Z

Description

Implements support for multi-GPU unit tests on 2x RTX A6000 runners
Adds pytest.mark.multi_gpu to several previously unmarked tests
Adds an integrated data download to the Evo2 gradient equivalence test so that it can run in CI
Fixes a bug in the Evo2 preprocessor where the random seed was not respected during sampling
Fixes the MOCO multi-GPU tests (previously hanging)

Usage

Simply add @pytest.mark.multi_gpu to any future multi-gpu tests. Use the new ciflow:multi-gpu label to run them in PRs. Otherwise, they will run in the merge queue and on the nightly schedule.
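
As a rough illustration of the usage described above, the sketch below marks a test as multi-GPU. The helper `available_gpu_count` and the test name are hypothetical and not taken from this PR; the `skipif` guard is an extra assumption so that the test is skipped on machines with fewer than two GPUs. The `multi_gpu` marker itself is expected to be registered in the package's pytest configuration.

```python
import pytest


def available_gpu_count() -> int:
    """Return the number of visible CUDA devices, or 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


# Hypothetical test: the marker routes it to the multi-GPU CI jobs,
# and the skipif guard keeps it from failing on single-GPU machines.
@pytest.mark.multi_gpu
@pytest.mark.skipif(available_gpu_count() < 2, reason="needs at least 2 GPUs")
def test_two_rank_allreduce_placeholder():
    # A real test would spawn one process per GPU here.
    assert available_gpu_count() >= 2
```

Locally, such tests could be selected with `pytest -m multi_gpu` or excluded with `pytest -m "not multi_gpu"`.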

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Summary by CodeRabbit

New Features
- Added a public label to opt into multi-GPU testing.
Documentation
- Clarified CI behavior for multi-GPU tests and contributing guidance; noted nightly multi-GPU runs.
CI
- Split test pipelines into single-/multi-GPU and fast/slow paths; added runtime flags and new per-path jobs with updated verification and coverage aggregation.
Tests
- Marked many tests as multi-GPU, added pre-spawn environment setup and GPU stability workarounds, and enabled on-demand test data preparation.
Bug Fixes
- Made preprocessing split selection deterministic.
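
To illustrate why the split selection fix matters, here is a minimal sketch, not the actual `preprocess.py` code. The function name `assign_splits`, its parameters, and the split names are assumptions for illustration. The idea is that popping from a `set` depends on iteration order, which for strings can differ between interpreter runs, whereas sampling from a sorted list with a seeded generator is reproducible.

```python
import random


def assign_splits(sequence_ids, splits=("train", "val", "test"), seed=42):
    """Assign one sequence ID to each split, reproducibly for a given seed."""
    # Sort first so the starting order does not depend on set iteration order.
    ordered = sorted(sequence_ids)
    if len(ordered) < len(splits):
        raise ValueError("need at least one sequence ID per split")
    # A dedicated generator keeps the result independent of global random state.
    rng = random.Random(seed)
    shuffled = rng.sample(ordered, k=len(ordered))
    assignment = {}
    for split in splits:
        # pop(0) takes from the front of the seeded shuffle, in order.
        assignment[split] = shuffled.pop(0)
    return assignment
```

With this shape, the same seed yields the same assignment regardless of whether the IDs arrive as a set or as a list in a different order.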

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

copy-pr-bot · 2025-10-07T09:28:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2025-10-07T09:28:09Z

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds a ciflow:multi-gpu label; splits CI test jobs into single-/multi-GPU and fast/slow paths; extends pytest runner with --only-multi-gpu/--skip-multi-gpu; marks many tests with @pytest.mark.multi_gpu; adds pre-spawn env and A6000 NCCL workaround; makes Evo2 preprocessing deterministic and adds a test data downloader helper.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI labels & docs: `.github/labels.yml`, `.github/pull_request_template.md`, `docs/docs/main/contributing/contributing.md` | Adds `ciflow:multi-gpu` label; updates PR template and contributing docs to clarify multi-GPU tests are excluded from PR CI and run separately (nightly). |
| CI workflows (framework): `.github/workflows/unit-tests-framework.yml` | Replaces monolithic test jobs with per-path jobs (`run-tests-single-gpu`, `run-tests-multi-gpu`, `run-tests-slow-single-gpu`, `run-tests-slow-multi-gpu`, `run-tests-notebooks`); updates conditions, per-job scripts, coverage/Codecov steps, and `verify-tests-status` dependencies. |
| CI workflows (recipes): `.github/workflows/unit-tests-recipes.yml` | Splits recipe testing into `unit-tests-single-gpu` and `unit-tests-multi-gpu` matrix jobs; mirrors setup/checkout/install steps; updates verification to depend on both jobs and aggregates failures. |
| PyTest runner script: `ci/scripts/pytest_runner.sh` | Adds CLI flags `--skip-multi-gpu` and `--only-multi-gpu` with variables `SKIP_MULTI_GPU`, `ONLY_MULTI_GPU`; composes combined `MARKER_EXPR` (multi_gpu/slow) and applies `-m` to pytest when needed. |
| New/updated test runner scripts: `ci/scripts/run_pytest_multigpu.sh`, `ci/scripts/run_pytest_slow_multigpu.sh`, `ci/scripts/run_pytest_unittests.sh`, `ci/scripts/run_pytest_slow.sh` | Adds `run_pytest_multigpu.sh` and `run_pytest_slow_multigpu.sh`; updates unittests/slow invocations to pass `--skip-multi-gpu` or appropriate flags to `pytest_runner.sh`. |
| PyTest marker configs: `bionemo-recipes/models/amplify/pyproject.toml`, `bionemo-recipes/models/esm2/pyproject.toml` | Adds `slow` and `multi_gpu` markers to tool.pytest.ini_options marker lists. |
| Test marker additions (recipes & models): `bionemo-recipes/models/esm2/tests/...`, `bionemo-recipes/recipes/...`, `sub-packages/bionemo-llm/tests/...`, `sub-packages/bionemo-evo2/tests/...` | Adds `@pytest.mark.multi_gpu` to many distributed/training tests and parametrizations; updates some test ids and skip/mark combinations. |
| MOCO distributed tests (env & marks): `sub-packages/bionemo-moco/tests/...` | Adds `os`/`socket` imports; sets `MASTER_ADDR` and assigns a free `MASTER_PORT` before spawn; disables NCCL P2P on A6000 GPUs via `NCCL_P2P_DISABLE=1` when CUDA present; expands world_size parametrizations and marks multi-GPU cases. |
| Evo2 deterministic preprocess: `sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py` | Replaces unordered `set` + `pop()` with ordered `list` and `random.sample(...);pop(0)` to make split assignment deterministic under a seed. |
| Evo2 tests (data helper & fixture changes): `sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py` | Adds `download_and_preprocess_training_data()` helper (appears duplicated in diff); `dataset_config` fixture now calls helper when dataset paths missing; adds multi-GPU marks to tests. |
| Various small test updates: `sub-packages/bionemo-moco/...`, `sub-packages/bionemo-evo2/tests/...`, `sub-packages/bionemo-llm/...` | Multiple tests updated to include multi-GPU marks or to adjust parametrizations; some tests add environment/setup steps pre-spawn. |
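
The pre-spawn environment setup described for the MOCO tests might look roughly like the sketch below. This is not the code from the PR: the function names `find_free_port` and `prepare_distributed_env` are hypothetical, and the GPU names are passed in as a plain list (a real test would presumably query them from the CUDA runtime). Only the environment variable names `MASTER_ADDR`, `MASTER_PORT`, and `NCCL_P2P_DISABLE` come from the summary above.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 lets the OS choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def prepare_distributed_env(gpu_names, env=None):
    """Set rendezvous variables before spawning worker processes."""
    env = os.environ if env is None else env
    # Bind the rendezvous to localhost only.
    env["MASTER_ADDR"] = "127.0.0.1"
    env["MASTER_PORT"] = str(find_free_port())
    # Workaround from the PR summary: disable NCCL peer-to-peer on A6000 GPUs.
    if any("A6000" in name for name in gpu_names):
        env["NCCL_P2P_DISABLE"] = "1"
    return env
```

One caveat of this pattern is that the port is released before the workers bind to it, so another process could in principle claim it in between; for short-lived CI tests this is usually an acceptable trade-off.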

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Dev as Developer
  participant GH as GitHub Actions
  participant Jobs as unit-tests-* jobs
  participant Verify as verify-tests-status

  Dev->>GH: Push PR or schedule
  GH->>Jobs: Start workflow (matrix)
  alt PR CI
    Jobs->>Jobs: Run single-GPU jobs (multi-GPU jobs skipped)
  else Nightly / scheduled
    Jobs->>Jobs: Run single-GPU jobs
    Jobs->>Jobs: Run multi-GPU jobs after single-GPU success
  end
  Jobs-->>GH: Upload artifacts/coverage
  Jobs->>Verify: Report job results
  Verify-->>GH: Aggregate status

sequenceDiagram
  autonumber
  participant Job as CI job
  participant Runner as pytest_runner.sh
  participant PyTest as pytest

  Job->>Runner: Invoke with flags (--only-multi-gpu / --skip-multi-gpu / --skip-slow / --only-slow)
  Runner->>Runner: Build MARKER_EXPR from flags (multi_gpu, slow)
  alt MARKER_EXPR non-empty
    Runner->>PyTest: pytest -m "<MARKER_EXPR>" [options]
  else
    Runner->>PyTest: pytest [options]
  end
  PyTest-->>Job: Return results and reports
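
The runner flow in the diagram above can be sketched in Python as follows. The actual `pytest_runner.sh` is a shell script whose exact logic is not shown here, so the function names and the choice to join clauses with `and` are assumptions for illustration only.

```python
def build_marker_expr(skip_multi_gpu=False, only_multi_gpu=False,
                      skip_slow=False, only_slow=False):
    """Compose a pytest -m expression from the runner's boolean flags."""
    if skip_multi_gpu and only_multi_gpu:
        raise ValueError("--skip-multi-gpu and --only-multi-gpu are exclusive")
    if skip_slow and only_slow:
        raise ValueError("--skip-slow and --only-slow are exclusive")
    parts = []
    if only_multi_gpu:
        parts.append("multi_gpu")
    elif skip_multi_gpu:
        parts.append("not multi_gpu")
    if only_slow:
        parts.append("slow")
    elif skip_slow:
        parts.append("not slow")
    # An empty string means no marker filtering at all.
    return " and ".join(parts)


def pytest_args(marker_expr, extra=()):
    """Build the pytest command line, adding -m only when an expression exists."""
    args = ["pytest"]
    if marker_expr:
        args += ["-m", marker_expr]
    return args + list(extra)
```

For example, the single-GPU fast path would correspond to `build_marker_expr(skip_multi_gpu=True, skip_slow=True)`, which yields `not multi_gpu and not slow`.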

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I tagged the CI with teal delight,
I hop where GPUs split day and night.
Nightly I wait while single-GPU runs,
Then multi-GPU dances under teal suns.
I nibble seeds and make splits right — hooray! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request description includes the Description, Usage, Type of changes, and Pre-submit Checklist sections but omits the required CI Pipeline Configuration section from the repository template and lacks a code snippet under Usage, making it incomplete against the expected template. | Please add the “CI Pipeline Configuration” section with the relevant labels and, if applicable, include a short code snippet under “Usage” to fully match the repository’s PR description template. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly captures the primary change of enabling CI tests on multi-GPU runners without extraneous detail and clearly informs reviewers of the main intent of the pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |


gagank1 · 2025-10-07T09:28:22Z

/ok to test 68fa036

.github/workflows/unit-tests-framework.yml

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-10-07T09:51:50Z

/ok to test 62f381d

codecov-commenter · 2025-10-07T11:27:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.42%. Comparing base (e5e58c8) to head (ebf33b8).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1229      +/-   ##
==========================================
- Coverage   76.53%   76.42%   -0.12%     
==========================================
  Files         101      101              
  Lines        7931     7931              
==========================================
- Hits         6070     6061       -9     
- Misses       1861     1870       +9

see 1 file with indirect coverage changes

trvachov · 2025-10-07T17:45:56Z

From standup: Is it possible to start having these run nightly, and track queue time + success rate?

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

.github/workflows/unit-tests-framework.yml

gagank1 · 2025-10-09T13:23:59Z

/ok to test 250bd29

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-10-09T13:39:40Z

/ok to test 68307b1

...onemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py

...bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py

...s/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py

.../bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py

.../bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py

...es/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-10-09T13:54:16Z

/ok to test 6fbe6f4

- Merged in latest changes from origin/main
- Added @pytest.mark.multi_gpu to test_distributed_fp8.py::test_multi_process_fp8_recipes_are_synced
- Added @pytest.mark.multi_gpu to test_train.py::test_distributed_training_gradient_equivalence
- These new multi-GPU tests will run on the 2-GPU runner in merge queue/schedule

…g in evo2 preprocessing that ignored random seed during bootstrap Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-10-11T16:06:13Z

/ok to test b777145

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-10-14T11:18:39Z

/ok to test c3f9737

pstjohn · 2025-10-17T20:11:57Z

.github/workflows/unit-tests-framework.yml

+    if: |
+        (needs.build-bionemo-image.result == 'success') &&
+        (needs.run-tests-single-gpu.result == 'success') &&
+        (github.event_name == 'schedule')


Should we run on either the schedule event or the PR label?

I thought we wanted it to run only nightly? Or is it nightly + label, with no merge queue for now?

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

…ests pass OR on PRs with the ciflow:multi-gpu label Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

.github/workflows/unit-tests-recipes.yml

+    runs-on: ubuntu-latest
+    outputs:
+      labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
+    steps:
+      - name: Get PR number from branch
+        if: startsWith(github.ref, 'refs/heads/pull-request/')
+        id: get-pr-num
+        run: |
+          PR_NUM=$(echo ${{ github.ref_name }} | grep -oE '[0-9]+$')
+          echo "pr_num=$PR_NUM" >> $GITHUB_OUTPUT
+
+      - name: Get PR labels
+        id: get-labels
+        if: startsWith(github.ref, 'refs/heads/pull-request/')
+        env:
+          GH_TOKEN: ${{ github.token }}
+        run: |
+          LABELS=$(gh api repos/${{ github.repository }}/pulls/${{ steps.get-pr-num.outputs.pr_num }} --jq '[.labels[].name]' || echo "[]")
+          echo "labels=$LABELS" >> $GITHUB_OUTPUT
+          echo "Retrieved labels: $LABELS"
+
+      - name: Set empty labels for non-PR branches
+        if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
+        id: get-labels-empty
+        run: |
+          echo "labels=[]" >> $GITHUB_OUTPUT
+          echo "Set empty labels for non-PR branch"
+
+  unit-tests-multi-gpu:


To fix the problem, explicitly define minimal GITHUB_TOKEN permissions in the workflow so the job does not inherit broad defaults. Since the get-pr-labels job uses gh api to read PR information and nothing else, it only needs read access to repository contents and PRs. The cleanest fix without altering functionality is to add a permissions block scoped specifically to the get-pr-labels job. This avoids affecting other jobs whose required permissions are not shown.

Concretely, in .github/workflows/unit-tests-recipes.yml, under the get-pr-labels: job header (around line 215) and before runs-on: ubuntu-latest, add:

    permissions:
      contents: read
      pull-requests: read

No additional imports or methods are needed since this is a YAML workflow configuration change only. All other job steps and behavior remain unchanged.
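
For readers less familiar with the shell in the `get-pr-labels` job quoted above, the two lookup steps can be sketched in Python. The function names are hypothetical; the branch naming (`pull-request/<number>`) and the fall back to an empty label list follow the workflow hunk, while the JSON parsing stands in for the `gh api ... --jq` call.

```python
import json
import re


def pr_number_from_ref(ref_name):
    """Extract the trailing PR number from a copy-pr-bot branch name."""
    # Mirrors `grep -oE '[0-9]+$'` in the workflow step.
    match = re.search(r"[0-9]+$", ref_name)
    return int(match.group()) if match else None


def parse_labels(raw):
    """Parse a JSON array of label names, falling back to an empty list."""
    try:
        labels = json.loads(raw)
    except (TypeError, ValueError):
        return []
    return labels if isinstance(labels, list) else []
```

A branch such as `pull-request/1229` therefore maps to PR number 1229, and any unparsable API response is treated as "no labels".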

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-11-13T15:06:03Z

/ok to test 2fe65dc

.github/workflows/unit-tests-framework.yml

+    needs:
+      - build-bionemo-image
+      - run-tests-single-gpu
+      - get-pr-labels
+    runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
+    container:
+      image: svcbionemo023/bionemo-framework:${{ github.run_id }}
+      credentials:
+        username: ${{ vars.DOCKER_USERNAME }}
+        password: ${{ secrets.DOCKER_PASSWORD }}
+    # Run multi-GPU tests ONLY when:
+    # Prerequisites: build succeeds AND single-GPU tests pass
+    # Then run if: schedule OR (push with ciflow:all OR ciflow:multi-gpu label)
+    # Do NOT run on merge_group or any other events
+    if: |
+        (needs.build-bionemo-image.result == 'success') &&
+        (needs.run-tests-single-gpu.result == 'success') &&
+        (
+          github.event_name == 'schedule' ||
+          (
+            github.event_name == 'push' &&
+            (
+              contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+              contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+            )
+          )
+        )
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Run multi-GPU tests
+        env:
+          BIONEMO_DATA_SOURCE: ngc
+        run: |
+          chmod +x ./ci/scripts/run_pytest_multigpu.sh
+          ./ci/scripts/run_pytest_multigpu.sh
+
+  run-tests-slow-single-gpu:
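
The gating condition in the hunk above can be restated as a small Python predicate. This is a simplified paraphrase for readability, not code from the repository; the function name and argument names are assumptions, and label matching here is plain list membership.

```python
def should_run_multi_gpu(build_result, single_gpu_result, event_name, labels):
    """Return True when the multi-GPU job should run, per the workflow comments."""
    # Prerequisites: image build succeeded and single-GPU tests passed.
    if build_result != "success" or single_gpu_result != "success":
        return False
    # Nightly or otherwise scheduled runs always include multi-GPU tests.
    if event_name == "schedule":
        return True
    # Pushes (PR branches mirrored by copy-pr-bot) need an opt-in label.
    if event_name == "push":
        return "ciflow:all" in labels or "ciflow:multi-gpu" in labels
    # merge_group and all other events do not run multi-GPU tests.
    return False
```

Under this reading, a labeled PR push and a scheduled run both trigger the job, while a `merge_group` event never does, even when both prerequisites succeed.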


Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-11-13T16:21:41Z

/ok to test 3bf090d

…plify.py Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-11-13T17:40:43Z

/ok to test 5f17689

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-11-13T17:53:58Z

/ok to test f14497c

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-11-14T18:24:15Z

/ok to test 91e4dcd

gagank1 · 2025-12-02T20:38:00Z

/ok to test 91e4dcd

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 · 2025-12-02T21:00:25Z

/ok to test e8c590a

gagank1 · 2025-12-12T21:23:16Z

/ok to test d2bf3ce

gagank1 · 2025-12-18T18:50:46Z

/ok to test 7bcb9ec

gagank1 · 2026-01-29T20:21:03Z

/ok to test ebf33b8

initial commit of adding multi-gpu ci runners

68fa036

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 self-assigned this Oct 7, 2025

github-advanced-security bot found potential problems Oct 7, 2025

.github/workflows/unit-tests-framework.yml (fixed)

fix missing multi_gpu marker in esm2

62f381d

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 added the ciflow:slow Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2 label Oct 7, 2025

switch to a6000 runners

250bd29

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

github-advanced-security bot found potential problems Oct 9, 2025

.github/workflows/unit-tests-framework.yml (fixed)

fixed moco multi-gpu tests

68307b1

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

github-advanced-security bot found potential problems Oct 9, 2025

bind to only localhost

6fbe6f4

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 added 2 commits October 11, 2025 12:30

implemented data download in evo2 gradient equivalence test. fixed bu…

b777145

…g in evo2 preprocessing that ignored random seed during bootstrap Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

added new ciflow:multi-gpu label and fixed scheduling

c3f9737

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 marked this pull request as ready for review October 14, 2025 11:33

gagank1 requested review from dorotat-nv and jstjohn as code owners October 14, 2025 11:33

pstjohn reviewed Oct 17, 2025

gagank1 added the ciflow:multi-gpu Run all multi GPU tests (unit tests, slow tests) for bionemo2 label Nov 7, 2025

gagank1 added 3 commits November 6, 2025 21:15

Merge remote-tracking branch 'origin/main' into gkaushik/multi-gpu-ci

617003a

added @pytest.mark.multi_gpu to new recipes tests

16dc4ff

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

update recipes workflow to run multigpu tests nightly if single gpu t…

9c994c7

…ests pass OR on PRs with the ciflow:multi-gpu label Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

github-advanced-security bot found potential problems Nov 7, 2025

gagank1 added the ciflow:slow Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2 label Nov 13, 2025

more scheduling changes

2fe65dc

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

github-advanced-security bot found potential problems Nov 13, 2025

add missing multi_gpu mark registration

3bf090d

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

fix recipes workflow return code, add fp8 check to test_accelerate_am…

5f17689

…plify.py Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

fine tune scheduling logic

f14497c

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 added 2 commits November 14, 2025 10:18

fix labels for copy pr bot

d2822a4

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

xfail known bug

91e4dcd

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

merged main

e8c590a

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>

gagank1 requested a review from savitha-eng as a code owner December 2, 2025 20:59

Merge remote-tracking branch 'origin/main' into gkaushik/multi-gpu-ci

d2bf3ce

gagank1 enabled auto-merge December 12, 2025 21:23

Merge remote-tracking branch 'origin/main' into gkaushik/multi-gpu-ci

7bcb9ec

Merge remote-tracking branch 'origin/main' into gkaushik/multi-gpu-ci

ebf33b8

@@ -215,6 +215,9 @@
               # With copy-pr-bot, we need to get the PR labels from the PR API rather than from the event metadata.
               get-pr-labels:
                 runs-on: ubuntu-latest
+                permissions:
+                  contents: read
+                  pull-requests: read
                 outputs:
                   labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
                 steps:
