Skip to content

Conversation

@gagank1
Copy link
Collaborator

@gagank1 gagank1 commented Oct 7, 2025

Description

  • Implements support for multi-gpu unit tests on 2x RTX A6000 runners
  • Adds pytest.mark.multi_gpu to several unmarked tests
  • Adds in integrated data download to Evo2 gradient equivalence test to allow it to run in CI
  • Fixes bug in Evo2 preprocessor where it wasn't respecting random seed during sampling
  • Fixes MOCO multi-gpu tests (previously hanging)

Usage

Simply add @pytest.mark.multi_gpu to any future multi-gpu tests. Use the new ciflow:multi-gpu label to run them in PRs. Otherwise, they will run in the merge queue and on the nightly schedule.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features

    • Added a public label to opt into multi-GPU testing.
  • Documentation

    • Clarified CI behavior for multi-GPU tests and contributing guidance; noted nightly multi-GPU runs.
  • CI

    • Split test pipelines into single-/multi-GPU and fast/slow paths; added runtime flags and new per-path jobs with updated verification and coverage aggregation.
  • Tests

    • Marked many tests as multi-GPU, added pre-spawn environment setup and GPU stability workarounds, and enabled on-demand test data preparation.
  • Bug Fixes

    • Made preprocessing split selection deterministic.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 self-assigned this Oct 7, 2025
@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 7, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review
📝 Walkthrough

Walkthrough

Adds a ciflow:multi-gpu label; splits CI test jobs into single-/multi-GPU and fast/slow paths; extends pytest runner with --only-multi-gpu/--skip-multi-gpu; marks many tests with @pytest.mark.multi_gpu; adds pre-spawn env and A6000 NCCL workaround; makes Evo2 preprocessing deterministic and adds a test data downloader helper.

Changes

Cohort / File(s) Summary
CI labels & docs
.github/labels.yml, .github/pull_request_template.md, docs/docs/main/contributing/contributing.md
Adds ciflow:multi-gpu label; updates PR template and contributing docs to clarify multi-GPU tests are excluded from PR CI and run separately (nightly).
CI workflows — framework
.github/workflows/unit-tests-framework.yml
Replaces monolithic test jobs with per-path jobs (run-tests-single-gpu, run-tests-multi-gpu, run-tests-slow-single-gpu, run-tests-slow-multi-gpu, run-tests-notebooks); updates conditions, per-job scripts, coverage/Codecov steps, and verify-tests-status dependencies.
CI workflows — recipes
.github/workflows/unit-tests-recipes.yml
Splits recipe testing into unit-tests-single-gpu and unit-tests-multi-gpu matrix jobs; mirrors setup/checkout/install steps; updates verification to depend on both jobs and aggregates failures.
PyTest runner script
ci/scripts/pytest_runner.sh
Adds CLI flags --skip-multi-gpu and --only-multi-gpu with variables SKIP_MULTI_GPU, ONLY_MULTI_GPU; composes combined MARKER_EXPR (multi_gpu/slow) and applies -m to pytest when needed.
New/updated test runner scripts
ci/scripts/run_pytest_multigpu.sh, ci/scripts/run_pytest_slow_multigpu.sh, ci/scripts/run_pytest_unittests.sh, ci/scripts/run_pytest_slow.sh
Adds run_pytest_multigpu.sh and run_pytest_slow_multigpu.sh; updates unittests/slow invocations to pass --skip-multi-gpu or appropriate flags to pytest_runner.sh.
PyTest marker configs
bionemo-recipes/models/amplify/pyproject.toml, bionemo-recipes/models/esm2/pyproject.toml
Adds slow and multi_gpu markers to tool.pytest.ini_options marker lists.
Test marker additions — recipes & models
bionemo-recipes/models/esm2/tests/..., bionemo-recipes/recipes/..., sub-packages/bionemo-llm/tests/..., sub-packages/bionemo-evo2/tests/...
Adds @pytest.mark.multi_gpu to many distributed/training tests and parametrizations; updates some test ids and skip/mark combinations.
MOCO distributed tests — env & marks
sub-packages/bionemo-moco/tests/...
Adds os/socket imports; sets MASTER_ADDR and assigns a free MASTER_PORT before spawn; disables NCCL P2P on A6000 GPUs via NCCL_P2P_DISABLE=1 when CUDA present; expands world_size parametrizations and marks multi-GPU cases.
Evo2 deterministic preprocess
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py
Replaces unordered set + pop() with ordered list and random.sample(...);pop(0) to make split assignment deterministic under a seed.
Evo2 tests — data helper & fixture changes
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py
Adds download_and_preprocess_training_data() helper (appears duplicated in diff); dataset_config fixture now calls helper when dataset paths missing; adds multi-GPU marks to tests.
Various small test updates
sub-packages/bionemo-moco/..., sub-packages/bionemo-evo2/tests/..., sub-packages/bionemo-llm/...
Multiple tests updated to include multi-GPU marks or to adjust parametrizations; some tests add environment/setup steps pre-spawn.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Dev as Developer
  participant GH as GitHub Actions
  participant Jobs as unit-tests-* jobs
  participant Verify as verify-tests-status

  Dev->>GH: Push PR or schedule
  GH->>Jobs: Start workflow (matrix)
  alt PR CI
    Jobs->>Jobs: Run single-GPU jobs (multi-GPU jobs skipped)
  else Nightly / scheduled
    Jobs->>Jobs: Run single-GPU jobs
    Jobs->>Jobs: Run multi-GPU jobs after single-GPU success
  end
  Jobs-->>GH: Upload artifacts/coverage
  Jobs->>Verify: Report job results
  Verify-->>GH: Aggregate status
Loading
sequenceDiagram
  autonumber
  participant Job as CI job
  participant Runner as pytest_runner.sh
  participant PyTest as pytest

  Job->>Runner: Invoke with flags (--only-multi-gpu / --skip-multi-gpu / --skip-slow / --only-slow)
  Runner->>Runner: Build MARKER_EXPR from flags (multi_gpu, slow)
  alt MARKER_EXPR non-empty
    Runner->>PyTest: pytest -m "<MARKER_EXPR>" [options]
  else
    Runner->>PyTest: pytest [options]
  end
  PyTest-->>Job: Return results and reports
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I tagged the CI with teal delight,
I hop where GPUs split day and night.
Nightly I wait while single-GPU runs,
Then multi-GPU dances under teal suns.
I nibble seeds and make splits right — hooray! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description includes the Description, Usage, Type of changes, and Pre-submit Checklist sections but omits the required CI Pipeline Configuration section from the repository template and lacks a code snippet under Usage, making it incomplete against the expected template. Please add the “CI Pipeline Configuration” section with the relevant labels and, if applicable, include a short code snippet under “Usage” to fully match the repository’s PR description template.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly captures the primary change of enabling CI tests on multi-GPU runners without extraneous detail and clearly informs reviewers of the main intent of the pull request.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gagank1
Copy link
Collaborator Author

gagank1 commented Oct 7, 2025

/ok to test 68fa036

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Oct 7, 2025

/ok to test 62f381d

@gagank1 gagank1 added the ciflow:slow Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2 label Oct 7, 2025
@codecov-commenter
Copy link

codecov-commenter commented Oct 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.42%. Comparing base (e5e58c8) to head (ebf33b8).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1229      +/-   ##
==========================================
- Coverage   76.53%   76.42%   -0.12%     
==========================================
  Files         101      101              
  Lines        7931     7931              
==========================================
- Hits         6070     6061       -9     
- Misses       1861     1870       +9     

see 1 file with indirect coverage changes

@trvachov
Copy link
Collaborator

trvachov commented Oct 7, 2025

From standup: Is it possible to start having these run nightly, and track queue time + success rate?

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Oct 9, 2025

/ok to test 250bd29

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Oct 9, 2025

/ok to test 68307b1

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Oct 9, 2025

/ok to test 6fbe6f4

- Merged in latest changes from origin/main
- Added @pytest.mark.multi_gpu to test_distributed_fp8.py::test_multi_process_fp8_recipes_are_synced
- Added @pytest.mark.multi_gpu to test_train.py::test_distributed_training_gradient_equivalence
- These new multi-GPU tests will run on the 2-GPU runner in merge queue/schedule
…g in evo2 preprocessing that ignored random seed during bootstrap

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Oct 11, 2025

/ok to test b777145

@gagank1 gagank1 added ciflow:all Run all tests (unit tests, slow tests, and notebooks) for bionemo2 or enforce running all tests ciflow:multi-gpu Run all multi GPU tests (unit tests, slow tests) for bionemo2 and removed ciflow:slow Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2 labels Oct 14, 2025
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Oct 14, 2025

/ok to test c3f9737

@gagank1 gagank1 marked this pull request as ready for review October 14, 2025 11:33
if: |
(needs.build-bionemo-image.result == 'success') &&
(needs.run-tests-single-gpu.result == 'success') &&
(github.event_name == 'schedule')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should run either on the schedule event or the PR label?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought we wanted to only have it run nightly? or is it nightly + label but no merge queue for now?

@gagank1 gagank1 added the ciflow:multi-gpu Run all multi GPU tests (unit tests, slow tests) for bionemo2 label Nov 7, 2025
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
…ests pass OR on PRs with the ciflow:multi-gpu label

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Comment on lines +199 to +227
runs-on: ubuntu-latest
outputs:
labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
steps:
- name: Get PR number from branch
if: startsWith(github.ref, 'refs/heads/pull-request/')
id: get-pr-num
run: |
PR_NUM=$(echo ${{ github.ref_name }} | grep -oE '[0-9]+$')
echo "pr_num=$PR_NUM" >> $GITHUB_OUTPUT

- name: Get PR labels
id: get-labels
if: startsWith(github.ref, 'refs/heads/pull-request/')
env:
GH_TOKEN: ${{ github.token }}
run: |
LABELS=$(gh api repos/${{ github.repository }}/pulls/${{ steps.get-pr-num.outputs.pr_num }} --jq '[.labels[].name]' || echo "[]")
echo "labels=$LABELS" >> $GITHUB_OUTPUT
echo "Retrieved labels: $LABELS"

- name: Set empty labels for non-PR branches
if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
id: get-labels-empty
run: |
echo "labels=[]" >> $GITHUB_OUTPUT
echo "Set empty labels for non-PR branch"

unit-tests-multi-gpu:

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix

AI 3 days ago

To fix the problem, explicitly define minimal GITHUB_TOKEN permissions in the workflow so the job does not inherit broad defaults. Since the get-pr-labels job uses gh api to read PR information and nothing else, it only needs read access to repository contents and PRs. The cleanest fix without altering functionality is to add a permissions block scoped specifically to the get-pr-labels job. This avoids affecting other jobs whose required permissions are not shown.

Concretely, in .github/workflows/unit-tests-recipes.yml, under the get-pr-labels: job header (around line 215) and before runs-on: ubuntu-latest, add:

    permissions:
      contents: read
      pull-requests: read

No additional imports or methods are needed since this is a YAML workflow configuration change only. All other job steps and behavior remain unchanged.

Suggested changeset 1
.github/workflows/unit-tests-recipes.yml

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/unit-tests-recipes.yml b/.github/workflows/unit-tests-recipes.yml
--- a/.github/workflows/unit-tests-recipes.yml
+++ b/.github/workflows/unit-tests-recipes.yml
@@ -215,6 +215,9 @@
   # With copy-pr-bot, we need to get the PR labels from the PR API rather than from the event metadata.
   get-pr-labels:
     runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      pull-requests: read
     outputs:
       labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
     steps:
EOF
@@ -215,6 +215,9 @@
# With copy-pr-bot, we need to get the PR labels from the PR API rather than from the event metadata.
get-pr-labels:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: read
outputs:
labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
steps:
Copilot is powered by AI and may make mistakes. Always verify output.
@gagank1 gagank1 added the ciflow:slow Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2 label Nov 13, 2025
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Nov 13, 2025

/ok to test 2fe65dc

Comment on lines 266 to 304
needs:
- build-bionemo-image
- run-tests-single-gpu
- get-pr-labels
runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
container:
image: svcbionemo023/bionemo-framework:${{ github.run_id }}
credentials:
username: ${{ vars.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}
# Run multi-GPU tests ONLY when:
# Prerequisites: build succeeds AND single-GPU tests pass
# Then run if: schedule OR (push with ciflow:all OR ciflow:multi-gpu label)
# Do NOT run on merge_group or any other events
if: |
(needs.build-bionemo-image.result == 'success') &&
(needs.run-tests-single-gpu.result == 'success') &&
(
github.event_name == 'schedule' ||
(
github.event_name == 'push' &&
(
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
)
)
)
steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Run multi-GPU tests
env:
BIONEMO_DATA_SOURCE: ngc
run: |
chmod +x ./ci/scripts/run_pytest_multigpu.sh
./ci/scripts/run_pytest_multigpu.sh

run-tests-slow-single-gpu:

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Nov 13, 2025

/ok to test 3bf090d

…plify.py

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Nov 13, 2025

/ok to test 5f17689

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Nov 13, 2025

/ok to test f14497c

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1
Copy link
Collaborator Author

gagank1 commented Nov 14, 2025

/ok to test 91e4dcd

1 similar comment
@gagank1
Copy link
Collaborator Author

gagank1 commented Dec 2, 2025

/ok to test 91e4dcd

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 requested a review from savitha-eng as a code owner December 2, 2025 20:59
@gagank1
Copy link
Collaborator Author

gagank1 commented Dec 2, 2025

/ok to test e8c590a

@gagank1
Copy link
Collaborator Author

gagank1 commented Dec 12, 2025

/ok to test d2bf3ce

@gagank1 gagank1 enabled auto-merge December 12, 2025 21:23
@gagank1
Copy link
Collaborator Author

gagank1 commented Dec 18, 2025

/ok to test 7bcb9ec

@gagank1
Copy link
Collaborator Author

gagank1 commented Jan 29, 2026

/ok to test ebf33b8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow:all Run all tests (unit tests, slow tests, and notebooks) for bionemo2 or enforce running all tests ciflow:multi-gpu Run all multi GPU tests (unit tests, slow tests) for bionemo2 ciflow:slow Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants