-
Notifications
You must be signed in to change notification settings - Fork 116
CI tests on multi-GPU runners #1229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
📝 WalkthroughWalkthroughAdds a Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Dev as Developer
participant GH as GitHub Actions
participant Jobs as unit-tests-* jobs
participant Verify as verify-tests-status
Dev->>GH: Push PR or schedule
GH->>Jobs: Start workflow (matrix)
alt PR CI
Jobs->>Jobs: Run single-GPU jobs (multi-GPU jobs skipped)
else Nightly / scheduled
Jobs->>Jobs: Run single-GPU jobs
Jobs->>Jobs: Run multi-GPU jobs after single-GPU success
end
Jobs-->>GH: Upload artifacts/coverage
Jobs->>Verify: Report job results
Verify-->>GH: Aggregate status
sequenceDiagram
autonumber
participant Job as CI job
participant Runner as pytest_runner.sh
participant PyTest as pytest
Job->>Runner: Invoke with flags (--only-multi-gpu / --skip-multi-gpu / --skip-slow / --only-slow)
Runner->>Runner: Build MARKER_EXPR from flags (multi_gpu, slow)
alt MARKER_EXPR non-empty
Runner->>PyTest: pytest -m "<MARKER_EXPR>" [options]
else
Runner->>PyTest: pytest [options]
end
PyTest-->>Job: Return results and reports
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/ok to test 68fa036 |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 62f381d |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1229 +/- ##
==========================================
- Coverage 76.53% 76.42% -0.12%
==========================================
Files 101 101
Lines 7931 7931
==========================================
- Hits 6070 6061 -9
- Misses 1861 1870 +9 |
|
From standup: Is it possible to start having these run nightly, and track queue time + success rate? |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 250bd29 |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 68307b1 |
...onemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py
Fixed
Show fixed
Hide fixed
...bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py
Fixed
Show fixed
Hide fixed
...s/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py
Fixed
Show fixed
Hide fixed
.../bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py
Fixed
Show fixed
Hide fixed
.../bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py
Fixed
Show fixed
Hide fixed
...es/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py
Fixed
Show fixed
Hide fixed
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 6fbe6f4 |
- Merged in latest changes from origin/main - Added @pytest.mark.multi_gpu to test_distributed_fp8.py::test_multi_process_fp8_recipes_are_synced - Added @pytest.mark.multi_gpu to test_train.py::test_distributed_training_gradient_equivalence - These new multi-GPU tests will run on the 2-GPU runner in merge queue/schedule
…g in evo2 preprocessing that ignored random seed during bootstrap Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test b777145 |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test c3f9737 |
| if: | | ||
| (needs.build-bionemo-image.result == 'success') && | ||
| (needs.run-tests-single-gpu.result == 'success') && | ||
| (github.event_name == 'schedule') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should run either on the schedule event or the PR label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought we wanted to only have it run nightly? or is it nightly + label but no merge queue for now?
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
…ests pass OR on PRs with the ciflow:multi-gpu label Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
| runs-on: ubuntu-latest | ||
| outputs: | ||
| labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }} | ||
| steps: | ||
| - name: Get PR number from branch | ||
| if: startsWith(github.ref, 'refs/heads/pull-request/') | ||
| id: get-pr-num | ||
| run: | | ||
| PR_NUM=$(echo ${{ github.ref_name }} | grep -oE '[0-9]+$') | ||
| echo "pr_num=$PR_NUM" >> $GITHUB_OUTPUT | ||
|
|
||
| - name: Get PR labels | ||
| id: get-labels | ||
| if: startsWith(github.ref, 'refs/heads/pull-request/') | ||
| env: | ||
| GH_TOKEN: ${{ github.token }} | ||
| run: | | ||
| LABELS=$(gh api repos/${{ github.repository }}/pulls/${{ steps.get-pr-num.outputs.pr_num }} --jq '[.labels[].name]' || echo "[]") | ||
| echo "labels=$LABELS" >> $GITHUB_OUTPUT | ||
| echo "Retrieved labels: $LABELS" | ||
|
|
||
| - name: Set empty labels for non-PR branches | ||
| if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }} | ||
| id: get-labels-empty | ||
| run: | | ||
| echo "labels=[]" >> $GITHUB_OUTPUT | ||
| echo "Set empty labels for non-PR branch" | ||
|
|
||
| unit-tests-multi-gpu: |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI 3 days ago
To fix the problem, explicitly define minimal GITHUB_TOKEN permissions in the workflow so the job does not inherit broad defaults. Since the get-pr-labels job uses gh api to read PR information and nothing else, it only needs read access to repository contents and PRs. The cleanest fix without altering functionality is to add a permissions block scoped specifically to the get-pr-labels job. This avoids affecting other jobs whose required permissions are not shown.
Concretely, in .github/workflows/unit-tests-recipes.yml, under the get-pr-labels: job header (around line 215) and before runs-on: ubuntu-latest, add:
permissions:
contents: read
pull-requests: readNo additional imports or methods are needed since this is a YAML workflow configuration change only. All other job steps and behavior remain unchanged.
-
Copy modified lines R218-R220
| @@ -215,6 +215,9 @@ | ||
| # With copy-pr-bot, we need to get the PR labels from the PR API rather than from the event metadata. | ||
| get-pr-labels: | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: read | ||
| pull-requests: read | ||
| outputs: | ||
| labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }} | ||
| steps: |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 2fe65dc |
| needs: | ||
| - build-bionemo-image | ||
| - run-tests-single-gpu | ||
| - get-pr-labels | ||
| runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo | ||
| container: | ||
| image: svcbionemo023/bionemo-framework:${{ github.run_id }} | ||
| credentials: | ||
| username: ${{ vars.DOCKER_USERNAME }} | ||
| password: ${{ secrets.DOCKER_PASSWORD }} | ||
| # Run multi-GPU tests ONLY when: | ||
| # Prerequisites: build succeeds AND single-GPU tests pass | ||
| # Then run if: schedule OR (push with ciflow:all OR ciflow:multi-gpu label) | ||
| # Do NOT run on merge_group or any other events | ||
| if: | | ||
| (needs.build-bionemo-image.result == 'success') && | ||
| (needs.run-tests-single-gpu.result == 'success') && | ||
| ( | ||
| github.event_name == 'schedule' || | ||
| ( | ||
| github.event_name == 'push' && | ||
| ( | ||
| contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') || | ||
| contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu') | ||
| ) | ||
| ) | ||
| ) | ||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Run multi-GPU tests | ||
| env: | ||
| BIONEMO_DATA_SOURCE: ngc | ||
| run: | | ||
| chmod +x ./ci/scripts/run_pytest_multigpu.sh | ||
| ./ci/scripts/run_pytest_multigpu.sh | ||
|
|
||
| run-tests-slow-single-gpu: |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 3bf090d |
…plify.py Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 5f17689 |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test f14497c |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test 91e4dcd |
1 similar comment
|
/ok to test 91e4dcd |
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
|
/ok to test e8c590a |
|
/ok to test d2bf3ce |
|
/ok to test 7bcb9ec |
|
/ok to test ebf33b8 |
Description
Usage
Simply add @pytest.mark.multi_gpu to any future multi-gpu tests. Use the new ciflow:multi-gpu label to run them in PRs. Otherwise, they will run in the merge queue and on the nightly schedule.
Type of changes
Pre-submit Checklist
Summary by CodeRabbit
New Features
Documentation
CI
Tests
Bug Fixes