chore: added docs for cncf gpu arc #3218
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Pull Request Test Coverage Report for Build 22105459439
💛 - Coveralls
Pull request overview
This PR adds comprehensive documentation for the GPU runner implementation using CNCF's GitHub Actions Runner Controller (ARC). The documentation serves as a reference for maintaining the GPU CI infrastructure that was introduced in PR #3067, covering the migration from OCI VM-based runners to CNCF ARC, NVIDIA driver compatibility challenges, and current limitations with temporary workarounds.
Changes:
- Added `arc-gpu-runner-doc.md` documenting GPU testing infrastructure implementation
- Covers hardware specifications, workflow configuration changes, and NVIDIA driver compatibility issues
- Documents nvkind limitations and temporary workarounds including runtime class patching
The documentation states that "The setup script pins to a known working version" and provides installation commands for NVIDIA Container Toolkit version 1.17.8-1. However, this version pinning does not exist in the current GPU setup script (hack/e2e-setup-gpu-cluster.sh).
The version pinning code shown here only exists in docs/proposals/2432-gpu-testing-on-llm-blueprints/OCI VM/bootstrap.sh, which is from the old VM-based approach. The current ARC-based approach relies on the CNCF runners having NVIDIA Container Toolkit pre-installed, and hack/e2e-setup-gpu-cluster.sh only configures it (lines 62-64) without installing any specific version.
This section should either be removed or clarified to indicate that this was the approach for the old VM setup and that CNCF ARC runners come with the toolkit pre-installed.
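If the toolkit is pre-installed on the runner as described above, a setup script could verify its presence rather than install a pinned version. The following is only a minimal sketch of that idea: `toolkit_status` is a hypothetical helper that does not exist in `hack/e2e-setup-gpu-cluster.sh`, and it assumes the `nvidia-ctk` CLI accepts `--version`.

```bash
#!/usr/bin/env bash
# Sketch only: report whether a pre-installed NVIDIA Container Toolkit CLI
# is available, instead of installing a pinned version.

# toolkit_status is a hypothetical helper (not part of the repository).
# Prints "present: <first version line>" if the CLI is on PATH,
# otherwise prints "missing" and returns a non-zero status.
toolkit_status() {
  local cli="${1:-nvidia-ctk}"
  if ! command -v "$cli" >/dev/null 2>&1; then
    echo "missing"
    return 1
  fi
  # Assumption: the CLI supports --version and prints the version first.
  echo "present: $("$cli" --version 2>/dev/null | head -n 1)"
}
```

A caller could run `toolkit_status || exit 1` early in the job so that a runner image without the toolkit fails fast with a clear message.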
| **Configuration Changes**: | |
| ```bash | |
| # Removed: CDI is disabled to avoid compatibility issues | |
| # sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled | |
| # Current: Use system drivers without CDI | |
The comment "Removed: CDI is disabled to avoid compatibility issues" and the commented-out command suggest that CDI was previously enabled and then disabled. However, this configuration change is not part of the current ARC setup in hack/e2e-setup-gpu-cluster.sh. The actual script only runs lines 62-64 of this code block (without the commented-out CDI line).
This section should be clarified to indicate whether this is historical context from the VM setup or if it's describing the current CNCF ARC runner configuration. If it's the latter, the documentation should explain that CNCF manages this configuration on their runners.
| **Configuration Changes**: | |
| ```bash | |
| # Removed: CDI is disabled to avoid compatibility issues | |
| # sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled | |
| # Current: Use system drivers without CDI | |
| **Configuration Changes (CNCF ARC runners)**: | |
| The following commands are executed by `hack/e2e-setup-gpu-cluster.sh` on the ephemeral ARC VM that backs each GitHub Actions job. The commented line is kept as historical context from the earlier long-lived VM-based runner and is **not** part of the current ARC configuration; CDI/runtime settings on the underlying runner image are managed by CNCF. | |
| ```bash | |
| # Historical (VM-based runner): CDI enablement was disabled due to compatibility issues | |
| # sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled | |
| # Current on CNCF ARC runners: Use system drivers without CDI |
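To make the difference between the historical and current configuration concrete, the sketch below builds the `nvidia-ctk runtime configure` command with CDI as an explicit toggle. The helper name `build_runtime_configure_cmd` is hypothetical, and the non-CDI form is an assumption (the historical command shown above with `--cdi.enabled` dropped); the exact lines in `hack/e2e-setup-gpu-cluster.sh` are not reproduced in this thread.

```bash
#!/usr/bin/env bash
# Sketch only: print the runtime configuration command so it can be
# reviewed before being executed on a runner.

# build_runtime_configure_cmd is a hypothetical helper (not in the repository).
# Argument 1: "true" to append the CDI flag (historical VM-based setup),
#             anything else to omit it (assumed current ARC behavior).
build_runtime_configure_cmd() {
  local enable_cdi="${1:-false}"
  local cmd="sudo nvidia-ctk runtime configure --runtime=docker --set-as-default"
  if [ "$enable_cdi" = "true" ]; then
    cmd="$cmd --cdi.enabled"
  fi
  echo "$cmd"
}
```

Calling `build_runtime_configure_cmd` with no argument prints the command without the CDI flag, while `build_runtime_configure_cmd true` prints the historical form. Printing rather than executing keeps the sketch safe to run on machines without a GPU.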
64b75b1 to 267a656
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
267a656 to 6704c22
What this PR does / why we need it:
This PR documents the GPU runner implementation for KEP-2432 (GPU Testing for LLM Blueprints). It covers:
The documentation serves as a reference for maintaining the GPU CI infrastructure and tracking future improvements.
Which issue(s) this PR fixes:
Fixes #2849
Related:
Checklist: