chore: added docs for cncf gpu arc #3218

Open
jaiakash wants to merge 2 commits into kubeflow:master from jaiakash:gpu-arc-doc

@jaiakash
Member

What this PR does / why we need it:
This PR documents the GPU runner implementation for KEP-2432 (GPU Testing for LLM Blueprints). It covers:

  • Migration from OCI VM runners to CNCF's ARC
  • NVIDIA driver compatibility issues and workarounds (nvkind CDI blocker)
  • Current limitations and temporary fixes

The documentation serves as a reference for maintaining the GPU CI infrastructure and tracking future improvements.

Which issue(s) this PR fixes:

Fixes #2849

Related:

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Copilot AI review requested due to automatic review settings February 17, 2026 15:26
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Feb 17, 2026

Pull Request Test Coverage Report for Build 22105459439

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.998%

Totals

  • Change from base Build 22081023611: 0.0%
  • Covered Lines: 1391
  • Relevant Lines: 2484

💛 - Coveralls

Contributor

Copilot AI left a comment

Pull request overview

This PR adds comprehensive documentation for the GPU runner implementation using CNCF's GitHub Actions Runner Controller (ARC). The documentation serves as a reference for maintaining the GPU CI infrastructure that was introduced in PR #3067, covering the migration from OCI VM-based runners to CNCF ARC, NVIDIA driver compatibility challenges, and current limitations with temporary workarounds.

Changes:

  • Added arc-gpu-runner-doc.md documenting GPU testing infrastructure implementation
  • Covers hardware specifications, workflow configuration changes, and NVIDIA driver compatibility issues
  • Documents nvkind limitations and temporary workarounds including runtime class patching

Comment on lines 65 to 76

Copilot AI Feb 17, 2026

The documentation states that "The setup script pins to a known working version" and provides installation commands for NVIDIA Container Toolkit version 1.17.8-1. However, this version pinning does not exist in the current GPU setup script (hack/e2e-setup-gpu-cluster.sh).

The version pinning code shown here only exists in docs/proposals/2432-gpu-testing-on-llm-blueprints/OCI VM/bootstrap.sh, which is from the old VM-based approach. The current ARC-based approach relies on the CNCF runners having NVIDIA Container Toolkit pre-installed, and hack/e2e-setup-gpu-cluster.sh only configures it (lines 62-64) without installing any specific version.

This section should either be removed or clarified to indicate that this was the approach for the old VM setup and that CNCF ARC runners come with the toolkit pre-installed.
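
If the documentation is reworded to say the toolkit comes pre-installed, a setup script could make that dependency explicit by checking for the tool instead of installing a pinned version. The sketch below is illustrative only: `require_tool` is a hypothetical helper that does not exist in `hack/e2e-setup-gpu-cluster.sh`, and it assumes the runner image exposes `nvidia-ctk` on `PATH`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): succeed only if the
# named command is already installed on the runner.
require_tool() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool" >&2
    return 1
  fi
}

# Example use in a setup script (assumes nvidia-ctk is pre-installed):
#   require_tool nvidia-ctk || exit 1
```

Failing early this way would surface a runner image without the toolkit as a clear setup error rather than as a later GPU test failure.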

Comment on lines +87 to +93
**Configuration Changes**:

```bash
# Removed: CDI is disabled to avoid compatibility issues
# sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled

# Current: Use system drivers without CDI
```

Copilot AI Feb 17, 2026

The comment "Removed: CDI is disabled to avoid compatibility issues" and the commented-out command suggest that CDI was previously enabled and then disabled. However, this configuration change is not part of the current ARC setup in hack/e2e-setup-gpu-cluster.sh. The actual script only runs lines 62-64 of this code block (without the commented-out CDI line).

This section should be clarified to indicate whether this is historical context from the VM setup or if it's describing the current CNCF ARC runner configuration. If it's the latter, the documentation should explain that CNCF manages this configuration on their runners.
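
To check what runtime configuration a runner actually ends up with, one option is to inspect the Docker daemon configuration. This is a rough sketch and is not taken from the PR: `default_runtime_is_nvidia` is a hypothetical helper, and it assumes that `nvidia-ctk runtime configure --runtime=docker --set-as-default` records a `"default-runtime": "nvidia"` entry in the `daemon.json` it edits (commonly `/etc/docker/daemon.json`). It does not detect whether CDI is enabled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a Docker daemon.json names "nvidia"
# as the default runtime. Uses a plain grep rather than a JSON parser, so
# it is only a rough check.
default_runtime_is_nvidia() {
  local config="${1:-/etc/docker/daemon.json}"
  [ -r "$config" ] || return 1
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$config"
}

# Example use:
#   default_runtime_is_nvidia && echo "nvidia is the default runtime"
```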

Suggested change

Original:

**Configuration Changes**:
```bash
# Removed: CDI is disabled to avoid compatibility issues
# sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
# Current: Use system drivers without CDI
```

Suggested:

**Configuration Changes (CNCF ARC runners)**:
The following commands are executed by `hack/e2e-setup-gpu-cluster.sh` on the ephemeral ARC VM that backs each GitHub Actions job. The commented line is kept as historical context from the earlier long-lived VM-based runner and is **not** part of the current ARC configuration; CDI/runtime settings on the underlying runner image are managed by CNCF.
```bash
# Historical (VM-based runner): CDI enablement was disabled due to compatibility issues
# sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
# Current on CNCF ARC runners: Use system drivers without CDI
```

@jaiakash force-pushed the gpu-arc-doc branch 2 times, most recently from 64b75b1 to 267a656 on February 17, 2026 15:44
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

Successfully merging this pull request may close these issues.

[GH ARC]: Docs for setup of ARC, yaml manifests, config and gpu based cloudrunner code.
