feat(trainer): implement get_gpu_status for Kubernetesbackend #251

Open
haroon0x wants to merge 1 commit into kubeflow:main from haroon0x:feat/gpu-observability
@haroon0x

What this PR does / why we need it:
This PR implements the get_gpu_status method in the KubernetesBackend, allowing users to retrieve real-time GPU metrics from training pods. This feature enhances observability for AI workloads by providing structured data on GPU utilization, memory usage, temperature, power draw, and performance states.

Key technical changes:

  • Base Layer: Added get_gpu_status as an abstract method in RuntimeBackend.
  • API Layer: Exposed get_gpu_status(job_name) in TrainerClient.
  • Kubernetes Implementation: Uses the Kubernetes exec API to run nvidia-smi inside training pods (node-*). It uses a structured CSV query format (--query-gpu=...) for reliable parsing.
  • Compatibility: Added placeholder implementations for LocalProcessBackend and ContainerBackend to ensure all backends satisfy the revised interface.
  • Verification: Added test_get_gpu_status to kubeflow/trainer/backends/kubernetes/backend_test.py which mocks the Kubernetes stream response.
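
For reviewers, the CSV parsing step described under "Kubernetes Implementation" could look roughly like the sketch below. This is not the code in this PR: the `GpuStatus` dataclass, the `parse_nvidia_smi_csv` helper, and the choice and order of query fields are assumptions made for illustration. The `nvidia-smi` flags themselves (`--query-gpu`, `--format=csv,noheader,nounits`) are real, and the exec call that would produce the output string is omitted.

```python
import csv
from dataclasses import dataclass
from typing import List, Optional

# Real nvidia-smi query fields; which fields the PR requests, and in what
# order, is an assumption made for this sketch.
QUERY_FIELDS = (
    "index,name,utilization.gpu,memory.used,memory.total,"
    "temperature.gpu,power.draw,pstate"
)
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    f"--query-gpu={QUERY_FIELDS}",
    "--format=csv,noheader,nounits",
]


@dataclass
class GpuStatus:
    """Hypothetical container for one GPU's metrics (not the PR's type)."""

    index: int
    name: str
    utilization_pct: Optional[float]
    memory_used_mib: Optional[float]
    memory_total_mib: Optional[float]
    temperature_c: Optional[float]
    power_draw_w: Optional[float]
    pstate: str


def _to_float(value: str) -> Optional[float]:
    """Convert a numeric field, mapping placeholders such as '[N/A]' to None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_nvidia_smi_csv(output: str) -> List[GpuStatus]:
    """Parse headerless, unitless nvidia-smi CSV output, one row per GPU."""
    statuses = []
    for row in csv.reader(output.splitlines(), skipinitialspace=True):
        if len(row) != 8:
            # Skip blank or malformed lines instead of failing the whole call.
            continue
        f = [cell.strip() for cell in row]
        statuses.append(
            GpuStatus(
                index=int(f[0]),
                name=f[1],
                utilization_pct=_to_float(f[2]),
                memory_used_mib=_to_float(f[3]),
                memory_total_mib=_to_float(f[4]),
                temperature_c=_to_float(f[5]),
                power_draw_w=_to_float(f[6]),
                pstate=f[7],
            )
        )
    return statuses
```

Requesting `noheader,nounits` keeps every row to bare values, so the parser never has to strip suffixes such as `MiB` or `W`, and fields the driver cannot report degrade to `None` rather than raising.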

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #165

Checklist:

  • Docs included if any changes are user facing
  • Unit tests added/updated.
  • All 50 existing tests passed (Kubernetes, Local, Container backends).

Signed-off-by: haroon0x <haroonbmc0@gmail.com>
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Development

Successfully merging this pull request may close these issues.

Enhancing GPU Visibility for AI Workloads created with Kubeflow SDK
