Commit f634239

yansun1996 authored and sajmera-pensando committed
[DOC] Add OpenShift specific steps to prepare pre-compiled driver image
Signed-off-by: yansun1996 <Yan.Sun3@amd.com>
1 parent 1308783 commit f634239

3 files changed: 323 additions & 98 deletions

docs/_static/ocp_airgapped.png

143 KB

docs/drivers/precompiled-driver.md

Lines changed: 202 additions & 34 deletions
@@ -8,6 +8,8 @@ The AMD GPU Operator uses the Kernel Module Management (KMM) Operator to deploy
88
- OS release version
99
- Kernel version
1010

11+
Users can prepare pre-compiled driver images in advance and import them into the cluster, so that KMM skips the in-cluster driver build stage and uses those images directly to load the amdgpu kernel modules onto the worker nodes.
12+
1113
## How KMM Selects Driver Images
1214

1315
KMM determines the appropriate driver image based on the combination of:
@@ -17,25 +19,42 @@ KMM determines the appropriate driver image based on the combination of:
1719

1820
### Image Tag Format
1921

20-
KMM looks for images with tags in these formats:
22+
KMM looks up driver images by tag. The controller determines the image tag as follows:
23+
24+
1. Parse the node's `osImage` field to determine the OS and version (`kubectl get node -oyaml | grep -i osImage`):
25+
26+
| osImage | OS | version |
27+
|---------|-----------|-------------------|
28+
| `Ubuntu 24.04.1 LTS` | `Ubuntu` | `24.04` |
29+
| `Red Hat Enterprise Linux CoreOS 9.6.20250916-0 (Plow)` | `coreos` | `9.6` |
30+
31+
2. Read the node's `kernelVersion` field to determine the kernel version (`kubectl get node -oyaml | grep -i kernelVersion`).
32+
3. Read the user-configured amdgpu driver version from the `DeviceConfig` field `spec.driver.version`.
2133

22-
| OS | Tag Format | Example |
23-
|----|------------|---------|
24-
| Ubuntu | `ubuntu-<OS version>-<kernel>-<driver version>` | `ubuntu-22.04-6.8.0-40-generic-6.1.3` |
25-
| RHEL CoreOS | `coreos-<OS version>-<kernel>-<driver version>` | `coreos-416.94-5.14.0-427.28.1.el9_4.x86_64-6.2.2` |
2634

27-
When a DeviceConfig is created, KMM will:
35+
| OS | Tag Format | Example Image Tag |
36+
|----|------------|-------------------|
37+
| `ubuntu` | `ubuntu-<OS version>-<kernel>-<driver version>` | `ubuntu-22.04-6.8.0-40-generic-6.1.3` |
38+
| `coreos` | `coreos-<OS version>-<kernel>-<driver version>` | `coreos-9.6-5.14.0-427.28.1.el9_4.x86_64-6.2.2` |
39+
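The three inputs above can be combined into a tag with a small helper. This is a minimal sketch, assuming the tag is built exactly as the table shows; `compose_driver_tag` is a hypothetical name and not part of KMM or the operator, whose actual parsing logic may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of KMM): builds "<os>-<OS version>-<kernel>-<driver version>".
# Arguments: <osImage string> <kernelVersion> <driver version>
compose_driver_tag() {
  local os_image="$1" kernel="$2" driver="$3" os version
  case "$os_image" in
    Ubuntu*)  os="ubuntu" ;;
    *CoreOS*) os="coreos" ;;
    *) echo "unsupported osImage: $os_image" >&2; return 1 ;;
  esac
  # Take the first "major.minor" pair found in the osImage string.
  version="$(printf '%s\n' "$os_image" | grep -oE '[0-9]+\.[0-9]+' | head -n 1)"
  printf '%s-%s-%s-%s\n' "$os" "$version" "$kernel" "$driver"
}

compose_driver_tag "Red Hat Enterprise Linux CoreOS 9.6.20250916-0 (Plow)" \
  "5.14.0-570.45.1.el9_6.x86_64" "7.0"
# prints: coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0
```

On a live cluster the first two arguments would come from the node's `osImage` and `kernelVersion` fields shown in steps 1 and 2.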
40+
When a DeviceConfig is created with driver management enabled (`spec.driver.enable=true`), KMM will:
2841

2942
1. Check if a matching driver image exists in the registry
3043
2. If not found, build the driver image in-cluster using the AMD GPU Operator's Dockerfile
3144
3. If found, directly use the existing image to install the driver
3245

3346
## Building Pre-compiled Driver Images
3447

35-
### Dockerfile Example
48+
### Ubuntu
49+
50+
Follow these steps to build a pre-compiled driver image. Make sure your system meets the [ROCm Linux system requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html).
51+
52+
1. Prepare the Dockerfile
3653

3754
```dockerfile
38-
FROM ubuntu:$$VERSION as builder
55+
ARG OS_VERSION
56+
FROM ubuntu:${OS_VERSION} as builder
57+
ARG OS_CODENAME
3958
ARG KERNEL_FULL_VERSION
4059
ARG DRIVERS_VERSION
4160
ARG REPO_URL
@@ -57,15 +76,16 @@ RUN apt-get update && apt-get install -y bc \
5776
RUN mkdir --parents --mode=0755 /etc/apt/keyrings
5877
RUN wget ${REPO_URL}/rocm/rocm.gpg.key -O - | \
5978
gpg --dearmor | tee /etc/apt/keyrings/rocm.gpg > /dev/null
60-
RUN echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] ${REPO_URL}/amdgpu/${DRIVERS_VERSION}/ubuntu $$DRIVER_LABEL main" \
79+
RUN echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] ${REPO_URL}/amdgpu/${DRIVERS_VERSION}/ubuntu ${OS_CODENAME} main" \
6180
| tee /etc/apt/sources.list.d/amdgpu.list
6281

6382
# Install and configure driver
6483
RUN apt-get update && apt-get install -y amdgpu-dkms
6584
RUN depmod ${KERNEL_FULL_VERSION}
6685

6786
# Create final image
68-
FROM ubuntu:$$VERSION
87+
ARG OS_VERSION
88+
FROM ubuntu:${OS_VERSION}
6989
ARG KERNEL_FULL_VERSION
7090

7191
RUN apt-get update && apt-get install -y kmod
@@ -74,51 +94,206 @@ RUN apt-get update && apt-get install -y kmod
7494
RUN mkdir -p /opt/lib/modules/${KERNEL_FULL_VERSION}/updates/dkms/
7595
COPY --from=builder /lib/modules/${KERNEL_FULL_VERSION}/updates/dkms/amd* /opt/lib/modules/${KERNEL_FULL_VERSION}/updates/dkms/
7696
COPY --from=builder /lib/modules/${KERNEL_FULL_VERSION}/modules.* /opt/lib/modules/${KERNEL_FULL_VERSION}/
77-
RUN ln -s /lib/modules/${KERNEL_FULL_VERSION}/kernel /opt/lib/modules/${KERNEL_FULL_VERSION}/kernel
97+
COPY --from=builder /lib/modules/${KERNEL_FULL_VERSION}/kernel /opt/lib/modules/${KERNEL_FULL_VERSION}/kernel
7898

7999
# Set up firmware directory
80100
RUN mkdir -p /firmwareDir/updates/amdgpu
81101
COPY --from=builder /lib/firmware/updates/amdgpu /firmwareDir/updates/amdgpu
82102
```
83103

84-
### Build Steps
104+
Build Steps Explanation:
85105

86106
- Choose a base image matching your worker nodes' OS (example: `ubuntu:22.04`)
87107
- Install `amdgpu-dkms` package using the OS package manager
88108
- Update Module Dependencies: run `depmod ${KERNEL_FULL_VERSION}`
89109
- Configure the final image
90110
- Install `kmod` (required for modprobe operations)
91-
- Copy required files to these locations:
111+
- Copy the files that KMM requires to these locations:
92112
- Kernel modules: `/opt/lib/modules/${KERNEL_FULL_VERSION}/`
93113
- Firmware files: `/firmwareDir/updates/amdgpu/`
94114
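Before pushing, the layout listed above can be checked against an unpacked copy of the image filesystem. This is a minimal sketch for the Ubuntu image; `check_driver_image_layout` is a hypothetical helper, and the choice of `modules.dep` as the representative `modules.*` file is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verifies that an unpacked image root contains the
# paths the Ubuntu Dockerfile above copies into the final image.
# Usage: check_driver_image_layout <rootfs-dir> <kernel-version>
check_driver_image_layout() {
  local root="$1" kver="$2" missing=0 path
  for path in \
    "opt/lib/modules/${kver}/modules.dep" \
    "opt/lib/modules/${kver}/updates/dkms" \
    "firmwareDir/updates/amdgpu"; do
    if [ ! -e "${root}/${path}" ]; then
      echo "missing: /${path}" >&2
      missing=1
    fi
  done
  # Exit status 0 means every expected path was found.
  return "$missing"
}
```

One way to obtain such a directory is to create a container from the image and unpack the output of `docker export` into a scratch directory.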

95-
#### Build the final image
115+
2. Trigger the build with the Dockerfile
116+
117+
Make sure the build node has the same OS and kernel as your production nodes.
118+
119+
See [examples](#image-tag-format) to tag the image with the correct tag name.
96120

97121
```bash
122+
source /etc/os-release
123+
export AMDGPU_VERSION=7.0
98124
docker build \
125+
--build-arg OS_VERSION=${VERSION_ID} \
126+
--build-arg OS_CODENAME=${VERSION_CODENAME} \
99127
--build-arg KERNEL_FULL_VERSION=$(uname -r) \
100-
--build-arg DRIVERS_VERSION=6.1.3 \
101-
--build-arg REPO_URL=https://repo.example.com \
102-
-t amdgpu-driver .
128+
--build-arg DRIVERS_VERSION=${AMDGPU_VERSION} \
129+
--build-arg REPO_URL=https://repo.radeon.com \
130+
-t registry.example.com/amdgpu-driver:ubuntu-${VERSION_ID}-$(uname -r)-${AMDGPU_VERSION} .
103131
```
104132

105-
#### Tag the image
106-
107-
See [examples](#image-tag-format) to tag the image with the correct tag name:
133+
3. Push the image to a registry
108134

109135
```bash
110-
docker tag amdgpu-driver registry.example.com/amdgpu-driver:ubuntu-22.04-6.8.0-40-generic-6.1.3
136+
docker push registry.example.com/amdgpu-driver:ubuntu-${VERSION_ID}-$(uname -r)-${AMDGPU_VERSION}
111137
```
112138

113-
#### Push to the image to a registry
139+
### OpenShift - Red Hat Enterprise Linux CoreOS
114140

115-
```bash
116-
docker push registry.example.com/amdgpu-driver:ubuntu-22.04-6.8.0-40-generic-6.1.3
141+
Follow these steps to build a pre-compiled driver image for an OpenShift cluster. Make sure your RHEL version and driver version meet the [ROCm Linux system requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html).
142+
143+
1. Collect System Information
144+
145+
Collect the following system information from the OpenShift build node before configuring the build process:
146+
147+
* kernel version: `uname -r`
148+
* kernel compatible OpenShift DriverToolkit image: `oc adm release info --image-for driver-toolkit`
149+
150+
2. Prepare image registry:
151+
152+
Please decide where you want to push your pre-compiled driver image:
153+
154+
* Case 1: Use OpenShift internal registry:
155+
* Enable the internal registry (skip this step if the registry is already enabled):
156+
```bash
157+
oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
158+
--patch '{"spec":{"storage":{"emptyDir":{}}}}'
159+
oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
160+
--patch '{"spec":{"managementState":"Managed"}}'
161+
# make sure the image registry pods are running
162+
oc get pods -n openshift-image-registry
163+
```
164+
* Create ImageStream
165+
```bash
166+
oc create imagestream amdgpu_kmod
167+
```
168+
* Case 2: Use external image registry:
169+
* Create secret to push image if required:
170+
```bash
171+
kubectl create secret docker-registry docker-auth \
172+
--docker-server=registry.example.com \
173+
--docker-username=xxx \
174+
--docker-password=xxx
175+
```
176+
3. Create OpenShift `BuildConfig`
177+
178+
Create the following YAML file. The full example assumes that you use the OpenShift internal image registry and that the build config is saved in the `default` namespace.
179+
180+
* To configure the build in another namespace, change the namespace accordingly in the example steps.
181+
* To use another image registry, replace the `spec.output` part with this:
182+
183+
```yaml
184+
spec:
185+
  output:
186+
    pushSecret:
187+
      name: docker-auth
188+
    to:
189+
      kind: DockerImage
190+
      # follow the Image Tag Format section to get your image tag
191+
      name: registry.example.com/amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0
192+
```
193+
194+
Full example:
195+
196+
```yaml
197+
kind: BuildConfig
198+
apiVersion: build.openshift.io/v1
199+
metadata:
200+
  name: amd-gpu-operator-build
201+
  namespace: default
202+
  labels:
203+
    app.kubernetes.io/component: build
204+
spec:
205+
  runPolicy: Serial
206+
  nodeSelector: null
207+
  output:
208+
    to:
209+
      kind: ImageStreamTag
210+
      # follow the Image Tag Format section to get your image tag
211+
      name: amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0
212+
  successfulBuildsHistoryLimit: 5
213+
  failedBuildsHistoryLimit: 5
214+
  strategy:
215+
    type: Docker
216+
    dockerStrategy:
217+
      buildArgs:
218+
        - name: DRIVERS_VERSION # amdgpu version
219+
          value: '7.0'
220+
        - name: REPO_URL
221+
          value: 'https://repo.radeon.com'
222+
        - name: KERNEL_VERSION
223+
          value: 5.14.0-570.45.1.el9_6.x86_64
224+
        - name: KERNEL_FULL_VERSION
225+
          value: 5.14.0-570.45.1.el9_6.x86_64
226+
        - name: DTK_AUTO
227+
          # DriverToolkit image, get it from `oc adm release info --image-for driver-toolkit`
228+
          value: 'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b3af1db51aa8a453fbba972e0039a496f0848eb15e6b411ef0bbb7d5ed864ac7'
229+
  serviceAccount: builder
230+
  source:
231+
    type: Dockerfile
232+
    dockerfile: |-
233+
      ARG DTK_AUTO
234+
      FROM ${DTK_AUTO} as builder
235+
      ARG KERNEL_VERSION
236+
      ARG DRIVERS_VERSION
237+
      ARG REPO_URL
238+
      RUN dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm -y && \
239+
          crb enable && \
240+
          sed -i "s/\$releasever/9/g" /etc/yum.repos.d/epel*.repo && \
241+
          dnf install dnf-plugin-config-manager -y && \
242+
          dnf clean all
243+
      RUN dnf install -y 'dnf-command(config-manager)' && \
244+
          dnf config-manager --add-repo=https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/ && \
245+
          dnf config-manager --add-repo=https://mirror.stream.centos.org/9-stream/AppStream/x86_64/os/ && \
246+
          rpm --import https://www.centos.org/keys/RPM-GPG-KEY-CentOS-Official && \
247+
          dnf clean all
248+
      RUN source /etc/os-release && \
249+
          echo -e "[amdgpu] \n\
250+
      name=amdgpu \n\
251+
      baseurl=${REPO_URL}/amdgpu/${DRIVERS_VERSION}/el/${VERSION_ID}/main/x86_64/ \n\
252+
      enabled=1 \n\
253+
      priority=50 \n\
254+
      gpgcheck=1 \n\
255+
      gpgkey=${REPO_URL}/rocm/rocm.gpg.key" > /etc/yum.repos.d/amdgpu.repo
256+
      RUN dnf clean all && \
257+
          cat /etc/yum.repos.d/amdgpu.repo && \
258+
          dnf install amdgpu-dkms -y && \
259+
          depmod ${KERNEL_VERSION} && \
260+
          find /lib/modules/${KERNEL_VERSION} -name "*.ko.xz" -exec xz -d {} \; && \
261+
          depmod ${KERNEL_VERSION}
262+
      RUN mkdir -p /modules_files && \
263+
          mkdir -p /amdgpu_ko_files && \
264+
          mkdir -p /kernel_files && \
265+
          cp /lib/modules/${KERNEL_VERSION}/modules.* /modules_files/ && \
266+
          cp -r /lib/modules/${KERNEL_VERSION}/extra/* /amdgpu_ko_files/ && \
267+
          cp -r /lib/modules/${KERNEL_VERSION}/kernel/* /kernel_files/
268+
      FROM registry.redhat.io/ubi9/ubi-minimal
269+
      ARG KERNEL_VERSION
270+
      RUN microdnf install -y kmod
271+
      COPY --from=builder /amdgpu_ko_files /opt/lib/modules/${KERNEL_VERSION}/extra
272+
      COPY --from=builder /kernel_files /opt/lib/modules/${KERNEL_VERSION}/kernel
273+
      COPY --from=builder /modules_files /opt/lib/modules/${KERNEL_VERSION}/
274+
      COPY --from=builder /lib/firmware/updates/amdgpu /firmwareDir/updates/amdgpu
117275
```
118276
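The `BuildConfig` above repeats the kernel and driver versions in the output image name and in the build arguments, and a mismatch would leave KMM looking for a tag that was never pushed. The check below is a minimal sketch of a consistency test; `tag_matches_build_args` is a hypothetical helper, and it assumes the tag follows the format in the Image Tag Format section.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the image tag ends with "-<kernel>-<driver>".
# Usage: tag_matches_build_args <output image name> <KERNEL_VERSION> <DRIVERS_VERSION>
tag_matches_build_args() {
  local image_name="$1" kernel="$2" driver="$3"
  local tag="${image_name##*:}"   # text after the last ':'
  case "$tag" in
    *"-${kernel}-${driver}") return 0 ;;
    *)
      echo "tag '${tag}' does not end with '-${kernel}-${driver}'" >&2
      return 1
      ;;
  esac
}

tag_matches_build_args \
  "amdgpu_kmod:coreos-9.6-5.14.0-570.45.1.el9_6.x86_64-7.0" \
  "5.14.0-570.45.1.el9_6.x86_64" "7.0" && echo "tag and build args agree"
```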

277+
4. Trigger driver image build
278+
279+
* Option 1 - Web Console:
280+
* Log in to the OpenShift web console with your username and password
281+
* Select `Builds` then select `BuildConfigs` in the navigation bar
282+
* Click `Create BuildConfig`, select the YAML view, and paste in the YAML file created in the last step
283+
* Select the `BuildConfig` in the list, click `Actions` then select `Start Build`
284+
* Select `Builds` on the current `BuildConfig` page; a new build should have been triggered and be in running status.
285+
* Wait for the build to complete. You can monitor progress in the `Logs` section; at the end it should show that the push succeeded.
286+
* Delete the `BuildConfig` if needed.
287+
* Option 2 - Command Line Interface (CLI):
288+
* Create the `BuildConfig` by using the YAML file created in the last step: `oc apply -f build-config.yaml`
289+
* Start the build: `oc start-build amd-gpu-operator-build`
290+
* Check the build status: `oc get build` and `oc get pods | grep build`
291+
* Wait for the build to complete; the logs should show that the push succeeded
292+
* Delete the `BuildConfig` if needed: `oc delete -f build-config.yaml`
293+
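The CLI steps above can be chained in a small script. This is a minimal sketch: `run` and `build_driver_image` are hypothetical helper names, and the `--follow` and `--wait` flags of `oc start-build` are used here in place of polling `oc get build` by hand. Setting `DRY_RUN=1` prints the commands without contacting a cluster.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: prints the command when DRY_RUN is set, otherwise runs it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: build_driver_image <BuildConfig YAML file> <BuildConfig name>
build_driver_image() {
  local build_config_file="$1" build_config_name="$2"
  run oc apply -f "$build_config_file" || return 1
  # --follow streams the build logs; --wait makes the exit status reflect the build result.
  run oc start-build "$build_config_name" --follow --wait || return 1
  run oc get build
}

DRY_RUN=1 build_driver_image build-config.yaml amd-gpu-operator-build
# prints:
# + oc apply -f build-config.yaml
# + oc start-build amd-gpu-operator-build --follow --wait
# + oc get build
```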
119294
## Using Pre-compiled Images
120295
121-
Configure your DeviceConfig to use the pre-compiled images:
296+
In the previous section, [Building Pre-compiled Driver Images](#building-pre-compiled-driver-images), we pushed the driver image to `registry.example.com/amdgpu-driver`. Now configure your `DeviceConfig` to use the pre-compiled images:
122297
123298
```yaml
124299
apiVersion: amd.com/v1alpha1
@@ -129,21 +304,14 @@ metadata:
129304
spec:
130305
driver:
131306
# Registry path without tag - operator manages tags
132-
image: registry.example.com/amdgpu-driver
307+
# If you use the OpenShift internal image registry, the operator selects the internal registry URL by default
308+
image: registry.example.com/amdgpu_kmod
133309
134310
# Registry credentials if required
135311
imageRegistrySecret:
136312
name: docker-auth
137-
138313
# Driver version
139314
version: "7.0"
140-
141-
devicePlugin:
142-
devicePluginImage: rocm/k8s-device-plugin:latest
143-
nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
144-
145-
selector:
146-
feature.node.kubernetes.io/amd-gpu: "true"
147315
```
148316
149317
> **Important**: Do not include the image tag in the `image` field - the operator automatically appends the appropriate tag based on the node's OS and kernel version.
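
A quick way to catch a stray tag before applying the `DeviceConfig` is to inspect the last path component of the image reference. This is a minimal sketch; `image_ref_has_tag` is a hypothetical helper and not part of the operator.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the image reference carries a tag or digest.
image_ref_has_tag() {
  local ref="$1"
  local last="${ref##*/}"   # last path component, so a registry port is ignored
  case "$last" in
    *:*|*@*) return 0 ;;
    *) return 1 ;;
  esac
}

if image_ref_has_tag "registry.example.com/amdgpu-driver"; then
  echo "remove the tag from spec.driver.image"
else
  echo "image reference has no tag"
fi
# prints: image reference has no tag
```

Only the part after the last `/` is checked, so a reference such as `registry.example.com:5000/amdgpu-driver` is not mistaken for a tagged image.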
