Skip to content

Comments

[Bug] Autoscaler doesn't support TLS#1119

Merged
kevin85421 merged 3 commits intoray-project:masterfrom
chrisxstyles:master
May 30, 2023
Merged

[Bug] Autoscaler doesn't support TLS#1119
kevin85421 merged 3 commits intoray-project:masterfrom
chrisxstyles:master

Conversation

@chrisxstyles
Copy link
Contributor

@chrisxstyles chrisxstyles commented May 27, 2023

Why are these changes needed?

This PR adds the ability to mount volumes in AutoscalerOptions to enable TLS in Autoscaler. At the moment, if you enable TLS and the autoscaler, the autoscaler is unable to connect to the Ray head because the certificates doesn't exist and there was no way to mount volumes where the certs reside like what you can do in the other containers for example

This PR adds the ability to mount volumes in the AutoscalerOptions so we can mount the TLS volumes.

Related issue number

No github issue but here's the Slack chat for more context - https://ray-distributed.slack.com/archives/C02GFQ82JPM/p1684511397156609

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 self-requested a review May 27, 2023 04:58
@kevin85421
Copy link
Member

Thank you for your contribution! Would you be able to provide a sample YAML file in this directory? Additionally, I have added you to the Slack channel for KubeRay contributors, #kuberay-discuss. We have a bi-weekly community sync meeting, and you are welcome to join.

@chrisxstyles
Copy link
Contributor Author

@kevin85421 - Thank you! I've added the sample YAML file now. Also, looks like 1 check is failing due to the same issue on this PR #1096

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I also tested this PR manually.

# Step 0: Build operator image, create Kind cluster, and load into the Kind cluster
# (path: ray-operator)
make docker-image
kind create cluster --image=kindest/node:v1.23.0
kind load docker-image controller:latest

# Step 1: Install the operator and CRD (path: helm-chart/kuberay-operator)
helm install kuberay-operator . --set image.repository=controller,image.tag=latest

# Step 2: Create an autoscaling RayCluster with TLS support.
# (path: ray-operator/config/samples)
kubectl apply -f ray-cluster.autoscaler.tls.yaml

# Step 3: Test autoscaler
export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)

# 2 worker Pods will be created if the cluster has enough CPUs.
kubectl exec $HEAD_POD -it -c ray-head -- python -c "import ray;ray.init();ray.autoscaler.sdk.request_resources(num_cpus=4)"

expectedContainer.Resources = customResources
expectedContainer.EnvFrom = customEnvFrom
expectedContainer.Env = append(expectedContainer.Env, customEnv...)
expectedContainer.VolumeMounts = append(customVolumeMounts, expectedContainer.VolumeMounts...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment to explain why VolumeMount uses append(customVolumeMounts, expectedContainer.VolumeMounts...) instead of append(expectedContainer.VolumeMounts, customVolumeMounts...)? It may seem confusing because Env uses append(expectedContainer.Env, customEnv...).

I traced the following functions to understand the order of the VolumeMounts slice.

  • The function mergeAutoscalerOverrides: Append AutoscalerOptions.VolumeMounts to the autoscaler container.

    func mergeAutoscalerOverrides(autoscalerContainer *v1.Container, autoscalerOptions *rayiov1alpha1.AutoscalerOptions) {

  • Use addEmptyDir to append the ray-log to the autoscaler container's VolumeMounts.

    addEmptyDir(&pod.Spec.Containers[autoscalerContainerIndex], &pod, RayLogVolumeName, RayLogVolumeMountPath, v1.StorageMediumDefault)

# autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
# The example configuration shown below below represents the DEFAULT values.
# (You may delete autoscalerOptions if the defaults are suitable.)
autoscalerOptions:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is autoscalerOptions the only difference compared to ray-cluster.tls.yaml? If so, we can add the autoscalerOptions section to ray-cluster.tls.yaml. I apologize for missing that. I didn't realize the YAML file would be that long.

@kevin85421
Copy link
Member

I will draft a follow-up PR to address my comments (#1119 (comment), #1119 (comment)). Merge this PR.

@kevin85421 kevin85421 merged commit 2586468 into ray-project:master May 30, 2023
@chrisxstyles chrisxstyles mentioned this pull request Jun 1, 2023
4 tasks
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants