Ray is a framework for building and operating distributed applications that need performance, scalability, and fault tolerance. It offers an API for distributed computing and many libraries and tools for machine learning, reinforcement learning, hyperparameter tuning, and more. You can write code once and run it across machines, clusters, or clouds with Ray.
Ray works with Kubernetes to automate provisioning, scaling, and monitoring of Ray applications. Running on Kubernetes also lets Ray take advantage of platform features such as secrets, the Horizontal Pod Autoscaler, and the metrics server.
KubeRay is a Kubernetes operator that simplifies deploying and managing Ray applications on Kubernetes. It provides three Custom Resource Definitions (CRDs) for different use cases:
- **RayCluster**: This CRD defines the configuration of a Ray cluster on Kubernetes, including the cluster name, size, node types, autoscaling policies, and customization options. You can create, modify, or remove a RayCluster with `kubectl` commands or the Ray Jobs CLI. Use this CRD to manage long-lived Ray clusters and run Ray applications on them.
- **RayJob**: This CRD defines the entrypoint, environment, and shutdown behavior of a Ray job on Kubernetes. You can submit a Ray job with the Ray Jobs CLI or the Python SDK. KubeRay automatically creates a temporary Ray cluster for each Ray job and can delete it when the job finishes. Use this CRD to run individual Ray applications to completion on Kubernetes.
- **RayService**: This CRD manages the service name, type, port, and backend configuration of a Ray service. You can create, update, and delete a RayService with `kubectl` or the Ray Serve CLI. KubeRay supports zero-downtime upgrades and high availability for Ray services. Use this CRD to deploy and manage Ray Serve applications on Kubernetes.
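To make the first CRD concrete, the manifest below is a minimal sketch of a RayCluster based on the `ray.io/v1alpha1` schema used later in this article. The resource name, image, Ray version, and replica counts are illustrative, not values this article's steps depend on:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-mini        # hypothetical name for illustration
spec:
  rayVersion: '2.9.0'          # illustrative Ray version; match your image tag
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
  workerGroupSpecs:
    - groupName: workergroup
      replicas: 1
      minReplicas: 1
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0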
This article explains how to deploy a RayCluster, a RayJob, and a RayService on Kubernetes with the KubeRay Operator.
Prerequisites
Before you start:
- Deploy a Kubernetes cluster.
- Access an Ubuntu server over SSH as a non-root user with sudo privileges to manage the cluster.
- Install and configure `kubectl`.
- Install the Helm package manager with the following command:
$ sudo snap install helm --classic
Deploy the Ray Workloads
Ray workloads can run on any machine, cluster, or cloud. In this guide, use KubeRay to deploy Ray workloads on a VKE cluster, using the sample resource definition files from the KubeRay GitHub repository. Start by installing the KubeRay operator.
- Add the KubeRay repository to Helm.

    $ helm repo add kuberay https://ray-project.github.io/kuberay-helm/
- Update the Helm repositories.

    $ helm repo update
- Install the KubeRay operator.

    $ helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0

  The installation can take up to 2 minutes to complete.
- Check that the operator is running.

    $ kubectl get pods

  Output:

    NAME                                READY   STATUS    RESTARTS   AGE
    kuberay-operator-678c7d7997-v4ppc   1/1     Running   0          78s
Deploy a RayCluster
Now that the KubeRay operator is running, you can deploy a RayCluster in the default namespace. Install the Ray project's RayCluster Helm chart, confirm the cluster and its pods are running with `kubectl`, then connect to the cluster as described in the steps below.
- Install the RayCluster from the Helm chart repository.

    $ helm install raycluster kuberay/ray-cluster --version 1.0.0

  The installation may take up to 10 minutes to complete.
- Confirm that the RayCluster is running.

    $ kubectl get rayclusters

  Output:

    NAME                 DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
    raycluster-kuberay   1                 1                   ready    10m33s

  The KubeRay operator starts the RayCluster and creates its head and worker pods.
- View the pods in the RayCluster named raycluster-kuberay to confirm they are running.

    $ kubectl get pods --selector=ray.io/cluster=raycluster-kuberay

  Output:

    NAME                                          READY   STATUS    RESTARTS   AGE
    raycluster-kuberay-head-6sldq                 1/1     Running   0          13m
    raycluster-kuberay-worker-workergroup-jz2k7   1/1     Running   0          13m
Deploy a RayJob
Create a RayJob using the Ray project's RayJob resource definition file. Applying this file creates a RayJob resource, and KubeRay creates a RayCluster with the specified configuration to run the job. Later in this section, use `jq` to parse the JSON status of the RayJob.
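For reference, the core of a RayJob manifest looks roughly like the following. This is a hedged sketch of the `ray.io/v1alpha1` schema, not the contents of the sample file you download below; the name, entrypoint, image, and version are illustrative:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-mini             # hypothetical name for illustration
spec:
  # Command KubeRay submits to the cluster as the Ray job.
  entrypoint: python -c "import ray; ray.init(); print('hello from Ray')"
  # Delete the temporary RayCluster when the job finishes.
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0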
- Download the RayJob resource definition file, `ray_v1alpha1_rayjob.yaml`.

    $ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml
- Start a RayJob by applying the downloaded file.

    $ kubectl apply -f ray_v1alpha1_rayjob.yaml
- View the available RayJob resources.

    $ kubectl get rayjob

  Output:

    NAME            AGE
    rayjob-sample   32s
- Wait at least 3 minutes for the RayJob to start, then view the available RayCluster resources.

    $ kubectl get raycluster

  Output:

    NAME                             DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
    rayjob-sample-raycluster-4fmtr   1                 1                   ready    2m27s
- View the available pods.

    $ kubectl get pods

  Output:

    NAME                                                      READY   STATUS      RESTARTS   AGE
    kuberay-operator-678c7d7997-l6dhq                         1/1     Running     0          91m
    rayjob-sample-4vtkg                                       0/1     Completed   0          2m49s
    rayjob-sample-raycluster-4fmtr-head-q5scw                 1/1     Running     0          3m46s
    rayjob-sample-raycluster-4fmtr-worker-small-group-xqhrt   1/1     Running     0          3m46s
- Install `jq`, which you need to parse the JSON output of `kubectl get rayjob`.

    $ sudo apt install jq
- Verify that the job has finished.

    $ kubectl get rayjobs.ray.io rayjob-sample -o json | jq '.status.jobStatus'

  Output:

    "SUCCEEDED"
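If you prefer not to depend on `jq`, the same field can be read with Python's standard `json` module. The JSON string below is an abridged, illustrative stand-in for the real output of `kubectl get rayjobs.ray.io rayjob-sample -o json`:

```python
import json

# Abridged, illustrative stand-in for the JSON that
# `kubectl get rayjobs.ray.io rayjob-sample -o json` returns.
sample = '{"status": {"jobStatus": "SUCCEEDED", "jobDeploymentStatus": "Complete"}}'

# Drill into .status.jobStatus, the same path the jq filter uses.
status = json.loads(sample)["status"]["jobStatus"]
print(status)  # SUCCEEDED
```

In a script you would pipe the live `kubectl ... -o json` output into this instead of the literal string.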
- View the RayJob output.

    $ kubectl logs -l=job-name=rayjob-sample

  Output:

    2024-01-23 06:50:44,384 INFO cli.py:27 -- Job submission server address: http://rayjob-sample-raycluster-9c546-head-svc.default.svc.cluster.local:8265
    2024-01-23 06:50:44,385 SUCC cli.py:33 -- ------------------------------------------------
    2024-01-23 06:50:44,386 SUCC cli.py:34 -- Job 'rayjob-sample-4fmtr' submitted successfully
    2024-01-23 06:50:44,387 SUCC cli.py:35 -- ------------------------------------------------
    2024-01-23 06:50:44,388 INFO cli.py:226 -- Next steps
    2024-01-23 06:50:44,389 INFO cli.py:227 -- Query the logs of the job:
    2024-01-23 06:50:44,390 INFO cli.py:229 -- ray job logs rayjob-sample-4fmtr
Deploy a RayService
Deploy a RayService with the Ray project's RayService definition file. Applying this file to your VKE cluster creates a RayService resource, and the KubeRay operator creates and manages a RayCluster for the RayService.
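For orientation, the shape of a RayService manifest is sketched below based on the `ray.io/v1alpha1` schema. The application name, import path, image, and version are hypothetical placeholders, not the contents of the sample file used in the steps that follow:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-mini          # hypothetical name for illustration
spec:
  # Ray Serve config (schema v2), embedded as a YAML string.
  serveConfigV2: |
    applications:
      - name: example_app
        import_path: example_module.app   # hypothetical Serve application
        route_prefix: /
  # Spec of the RayCluster the operator manages for this service.
  rayClusterConfig:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0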
- Download the RayService resource definition file, `ray_v1alpha1_rayservice.yaml`.

    $ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray_v1alpha1_rayservice.yaml
- Start a RayService by applying the downloaded file.

    $ kubectl apply -f ray_v1alpha1_rayservice.yaml
- View the available RayService resources.

    $ kubectl get rayservice

  Output:

    NAME                AGE
    rayservice-sample   42s
- Wait at least 3 minutes for the RayService to start, then view the available RayCluster resources.

    $ kubectl get raycluster

  Output:

    NAME                                 DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
    rayservice-sample-raycluster-zpjwg   1                 1                   ready    2m27s
- View the available pods.

    $ kubectl get pods -l=ray.io/is-ray-node=yes

  Output:

    NAME                                                          READY   STATUS    RESTARTS   AGE
    rayservice-sample-raycluster-zpjwg-worker-small-group-vfvjb   1/1     Running   0          3m52s
    rayservice-sample-raycluster-zpjwg-head-mscgh                 1/1     Running   0          3m52s
- View the available Ray services.

    $ kubectl get services

  Output:

    NAME                                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                   AGE
    rayservice-sample-head-svc                    ClusterIP   10.96.60.41    <none>        10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP   4m58s
    rayservice-sample-raycluster-zpjwg-head-svc   ClusterIP   10.96.77.237   <none>        10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP   5m25s
    rayservice-sample-serve-svc                   ClusterIP   10.96.161.84   <none>        8000/TCP                                                  2m48s
Test the RayCluster
Submit a job that executes directly on the head pod of the RayCluster by following these steps:
- Get the name of the head pod.

    $ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
- View the name of the head pod.

    $ echo $HEAD_POD

  Output:

    raycluster-kuberay-head-6sldq
- Submit a simple Ray job that prints the cluster resources.

    $ kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

  Output:

    2024-01-23 10:57:46,041 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
    2024-01-23 10:57:46,126 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265
    {'memory': 3000000000.0, 'object_store_memory': 743061503.0, 'node:10.244.0.7': 1.8, 'CPU': 2.0, 'node:10.244.0.6': 1.0, 'node:__internal_head__': 1.0}
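The last line of that output is a plain Python dictionary, so you can post-process it like any dict. The sketch below summarizes node count, CPUs, and memory from an illustrative copy of such a dictionary (the IP-keyed entries and values are examples, not live cluster data):

```python
# Illustrative copy of the dictionary that ray.cluster_resources() returns;
# keys of the form 'node:<IP>' identify individual nodes, and
# 'node:__internal_head__' is a synthetic marker for the head node.
resources = {
    'memory': 3000000000.0,
    'object_store_memory': 743061503.0,
    'node:10.244.0.7': 1.0,
    'CPU': 2.0,
    'node:10.244.0.6': 1.0,
    'node:__internal_head__': 1.0,
}

# Count real nodes, skipping the synthetic head marker.
nodes = [k for k in resources if k.startswith('node:') and k != 'node:__internal_head__']
print(f"nodes: {len(nodes)}, CPUs: {resources['CPU']:.0f}, memory: {resources['memory'] / 1e9:.1f} GB")
# nodes: 2, CPUs: 2, memory: 3.0 GB
```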
Conclusion
You deployed a RayCluster, a RayJob, and a RayService on a Kubernetes cluster using the KubeRay operator. Ray workloads support features such as distributed data loading, distributed training, fault tolerance, batch inference, multi-model serving, and dynamic scaling. For more information about Ray, see the official documentation.