Ray is a framework for building and operating distributed applications that need performance, scalability, and fault tolerance. It offers an API for distributed computing and many libraries and tools for machine learning, reinforcement learning, hyperparameter tuning, and more. You can write code once and run it across machines, clusters, or clouds with Ray.
Ray works with Kubernetes to automate provisioning, scaling, and monitoring of Ray applications. Running on Kubernetes also lets Ray take advantage of platform features such as secrets, the Horizontal Pod Autoscaler, and the metrics server.
KubeRay is a Kubernetes operator that simplifies deploying and managing Ray applications on Kubernetes. It provides three Custom Resource Definitions (CRDs) for different use cases:
- **RayCluster**: This CRD defines the configuration of a Ray cluster on Kubernetes, including the cluster name, size, node types, autoscaling policies, and customization options. You can create, modify, or remove a RayCluster with `kubectl` commands or the Ray Jobs CLI. Use this CRD to manage long-lived Ray clusters and run Ray applications on them.
- **RayJob**: This CRD defines the entrypoint, environment, and shutdown behavior of a Ray job on Kubernetes. You can submit a Ray job with the Ray Jobs CLI or the Python SDK. KubeRay automatically creates a temporary Ray cluster for each Ray job and can delete it when the job finishes. Use this CRD to run individual Ray applications to completion on Kubernetes.
- **RayService**: This CRD manages the service name, type, port, and backend configuration of a Ray service. You can create, update, and delete a RayService with `kubectl` or the Ray Serve CLI. KubeRay supports zero-downtime upgrades and high availability for Ray services. Use this CRD to deploy and manage Ray Serve applications on Kubernetes.
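To make the first CRD concrete, the manifest below is a minimal sketch of a RayCluster based on the `ray.io/v1alpha1` schema used later in this article. The resource name, image, Ray version, and replica counts are illustrative, not values this article's steps depend on:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-mini        # hypothetical name for illustration
spec:
  rayVersion: '2.9.0'          # illustrative Ray version; match your image tag
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
  workerGroupSpecs:
    - groupName: workergroup
      replicas: 1
      minReplicas: 1
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0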
This article explains how to deploy a RayCluster, a RayJob, and a RayService on Kubernetes with the KubeRay Operator.
Prerequisites
Before you start:
- Deploy a Kubernetes cluster.
- Access an Ubuntu server over SSH as a non-root user with sudo privileges to manage the cluster.
- Install and configure `kubectl`.
- Install the Helm package manager with the following command:
$ sudo snap install helm --classic
Deploy the Ray Workloads
Ray workloads can run on any machine, cluster, or cloud. In this guide, use KubeRay to deploy Ray workloads on a VKE cluster, using the sample resource definition files from the KubeRay GitHub repository. Start by installing the KubeRay operator.
- Add the KubeRay repository to Helm.

    $ helm repo add kuberay https://ray-project.github.io/kuberay-helm/
- Update the Helm repositories.

    $ helm repo update
- Install the KubeRay operator.

    $ helm install kuberay-operator kuberay/kuberay-operator --version 1.0.0

  The installation can take up to 2 minutes to complete.
- Check that the operator is running.

    $ kubectl get pods

  Output:

    NAME                                READY   STATUS    RESTARTS   AGE
    kuberay-operator-678c7d7997-v4ppc   1/1     Running   0          78s
Deploy a RayCluster
Now that the KubeRay operator is running, you can deploy a RayCluster in the default namespace. Install the Ray project's RayCluster Helm chart, confirm the cluster and its pods are running with `kubectl`, then connect to the cluster as described in the steps below.
- Install the RayCluster from the Helm chart repository.

    $ helm install raycluster kuberay/ray-cluster --version 1.0.0

  The installation may take up to 10 minutes to complete.
- Confirm that the RayCluster is running.

    $ kubectl get rayclusters

  Output:

    NAME                 DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
    raycluster-kuberay   1                 1                   ready    10m33s

  The KubeRay operator starts the RayCluster and creates its head and worker pods.
- View the pods in the RayCluster named raycluster-kuberay to confirm they are running.

    $ kubectl get pods --selector=ray.io/cluster=raycluster-kuberay

  Output:

    NAME                                          READY   STATUS    RESTARTS   AGE
    raycluster-kuberay-head-6sldq                 1/1     Running   0          13m
    raycluster-kuberay-worker-workergroup-jz2k7   1/1     Running   0          13m
Deploy a RayJob
Create a RayJob using the Ray project's RayJob resource definition file. Applying this file creates a RayJob resource, and KubeRay creates a RayCluster with the specified configuration to run the job. Later in this section, use `jq` to parse the JSON status of the RayJob.
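For reference, the core of a RayJob manifest looks roughly like the following. This is a hedged sketch of the `ray.io/v1alpha1` schema, not the contents of the sample file you download below; the name, entrypoint, image, and version are illustrative:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-mini             # hypothetical name for illustration
spec:
  # Command KubeRay submits to the cluster as the Ray job.
  entrypoint: python -c "import ray; ray.init(); print('hello from Ray')"
  # Delete the temporary RayCluster when the job finishes.
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0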
- Download the RayJob resource definition file, `ray_v1alpha1_rayjob.yaml`.

    $ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml
- Start a RayJob by applying the downloaded file.

    $ kubectl apply -f ray_v1alpha1_rayjob.yaml
- View the available RayJob resources.

    $ kubectl get rayjob

  Output:

    NAME            AGE
    rayjob-sample   32s
- Wait at least 3 minutes for the RayJob to start, then view the available RayCluster resources.

    $ kubectl get raycluster

  Output:

    NAME                             DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
    rayjob-sample-raycluster-4fmtr   1                 1                   ready    2m27s
- View the available pods.

    $ kubectl get pods

  Output:

    NAME                                                      READY   STATUS      RESTARTS   AGE
    kuberay-operator-678c7d7997-l6dhq                         1/1     Running     0          91m
    rayjob-sample-4vtkg                                       0/1     Completed   0          2m49s
    rayjob-sample-raycluster-4fmtr-head-q5scw                 1/1     Running     0          3m46s
    rayjob-sample-raycluster-4fmtr-worker-small-group-xqhrt   1/1     Running     0          3m46s
- Install `jq`, which you need to parse the JSON output of `kubectl get rayjob`.

    $ sudo apt install jq
- Verify that the job has finished.

    $ kubectl get rayjobs.ray.io rayjob-sample -o json | jq '.status.jobStatus'

  Output:

    "SUCCEEDED"
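If you prefer not to depend on `jq`, the same field can be read with Python's standard `json` module. The JSON string below is an abridged, illustrative stand-in for the real output of `kubectl get rayjobs.ray.io rayjob-sample -o json`:

```python
import json

# Abridged, illustrative stand-in for the JSON that
# `kubectl get rayjobs.ray.io rayjob-sample -o json` returns.
sample = '{"status": {"jobStatus": "SUCCEEDED", "jobDeploymentStatus": "Complete"}}'

# Drill into .status.jobStatus, the same path the jq filter uses.
status = json.loads(sample)["status"]["jobStatus"]
print(status)  # SUCCEEDED
```

In a script you would pipe the live `kubectl ... -o json` output into this instead of the literal string.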
- View the RayJob output.

    $ kubectl logs -l=job-name=rayjob-sample

  Output:

    2024-01-23 06:50:44,384 INFO cli.py:27 -- Job submission server address: http://rayjob-sample-raycluster-9c546-head-svc.default.svc.cluster.local:8265
    2024-01-23 06:50:44,385 SUCC cli.py:33 -- ------------------------------------------------
    2024-01-23 06:50:44,386 SUCC cli.py:34 -- Job 'rayjob-sample-4fmtr' submitted successfully
    2024-01-23 06:50:44,387 SUCC cli.py:35 -- ------------------------------------------------
    2024-01-23 06:50:44,388 INFO cli.py:226 -- Next steps
    2024-01-23 06:50:44,389 INFO cli.py:227 -- Query the logs of the job:
    2024-01-23 06:50:44,390 INFO cli.py:229 -- ray job logs rayjob-sample-4fmtr
Deploy a RayService
Deploy a RayService with the Ray project's RayService definition file. Applying this file to your VKE cluster creates a RayService resource, and the KubeRay operator creates and manages a RayCluster for the RayService.
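For orientation, the shape of a RayService manifest is sketched below based on the `ray.io/v1alpha1` schema. The application name, import path, image, and version are hypothetical placeholders, not the contents of the sample file used in the steps that follow:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-mini          # hypothetical name for illustration
spec:
  # Ray Serve config (schema v2), embedded as a YAML string.
  serveConfigV2: |
    applications:
      - name: example_app
        import_path: example_module.app   # hypothetical Serve application
        route_prefix: /
  # Spec of the RayCluster the operator manages for this service.
  rayClusterConfig:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0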
- Download the RayService resource definition file, `ray_v1alpha1_rayservice.yaml`.

    $ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/v1.0.0/ray-operator/config/samples/ray_v1alpha1_rayservice.yaml
- Start a RayService by applying the downloaded file.

    $ kubectl apply -f ray_v1alpha1_rayservice.yaml
- View the available RayService resources.

    $ kubectl get rayservice

  Output:

    NAME                AGE
    rayservice-sample   42s
- Wait at least 3 minutes for the RayService to start, then view the available RayCluster resources.

    $ kubectl get raycluster

  Output:

    NAME                                 DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
    rayservice-sample-raycluster-zpjwg   1                 1                   ready    2m27s
- View the available pods.

    $ kubectl get pods -l=ray.io/is-ray-node=yes

  Output:

    NAME                                                          READY   STATUS    RESTARTS   AGE
    rayservice-sample-raycluster-zpjwg-worker-small-group-vfvjb   1/1     Running   0          3m52s
    rayservice-sample-raycluster-zpjwg-head-mscgh                 1/1     Running   0          3m52s
- View the available Ray services.

    $ kubectl get services

  Output:

    NAME                                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                   AGE
    rayservice-sample-head-svc                    ClusterIP   10.96.60.41    <none>        10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP   4m58s
    rayservice-sample-raycluster-zpjwg-head-svc   ClusterIP   10.96.77.237   <none>        10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP   5m25s
    rayservice-sample-serve-svc                   ClusterIP   10.96.161.84   <none>        8000/TCP                                                  2m48s
Test the RayCluster
Submit a job that executes directly on the head pod of the RayCluster by following these steps:
- Get the name of the head pod.

    $ export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
- View the name of the head pod.

    $ echo $HEAD_POD

  Output:

    raycluster-kuberay-head-6sldq
- Submit a simple Ray job that prints the cluster resources.

    $ kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

  Output:

    2024-01-23 10:57:46,041 INFO worker.py:1458 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...
    2024-01-23 10:57:46,126 INFO worker.py:1633 -- Connected to Ray cluster. View the dashboard at http://10.244.0.6:8265
    {'memory': 3000000000.0, 'object_store_memory': 743061503.0, 'node:10.244.0.7': 1.8, 'CPU': 2.0, 'node:10.244.0.6': 1.0, 'node:__internal_head__': 1.0}
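The last line of that output is a plain Python dictionary, so you can post-process it like any dict. The sketch below summarizes node count, CPUs, and memory from an illustrative copy of such a dictionary (the IP-keyed entries and values are examples, not live cluster data):

```python
# Illustrative copy of the dictionary that ray.cluster_resources() returns;
# keys of the form 'node:<IP>' identify individual nodes, and
# 'node:__internal_head__' is a synthetic marker for the head node.
resources = {
    'memory': 3000000000.0,
    'object_store_memory': 743061503.0,
    'node:10.244.0.7': 1.0,
    'CPU': 2.0,
    'node:10.244.0.6': 1.0,
    'node:__internal_head__': 1.0,
}

# Count real nodes, skipping the synthetic head marker.
nodes = [k for k in resources if k.startswith('node:') and k != 'node:__internal_head__']
print(f"nodes: {len(nodes)}, CPUs: {resources['CPU']:.0f}, memory: {resources['memory'] / 1e9:.1f} GB")
# nodes: 2, CPUs: 2, memory: 3.0 GB
```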
Conclusion
You deployed a RayCluster, a RayJob, and a RayService on a Kubernetes cluster using the KubeRay operator. Ray workloads support features such as distributed data loading, distributed training, fault tolerance, batch inference, multi-model serving, and dynamic scaling. For more information about Ray, see the official documentation.