Introduction
As modern ML models become increasingly large and complex, training them often requires leveraging hundreds or thousands of accelerator chips spread across many hosts. Kubernetes has become a natural choice for scheduling and managing these distributed training workloads, but existing primitives aren’t always enough to capture the unique patterns of ML and HPC jobs.
This is where JobSet comes in—a unified, open source API designed to simplify distributed ML training and HPC workloads on Kubernetes.
What is JobSet?
JobSet models a distributed batch workload as a group of Kubernetes Jobs. Instead of dealing with fragmented solutions (such as custom resources for different ML frameworks), JobSet provides a unified way to manage:
- Multi-template Pods: Different roles (like driver and workers) can have separate pod templates.
- Job Groups: Group pods into clusters (e.g., per network topology) to optimize communication.
- Inter-Pod Communication: Automatic creation and lifecycle management of the headless services pods use to discover each other (see the naming example after this list).
- Startup Sequencing: Enforcing the order in which pods should start (e.g., starting the driver only after the workers are ready, or vice versa).
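To make the inter-pod communication point concrete: JobSet gives each pod a stable hostname of the form <jobset-name>-<replicated-job-name>-<job-index>-<pod-index>, registered under a headless service whose name defaults to the JobSet name (this is the documented default; your version or an explicit network configuration may differ). For the demo later in this post, the first pod of the first worker Job would be reachable at a DNS name like:

multislice-workers-0-0.multislice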
Why JobSet?
Modern distributed training – especially for large language models (LLMs) – often requires scaling out across tens of thousands of GPUs or TPUs. However, native Kubernetes Jobs alone do not provide:
- Multiple Pod Templates: For complex workloads, different components require different resource specifications and policies.
- Topology-Aware Scheduling: Distributing workload across accelerator islands (such as GPU racks or TPU slices) while minimizing high-latency network communication.
- Coordinated Lifecycle Management: Automatically managing service discovery, pod failures, and restart policies.
JobSet addresses these challenges by extending the Job API with additional features tailored to ML and HPC workloads.
JobSet Architecture and Key Features
- Replicated Jobs: Define a ReplicatedJob as a template and let JobSet create multiple child Jobs. For example, you can split a distributed training workload into smaller jobs, each running on a different accelerator island.
- Automatic Headless Service Management: Enables pod-to-pod communication by automatically managing the headless services required for distributed training.
- Configurable Success and Failure Policies: Mark a JobSet complete only when all necessary replicas succeed, or restart the entire JobSet if one of the critical jobs fails (see the policy sketch after this list).
- Exclusive Placement per Topology Domain: Ensure that each child Job runs exclusively on a specified topology domain, such as a single rack or TPU slice, thereby optimizing intra-island communication.
- Kueue Integration: Allow for job queuing and oversubscription of cluster capacity, so workloads are admitted as capacity becomes available and scheduling deadlocks are avoided in busy clusters.
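To make the policy knobs concrete, here is a minimal sketch of a success and failure policy. The field names follow the jobset.x-k8s.io/v1alpha2 API used in the demo below; check the JobSet API reference for the version you have installed:

spec:
  # Mark the JobSet complete only when every "workers" replica succeeds
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - workers
  # Recreate the child Jobs up to 3 times if any of them fails
  failurePolicy:
    maxRestarts: 3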
Demo: Running JobSet with JAX on TPU Slices
In this demo, we will run a distributed ML training workload on 4 TPU v5e slices using JAX. Follow the steps below to deploy the JobSet.
Step 1: Define the JobSet Spec
Save the following YAML into a file (e.g., jobset-demo.yaml):
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: "4"
Also, create the JobSet Custom Resource Definition (CRD) using the command below. Note that the CRD alone only registers the API type; the JobSet controller must also be running in your cluster for JobSets to be reconciled (see the JobSet installation docs for the full setup):
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/jobset/refs/heads/main/charts/jobset/crds/jobset.x-k8s.io_jobsets.yaml
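You can verify that the CRD was registered before proceeding:

kubectl get crd jobsets.jobset.x-k8s.io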
Step 2: Submit the JobSet
Apply the JobSet spec to your Kubernetes cluster using:
kubectl apply -f jobset-demo.yaml
This command instructs Kubernetes to create a JobSet that internally spawns four child Jobs. Each child Job is configured to run on a different TPU slice and launches two pods (one per VM in the slice).
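Once applied, you can list the child Jobs and pods the controller created. The label selector below assumes JobSet's standard jobset.sigs.k8s.io/jobset-name label; adjust it if your version labels resources differently:

kubectl get jobs,pods -l jobset.sigs.k8s.io/jobset-name=multislice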
Step 3: Monitor the JobSet
To observe the status of your JobSet and its child Jobs, use:
kubectl get jobset multislice -o yaml
Before running any of these kubectl commands, make sure kubectl is pointed at the cluster where you deployed the JobSet (for this TPU demo that means a GKE cluster with TPU node pools; a local minikube or microk8s cluster will not provide TPU slices). Additionally, check the logs of one of the pods to verify that your container is running the JAX workload correctly:
kubectl logs <pod-name>
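If you don't want to look up an individual pod name, you can also stream logs from all pods in the JobSet with a label selector (again assuming the standard jobset-name label):

kubectl logs -l jobset.sigs.k8s.io/jobset-name=multislice --prefix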
Expected Output and Observations
When you inspect the logs from a pod (or several pods), you should see an output similar to:
Global device count: 4
This indicates that JAX successfully detected the available TPU devices (the exact count you see depends on your slice topology, the number of chips exposed per VM, and whether the processes have joined a single multi-host JAX runtime). Here's what to expect overall:
- Pod Deployment: Four child Jobs will be created, each with 2 pods.
- Network and Communication: Automatic headless service creation ensures that pods can communicate with one another using predictable hostnames.
- Failure Handling: If a pod or job fails, JobSet’s failure policy will attempt up to 3 restarts for the entire JobSet.
- Exclusive Topology: Each child Job is scheduled exclusively on its designated TPU slice, optimizing intra-slice communication.
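When you are done, deleting the JobSet cascades to its child Jobs, pods, and the headless service it manages:

kubectl delete jobset multislice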
Conclusion
JobSet simplifies the orchestration of complex, distributed training workloads by extending Kubernetes’ native Job API. Whether you are training large language models or running high-performance computing jobs, JobSet’s unified API, automatic service management, and topology-aware scheduling offer a streamlined experience for ML engineers.
By following this demo, you should now have a clear idea of how to deploy and monitor a distributed JobSet using Kubernetes. Enjoy scaling your ML workloads efficiently!
Collabnix wishes you happy ML training and happy scheduling!