Slurm is a job scheduler commonly used to manage high-performance computing (HPC) workloads. Kubernetes is a container orchestration platform for deploying and managing containerized applications. The two can be integrated to provide a unified platform for managing machine learning workloads.
Slurm provides a number of features that are beneficial for managing machine learning workloads, including:
- Resource management: Slurm can manage the allocation of computing resources, such as CPUs, GPUs, and memory, to machine learning jobs.
- Job scheduling: Slurm can schedule machine learning jobs to the available resources in the cluster, taking into account job priority, resource requirements, and other factors.
- Workload monitoring: Slurm can monitor the execution of machine learning jobs and provide information about their status, resource usage, and performance.
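As a concrete illustration of the resource-management features above, a job script can declare its CPU, GPU, and memory requirements directly through #SBATCH directives. The values below are illustrative, and the GPU request assumes the cluster has GRES configured for GPUs:
#!/bin/bash
#SBATCH --job-name=resource_demo
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2
#SBATCH --mem=32G
# Slurm will only start the job once these resources can be allocated.
python train_model.py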
Slurm is a complex piece of software, but there are a number of resources available to help you learn more about how to use it. The Slurm documentation is a good place to start. You can also find a number of tutorials and blog posts online.
Benefits of Integrating Slurm with Kubernetes
There are several benefits to integrating Slurm with Kubernetes for machine learning workloads:
- Improved resource utilization: Slurm can improve resource utilization by packing jobs onto the available resources in the Kubernetes cluster, backfilling smaller jobs into capacity that would otherwise sit idle.
- Simplified workload management: Slurm provides a unified interface for managing machine learning workloads, regardless of whether they are running on Kubernetes or on traditional HPC resources.
- Increased scalability: Slurm can help to scale machine learning workloads by scheduling them to multiple Kubernetes clusters.
Use Cases
The following are some examples of use cases for integrating Slurm with Kubernetes for machine learning workloads:
- Training large language models: Training large language models can require a significant amount of resources. Slurm can help to ensure that these resources are available when needed and that they are used efficiently.
- Running distributed machine learning jobs: Slurm can help to schedule and manage distributed machine learning jobs across multiple Kubernetes clusters (a multi-node sketch follows this list).
- Running machine learning pipelines: Slurm can help to schedule and manage machine learning pipelines that consist of multiple tasks.
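For the distributed use case, a minimal single-cluster, multi-node sketch looks like the following. It assumes a hypothetical train_distributed.py that reads Slurm's environment variables (such as SLURM_PROCID and SLURM_NTASKS) to set up its process group:
#!/bin/bash
#SBATCH --job-name=distributed_training
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
# srun launches one copy of the script per task in the allocation,
# here one task on each of the two nodes.
srun python train_distributed.py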
Infrastructure Requirements
To integrate Slurm with Kubernetes, you will need the following infrastructure:
- A Kubernetes cluster
- A Slurm cluster
- A network connection between the Kubernetes cluster and the Slurm cluster
Installing Slurm on Kubernetes Cluster
To install Slurm on a Kubernetes cluster, you can use the following steps:
Create a custom resource definition (CRD) for the Slurm cluster:
kubectl create -f slurm-crd.yaml
Install the Slurm Helm chart (helm install requires both a release name and a chart reference; here the chart is assumed to live in a local ./slurm directory):
helm install slurm ./slurm
Wait for the Slurm cluster to be deployed:
kubectl get slurmcluster -n slurm
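To block until the CRD itself has been accepted by the API server, kubectl wait can be used with the standard Established condition for CRDs:
kubectl wait --for=condition=Established --timeout=60s crd/slurmclusters.slurm.dev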
Once the Slurm cluster is deployed, you can start scheduling machine learning workloads to it using the Slurm command-line interface (CLI).
YAML File
The following is an example of a YAML file for a Slurm cluster custom resource definition (CRD):
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: slurmclusters.slurm.dev
spec:
  group: slurm.dev
  names:
    kind: SlurmCluster
    plural: slurmclusters
    singular: slurmcluster
  scope: Namespaced
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              controller:
                type: string
              nodes:
                type: array
                items:
                  type: string
          status:
            type: object
            properties:
              state:
                type: string
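With this CRD installed, a SlurmCluster object conforming to the schema might look like the following (the controller and node names are illustrative):
apiVersion: slurm.dev/v1alpha1
kind: SlurmCluster
metadata:
  name: ml-cluster
  namespace: slurm
spec:
  controller: slurmctld-0
  nodes:
  - slurmd-0
  - slurmd-1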
The following is an example of the Chart.yaml for the Slurm Helm chart:
apiVersion: v2
name: slurm
version: 1.0.0
description: A Helm chart for deploying Slurm on Kubernetes.
dependencies:
  - name: slurm-controller
    version: 1.0.0
    repository: https://charts.helm.sh/stable
  - name: slurm-node
    version: 1.0.0
    repository: https://charts.helm.sh/stable
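Because this chart declares dependencies, they need to be fetched before the chart can be installed. This is the standard Helm workflow, again assuming the chart lives in a local ./slurm directory:
helm dependency update ./slurm
helm install slurm ./slurm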
Scheduling Machine Learning Workloads to Slurm
Once you have installed Slurm on Kubernetes, you can start scheduling machine learning workloads to it using the Slurm command-line interface (CLI).
To schedule a machine learning job to Slurm, you can use the following command:
sbatch slurm_job.sh
The sbatch command will submit the job to Slurm and return a job ID. You can use the job ID to track the status of the job and to obtain information about its resource usage and performance.
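Because later commands take the job ID as an argument, it is convenient to capture it at submission time; sbatch's --parsable flag prints only the job ID:
JOBID=$(sbatch --parsable slurm_job.sh)
squeue -j "$JOBID"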
The following is an example of a Slurm job script:
#!/bin/bash
# Set the job name
#SBATCH --job-name=my_machine_learning_job
# Set the number of nodes and cores per node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
# Set the working directory
#SBATCH --chdir=/path/to/my/workdir
# Load the required modules
module load python/3.8
# Run the machine learning job
python train_model.py
Once you have submitted the job to Slurm, you can monitor its status using the following command:
squeue -u your_username
This command will list all of the jobs that you have submitted to Slurm, along with their status.
You can also obtain information about the resource usage and performance of a job using the following command:
sacct -j job_id
This command will show you the amount of time that the job has been running, the resources that it has been using, and its exit status.
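sacct's default columns are terse; specific fields can be selected with --format, using standard sacct field names such as these:
sacct -j job_id --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode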
Conclusion
Integrating Slurm with Kubernetes can provide a number of benefits for managing machine learning workloads. Slurm can help to improve resource utilization, simplify workload management, and increase scalability.
By following the steps outlined in this blog post, you can start scheduling machine learning workloads to Slurm on Kubernetes today.