Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Running GPU Workloads on Kubernetes with NVIDIA: A Step-by-Step Guide


In the age of Artificial Intelligence and Machine Learning, computing capabilities are being pushed to their limits. From deep learning to processing complex simulations, the demand for high-performance computing resources like Graphics Processing Units (GPUs) has never been greater. This necessity is particularly apparent in environments running on Kubernetes, where orchestrating and scaling workloads efficiently is paramount.

However, running GPU-accelerated workloads in Kubernetes isn’t straightforward. You must account for driver compatibility and resource scheduling, and make sure the GPUs are actually well utilized. Misconfigurations can lead to poor performance or failed deployments, which delay work and waste resources. The challenge grows when scaling across multiple nodes or deploying in cloud environments. This guide walks through the entire setup process for NVIDIA GPUs on Kubernetes.

This guide is an end-to-end walkthrough: what GPUs bring to Kubernetes workloads, why that matters, and how to run a deployment that makes full use of your infrastructure. Knowing how to set up and manage GPU workloads lets you use NVIDIA hardware to accelerate your computing tasks.

Prerequisites and Background

Before diving into the practical steps, it’s worth reviewing a few key Kubernetes and GPU concepts. A general understanding of Kubernetes architecture, including pods, nodes, and services, is essential. For further background, check out the official Kubernetes documentation.

You also need a working Kubernetes cluster with at least one node that has a compatible NVIDIA GPU. The commands in this guide assume an Ubuntu or Debian host (they use apt) with Docker as the container runtime. Steps 1 and 2 install the NVIDIA drivers and the NVIDIA Container Toolkit; if they are already present, verify them and move on.

In addition to the hardware and software requirements, ensure that your Kubernetes version supports device plugins, the feature that lets specialized hardware such as GPUs be allocated to your pods. Device plugins were introduced as an alpha feature in Kubernetes 1.8, so any reasonably current cluster qualifies; the Kubernetes documentation on device plugins has the details.
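If you want to check the 1.8 minimum in a script, a small helper can compare a version string. This is a minimal sketch: the function name is hypothetical, and it assumes the usual vMAJOR.MINOR.PATCH form that Kubernetes reports.

```shell
# Hypothetical helper: succeed if a Kubernetes version string (e.g. "v1.27.3")
# is at least 1.8, the release that introduced device plugins.
k8s_supports_device_plugins() {
  version="${1#v}"              # drop the leading "v"
  major="${version%%.*}"        # text before the first dot
  rest="${version#*.}"          # text after the first dot
  minor="${rest%%[!0-9]*}"      # leading digits of the minor component
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 8 ]; }
}

# Against a live cluster, the kubelet version of the first node can be used:
#   v=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')
#   k8s_supports_device_plugins "$v" && echo "device plugins supported"
```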

Step 1: Installing NVIDIA Drivers on Your Nodes


sudo apt update
sudo apt install -y nvidia-driver-520
sudo reboot
nvidia-smi

The first step is to install the NVIDIA drivers on each GPU node. The drivers sit between the GPU hardware and the operating system, allowing software such as the container runtime on a Kubernetes node to use the GPU. Begin by refreshing your package index with sudo apt update so that you fetch the latest available versions.

Proceed to install the NVIDIA driver using sudo apt install -y nvidia-driver-520. This command installs the 520 driver branch; whether that branch suits your setup depends on your GPU model and the CUDA version your workloads need. Consult the NVIDIA CUDA Toolkit documentation to confirm that the driver is compatible with your hardware and CUDA version.

After installation, reboot so that the new kernel modules are loaded. The command sudo reboot restarts the machine, so save your work first. Once the machine is back up, validate the installation by running nvidia-smi. This utility reports the driver version, running processes, and GPU utilization, confirming that the drivers are installed and functioning.
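The nvidia-smi check above can also be scripted, which is handy when you have several nodes to validate. The sketch below is an illustration only: the helper name check_driver_min is hypothetical, and it reads the CSV that nvidia-smi prints in query mode.

```shell
# Hypothetical helper: read "name, driver_version" CSV lines on stdin, print
# one summary line per GPU, and fail if no GPU is listed or if any GPU
# reports a driver major version below the given minimum.
check_driver_min() {
  min_major="$1"
  awk -F', *' -v min="$min_major" '
    { split($2, v, "."); printf "%s: driver %s\n", $1, $2
      if (v[1] + 0 < min) bad = 1 }
    END { exit (bad || NR == 0) }
  '
}

# On a GPU node:
#   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
#     | check_driver_min 520
```

A non-zero exit status means either that no GPU was reported or that a driver is older than the minimum you passed in.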

Step 2: Setting Up NVIDIA Container Toolkit


# Setting up the package repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) 
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Installing NVIDIA container toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Once the drivers are working, the next component is the NVIDIA Container Toolkit, which lets Docker, and by extension Kubernetes, run containers with GPU access. We begin by setting up the NVIDIA Docker package repository, since the toolkit isn’t available from the default repositories. The first command reads /etc/os-release and stores your distribution ID and version (for example, ubuntu20.04) in $distribution, so there is nothing to fill in by hand.

Using curl, fetch the repository’s GPG key, which apt uses to verify the packages you install, and pipe it into sudo apt-key add - to add it to your keyring. Note that apt-key is deprecated on recent Debian and Ubuntu releases, so expect a warning there. Then download the NVIDIA Docker list file, which tells the package manager where the new repository lives.

Update your package index with sudo apt-get update so that the package manager can see the new repository, then install the toolkit with sudo apt-get install -y nvidia-container-toolkit. Finally, restart Docker with sudo systemctl restart docker to apply the changes. Verify the setup by running an nvidia/cuda container with docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi to confirm that GPU resources are accessible through the container runtime.
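One more thing worth checking before moving on to Kubernetes: NVIDIA’s device plugin documentation asks for the NVIDIA runtime to be the node’s default container runtime, because Kubernetes does not pass --gpus the way docker run does. The sketch below assumes Docker is the container runtime on the node and that its configuration lives in /etc/docker/daemon.json; the helper name is hypothetical.

```shell
# Hypothetical helper: succeed if the given Docker daemon config sets the
# NVIDIA runtime as the default runtime.
docker_default_runtime_is_nvidia() {
  config="${1:-/etc/docker/daemon.json}"
  [ -r "$config" ] &&
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$config"
}

# On a GPU node:
#   docker_default_runtime_is_nvidia || echo "nvidia is not the default runtime"
```

If the check fails, add "default-runtime": "nvidia" to daemon.json alongside the nvidia entry under "runtimes", then restart Docker again.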

Deploying NVIDIA Device Plugin in Kubernetes

To efficiently manage NVIDIA GPUs within a Kubernetes cluster, deploying the NVIDIA device plugin is crucial. This plugin acts as a bridge between Kubernetes and the GPU resources available on your nodes, ensuring seamless allocation and utilization of GPUs by containerized applications.

The NVIDIA device plugin operates by exposing GPU resources to the Kubernetes scheduler, enabling it to be aware of GPU availability and allocate these resources accordingly. To start, make sure you have a Kubernetes cluster running with nodes that have NVIDIA drivers installed, as covered earlier. If you’re new to Kubernetes, consider reviewing the basics on Collabnix’s extensive Kubernetes tutorials.

Installation Steps

First, ensure you have access to a Kubernetes cluster. You can verify your cluster access by running:

kubectl cluster-info

Next, deploy the NVIDIA device plugin using the following command, which applies a pre-configured DaemonSet manifest from NVIDIA’s GitHub repository:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

Verify the deployment by checking the status of the DaemonSet. It should show the plugin pods running on each node with GPU resources:

kubectl get ds -n kube-system

The output should list the NVIDIA device plugin DaemonSet with its READY count matching its DESIRED count. If any pods are not running, check their logs with:

kubectl logs [POD_NAME] -n kube-system
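A running DaemonSet is only half the story; the real confirmation is that nodes now advertise nvidia.com/gpu as an allocatable resource. The following sketch summarizes that per node. The helper name summarize_gpu_nodes is hypothetical; the kubectl invocation in the comment uses standard custom-columns output.

```shell
# Hypothetical helper: read "NODE GPU" rows (no header) on stdin and print
# the nodes that advertise at least one GPU, plus a cluster-wide total.
summarize_gpu_nodes() {
  awk '
    $2 ~ /^[0-9]+$/ && $2 > 0 { printf "%s %d\n", $1, $2; total += $2 }
    END { printf "total %d\n", total }
  '
}

# Against a live cluster (nodes without GPUs show <none> in the GPU column):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | summarize_gpu_nodes
```

A total of 0 means the plugin has not registered any GPUs yet, which usually points back to the driver or container runtime setup.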

Configuring GPU Resource Limits in Kubernetes

With the NVIDIA device plugin installed, the next step is to configure your workloads to use the GPUs. Kubernetes manages resources through requests and limits, but GPUs follow stricter rules than CPU and memory: you specify them under limits, and if you also set a request it must equal the limit. You need to define GPU resource limits within your Pod specifications.

Resource Configuration Example

Below is a YAML configuration for a Kubernetes Pod that requests a GPU resource:

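apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure # nvidia-smi exits, so do not restart on success
  containers:
  - name: gpu-container
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi"] # print the GPUs visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU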

In this configuration, the key line is nvidia.com/gpu: 1, which asks for one GPU. GPUs are requested in whole numbers, and the scheduler places the Pod only on a node that has that many GPUs free; the GPU stays assigned to the Pod for as long as it runs.
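If you create many such Pods, generating the manifest from a few parameters keeps the GPU count explicit. This is a minimal sketch rather than a recommended tool: the helper name render_gpu_pod and its arguments are hypothetical, and the generated Pod simply runs nvidia-smi, so the image you pass in is assumed to be a CUDA image that ships it.

```shell
# Hypothetical helper: print a Pod manifest that requests COUNT GPUs.
# GPUs are only requestable in whole numbers, so anything else is rejected.
render_gpu_pod() {
  name="$1"; image="$2"; count="$3"
  case "$count" in
    ''|*[!0-9]*) echo "GPU count must be a whole number" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: OnFailure
  containers:
  - name: ${name}
    image: ${image}
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: ${count}
EOF
}

# Usage: render_gpu_pod gpu-pod-2 <your-cuda-image> 2 | kubectl apply -f -
```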

Further details about the Kubernetes resource requirements can be found in the official Kubernetes documentation.

Testing GPU Workloads with Sample Applications

Testing is a pivotal part of ensuring that your GPUs are functioning as expected in a Kubernetes environment. A common application for benchmarking and testing GPUs is the CUDA-enabled N-body simulation. Let’s see how you can deploy this application.

Step-by-Step Deployment

Begin by creating a YAML file for the simulation. Here is a sample configuration:

apiVersion: v1
kind: Pod
metadata:
  name: nbody-simulation
spec:
  restartPolicy: OnFailure # the benchmark exits when done; do not restart it
  containers:
  - name: nbody-container
    image: nvidia/samples:nbody-cuda11.4
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["/bin/bash", "-c", "nbody -benchmark"]

Create and run the Pod with:

kubectl apply -f nbody-simulation.yaml

After deployment, check the simulation output:

kubectl logs nbody-simulation

The logs should show the benchmark results, including throughput figures such as GFLOP/s. To watch GPU utilization while the benchmark runs, use nvidia-smi on the node.
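To compare runs across nodes or driver versions, it helps to pull a single number out of the logs. The sketch below assumes the sample prints its usual summary line of the form "= N single-precision GFLOP/s at 20 flops per interaction"; the helper name is hypothetical, and the pattern may need adjusting if your image prints something different.

```shell
# Hypothetical helper: print the GFLOP/s figure from nbody benchmark output
# read on stdin; fail if no such line is present.
nbody_gflops() {
  awk '/GFLOP\/s/ { print $2; found = 1 } END { exit !found }'
}

# Usage: kubectl logs nbody-simulation | nbody_gflops
```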

Best Practices for Managing GPU Resources

Efficient management of GPU resources is vital to maintaining a high-performance environment. Here are some best practices:

  • Monitor Utilization: Regularly monitor GPU utilization, for example with Prometheus and Grafana dashboards fed by NVIDIA’s DCGM exporter, to identify underutilized or overutilized GPUs.
  • Right-Sizing: Appropriately size your containers to align with GPU workloads, revising resource quotas to optimize costs and performance.
  • Scheduling Policies: Utilize specific scheduling policies and taints/tolerations to ensure only necessary workloads are placed on GPU nodes, freeing up resources for intended applications.
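The taints and tolerations practice above can be sketched as follows. The helper name, the accelerator=nvidia label, and the taint value "present" are all illustrative choices, not conventions the plugin requires. Setting KUBECTL=echo prints the commands instead of running them.

```shell
# Hypothetical helper: label a node as a GPU node and taint it so that only
# Pods with a matching toleration are scheduled there.
# Run with KUBECTL=echo to preview the commands without touching the cluster.
reserve_gpu_node() {
  node="$1"
  ${KUBECTL:-kubectl} label nodes "$node" accelerator=nvidia --overwrite &&
    ${KUBECTL:-kubectl} taint nodes "$node" nvidia.com/gpu=present:NoSchedule --overwrite
}

# GPU Pods then need a matching toleration in their spec:
#   tolerations:
#   - key: nvidia.com/gpu
#     operator: Exists
#     effect: NoSchedule
```

With the taint in place, ordinary Pods without the toleration are kept off the GPU node, leaving its capacity for GPU workloads.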

Solutions for Common Errors and Troubleshooting Tips

Managing GPU workloads comes with its own set of challenges. Let’s dive into some common issues and resolutions:

Error 1: GPU not Recognized

Solution: Ensure NVIDIA drivers are up-to-date. Use nvidia-smi on the host to verify driver installation. Redeploy the NVIDIA plugin if issues persist.

Error 2: Insufficient GPU Memory

Solution: Adjust your application to utilize less memory. It’s also advisable to monitor GPU memory consumption via nvidia-smi and optimize your resource allocation strategy.
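For the memory check, nvidia-smi can print used and total memory in a machine-readable form. The sketch below turns that into a per-GPU percentage; the helper name and the 90 percent default threshold are assumptions chosen for illustration.

```shell
# Hypothetical helper: read "index, used_MiB, total_MiB" rows on stdin and
# flag GPUs whose memory use is at or above a threshold percentage.
gpu_memory_report() {
  threshold="${1:-90}"
  awk -F', *' -v limit="$threshold" '
    { pct = ($3 > 0) ? int($2 * 100 / $3) : 0
      printf "GPU %s: %d%% used%s\n", $1, pct, (pct >= limit ? " (high)" : "") }
  '
}

# On a GPU node:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpu_memory_report 90
```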

Error 3: Pod Stuck in Pending State

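Solution: Check the availability of GPU resources. Running kubectl describe pod [POD_NAME] shows the scheduler’s reason in the Events section, and kubectl describe node [NODE_NAME] shows how many nvidia.com/gpu resources are allocatable and already allocated. If none are free, redistribute workloads or add GPU capacity to the cluster.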
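When a Pod is Pending for lack of GPUs, the scheduler’s event message typically contains "Insufficient nvidia.com/gpu". A small filter, sketched here with a hypothetical helper name, separates that case from other causes of a Pending Pod.

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and report
# whether the scheduler rejected nodes for lack of GPUs.
diagnose_pending() {
  if grep -q 'Insufficient nvidia.com/gpu'; then
    echo "no node has enough free GPUs"
  else
    echo "no GPU shortage reported; check taints, node selectors, and other resources"
  fi
}

# Usage: kubectl describe pod gpu-pod | diagnose_pending
```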

Error 4: Plugin Pods CrashLoopBackOff

Solution: Inspect logs using kubectl logs [POD_NAME] -n kube-system to identify the cause. Often, this issue arises from misconfigurations in the device plugin.

Performance Optimization Tips

To ensure peak performance when using GPUs in Kubernetes, consider these optimization tips:

  • GPU Auto-Scaling: Adaptively scale your workloads based on GPU demand using Kubernetes auto-scaling features paired with custom metrics for efficient resource allocation.
  • Use of Specialized Libraries: Incorporate libraries such as cuDNN and NCCL to optimize computational loads on GPUs, which can significantly speed up machine learning tasks.
  • Parallel Execution: Design applications to leverage parallel processing capabilities inherent to GPUs, maximizing throughput and efficiency.
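The auto-scaling tip above can be sketched with a HorizontalPodAutoscaler driven by a per-pod GPU metric. Nearly everything specific here is an assumption: the helper name, the Deployment name you pass in, and especially the metric name gpu_utilization, which only works if a custom metrics adapter (for example, one backed by Prometheus) exposes a metric under that name in your cluster.

```shell
# Hypothetical helper: print an HPA manifest that scales a Deployment on an
# assumed per-pod custom metric called "gpu_utilization".
render_gpu_hpa() {
  deployment="$1"; max_replicas="$2"; target="$3"
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${deployment}-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${deployment}
  minReplicas: 1
  maxReplicas: ${max_replicas}
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization   # assumed name; depends on your metrics adapter
      target:
        type: AverageValue
        averageValue: "${target}"
EOF
}

# Usage: render_gpu_hpa inference 4 80 | kubectl apply -f -
```

Keep in mind that each new replica still needs a free GPU, so maxReplicas should not exceed what the cluster can actually schedule.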

Architecture Deep Dive

Under the hood, the integration of NVIDIA GPUs with Kubernetes relies on a few cooperating components. The DaemonSet ensures that the NVIDIA device plugin runs on each GPU-capable node, where it discovers the GPUs through the installed driver.

The plugin registers with each node’s kubelet over gRPC through the device plugin API rather than talking to the scheduler directly. The kubelet asks the plugin for its list of available GPUs and advertises them as the nvidia.com/gpu resource in the node’s status, which lets kube-scheduler bind Pods that request GPUs to nodes with free capacity. When such a Pod starts, the kubelet asks the plugin to allocate specific devices, and the NVIDIA container runtime makes them visible inside the container.

This architecture allows for flexibility and scalability, maintaining a consistent GPU environment even as clusters evolve. It abstracts away the complexity of direct hardware management in favor of manageable, software-defined resources that can be configured with Kubernetes-native tools.
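Device plugins talk to the kubelet through Unix sockets in /var/lib/kubelet/device-plugins, next to the kubelet’s own kubelet.sock. Listing that directory on a node is a quick way to see whether a plugin has come up. The helper below is a sketch with a hypothetical name; it checks for file existence only, so it says nothing about whether the plugin is healthy.

```shell
# Hypothetical helper: list device plugin sockets that sit next to the
# kubelet's own socket. Uses -e rather than -S so it can be tried on a
# scratch directory as well as on a real node.
list_device_plugin_sockets() {
  dir="${1:-/var/lib/kubelet/device-plugins}"
  [ -e "$dir/kubelet.sock" ] || { echo "kubelet socket not found in $dir" >&2; return 1; }
  for sock in "$dir"/*.sock; do
    [ "${sock##*/}" = "kubelet.sock" ] || echo "${sock##*/}"
  done
}

# On a GPU node (may require root): list_device_plugin_sockets
```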

Conclusion

In this guide, we’ve covered the intricate steps necessary to deploy and utilize NVIDIA GPUs within a Kubernetes environment. From setting up the toolkit and device plugins, to deploying GPU workloads and optimizing performance, each step is pivotal for seamless integration. By following best practices and resolving common errors, Kubernetes users can significantly enhance their application performance via GPU acceleration. As you continue your journey, remember that keeping abreast of the latest cloud-native technologies and guidelines is key to staying ahead in the fast-evolving field of GPU computing.

