Join our Discord Server
Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

GPU Allocation in Kubernetes: A Comprehensive Guide


Understanding GPU Allocation in Kubernetes

Understanding how Kubernetes allocates GPUs to workloads is crucial for anyone working with AI/ML applications or high-performance computing. This comprehensive guide explores the intricate mechanisms behind GPU allocation in Kubernetes, from the device plugin framework to the complete allocation process.

Overview: The GPU Allocation Challenge

Traditional Kubernetes was designed for CPU and memory resources, which are divisible and can be shared easily. GPUs, however, present unique challenges:

  • Indivisible Resources: By default, GPUs are allocated as whole units
  • Vendor-Specific Drivers: Each GPU vendor requires specific drivers and runtime configurations
  • Specialized Hardware: GPUs need vendor-specific initialization and setup
  • Complex Runtime Integration: Container runtimes need special configuration to access GPU hardware

The Device Plugin Framework: Foundation of GPU Allocation

Kubernetes solves GPU allocation through the Device Plugin Framework, which allows vendors to extend Kubernetes without modifying core code.

Architecture Components

[Diagram: GPU allocation and resource management in Kubernetes]

Device Plugin Registration Process

When a GPU device plugin starts, it follows a specific registration process:

# Device Plugin DaemonSet Example
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        - name: DEVICE_LIST_STRATEGY
          value: "envvar"
        - name: DEVICE_ID_STRATEGY
          value: "uuid"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      nodeSelector:
        accelerator: nvidia

Step-by-Step GPU Allocation Process

Step 1: Device Discovery and Registration

The device plugin performs the following operations:

// Simplified Device Plugin Registration (Go code)
// Imports assumed: context, google.golang.org/grpc, and the kubelet
// device plugin API package (pluginapi).
func (dp *NvidiaDevicePlugin) Register() error {
    // Connect to kubelet's device plugin socket
    conn, err := grpc.Dial(
        "unix:///var/lib/kubelet/device-plugins/kubelet.sock",
        grpc.WithInsecure(),
    )
    if err != nil {
        return err
    }
    defer conn.Close()

    // Register with kubelet
    client := pluginapi.NewRegistrationClient(conn)
    request := &pluginapi.RegisterRequest{
        Version:      pluginapi.Version,
        Endpoint:     "nvidia.sock",
        ResourceName: "nvidia.com/gpu",
    }
    
    _, err = client.Register(context.Background(), request)
    return err
}

Step 2: Resource Advertisement

The device plugin advertises available GPUs to kubelet:

// ListAndWatch reports available devices
func (dp *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
    devices := dp.getDevices()
    
    response := &pluginapi.ListAndWatchResponse{
        Devices: devices,
    }
    
    if err := s.Send(response); err != nil {
        return err
    }
    
    // Continue monitoring: dp.health receives a signal whenever
    // device health changes, so resend the updated device list
    for range dp.health {
        devices := dp.getDevices()
        response := &pluginapi.ListAndWatchResponse{
            Devices: devices,
        }
        if err := s.Send(response); err != nil {
            return err
        }
    }
    return nil
}

func (dp *NvidiaDevicePlugin) getDevices() []*pluginapi.Device {
    var devices []*pluginapi.Device
    
    // Use NVML to discover GPUs
    count, err := nvml.DeviceGetCount()
    if err != nil {
        return devices
    }
    
    for i := 0; i < count; i++ {
        device, err := nvml.DeviceGetHandleByIndex(i)
        if err != nil {
            continue
        }
        
        uuid, err := device.GetUUID()
        if err != nil {
            continue
        }
        
        devices = append(devices, &pluginapi.Device{
            ID:     uuid,
            Health: pluginapi.Healthy,
        })
    }
    
    return devices
}

Step 3: Pod Specification and Scheduling

When a user creates a pod with GPU requirements:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  labels:
    app: ai-training
spec:
  containers:
  - name: training-container
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 2  # Request 2 GPUs
        memory: 16Gi
        cpu: 8
      requests:
        nvidia.com/gpu: 2  # Must equal limits for GPUs
        memory: 8Gi
        cpu: 4
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "all"
  nodeSelector:
    accelerator: nvidia-tesla-v100
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Important GPU Resource Constraints:

  • GPUs must be specified in the limits section
  • requests must equal limits if both are specified
  • GPU resources are only allocated as integers (no fractional GPUs by default)
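These constraints can be mirrored in a few lines of Go. The sketch below is illustrative validation logic under simplified assumptions (plain string quantities instead of Kubernetes `resource.Quantity`); it is not the actual API-server code:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// gpuSpec mirrors the GPU-relevant part of a container's resources.
// Quantities are simplified to plain numeric strings for illustration.
type gpuSpec struct {
	limit   string // value of limits["nvidia.com/gpu"], "" if unset
	request string // value of requests["nvidia.com/gpu"], "" if unset
}

// validateGPUSpec applies the three constraints listed above:
// GPUs must appear in limits, requests (if set) must equal limits,
// and only whole GPUs may be requested.
func validateGPUSpec(s gpuSpec) error {
	if s.limit == "" {
		return fmt.Errorf("nvidia.com/gpu must be specified in limits")
	}
	v, err := strconv.ParseFloat(s.limit, 64)
	if err != nil {
		return fmt.Errorf("invalid quantity %q", s.limit)
	}
	if v != math.Trunc(v) {
		return fmt.Errorf("fractional GPU %q not allowed", s.limit)
	}
	if s.request != "" && s.request != s.limit {
		return fmt.Errorf("requests (%s) must equal limits (%s)", s.request, s.limit)
	}
	return nil
}

func main() {
	fmt.Println(validateGPUSpec(gpuSpec{limit: "2", request: "2"})) // <nil>
	fmt.Println(validateGPUSpec(gpuSpec{limit: "0.5"}))             // fractional GPU rejected
	fmt.Println(validateGPUSpec(gpuSpec{limit: "2", request: "1"})) // mismatch rejected
}
```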

Step 4: Kubernetes Scheduler Decision

The scheduler performs resource matching:

# Scheduler evaluates nodes based on:
# 1. Available nvidia.com/gpu resources
# 2. Node selectors and affinity rules
# 3. Taints and tolerations
# 4. Resource constraints

# Node capacity after device plugin registration
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-v100
    gpu-count: "8"
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
status:
  capacity:
    nvidia.com/gpu: "8"      # Total GPUs on node
    cpu: "64"
    memory: "256Gi"
  allocatable:
    nvidia.com/gpu: "8"      # Available for scheduling
    cpu: "62"
    memory: "250Gi"
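The matching logic in this step can be sketched as follows. This is a simplified stand-in for the scheduler's filtering phase, using plain structs in place of real Node objects; the field names are illustrative:

```go
package main

import "fmt"

// node captures the fields the scheduler checks for GPU placement,
// simplified from the Node object above.
type node struct {
	name           string
	labels         map[string]string
	allocatableGPU int
}

// fits reports whether a pod requesting `gpus` GPUs with the given
// nodeSelector can land on n: every selector label must match and
// enough nvidia.com/gpu must be allocatable.
func fits(n node, gpus int, selector map[string]string) bool {
	for k, v := range selector {
		if n.labels[k] != v {
			return false
		}
	}
	return n.allocatableGPU >= gpus
}

func main() {
	n := node{
		name:           "gpu-node-1",
		labels:         map[string]string{"accelerator": "nvidia-tesla-v100"},
		allocatableGPU: 8,
	}
	fmt.Println(fits(n, 2, map[string]string{"accelerator": "nvidia-tesla-v100"})) // true
	fmt.Println(fits(n, 16, nil))                                                 // false: not enough GPUs
}
```

The real scheduler also weighs taints/tolerations and affinity rules, but the core resource check is exactly this comparison against `allocatable`.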

Step 5: Kubelet Device Allocation

When kubelet needs to create a container with GPU resources:

// Allocate method called by kubelet
func (dp *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    var responses []*pluginapi.ContainerAllocateResponse
    
    for _, req := range reqs.ContainerRequests {
        response := &pluginapi.ContainerAllocateResponse{}
        
        // Device-specific preparations
        for _, deviceID := range req.DevicesIDs {
            // Add device to container
            response.Devices = append(response.Devices, &pluginapi.DeviceSpec{
                ContainerPath: fmt.Sprintf("/dev/nvidia%d", getDeviceIndex(deviceID)),
                HostPath:      fmt.Sprintf("/dev/nvidia%d", getDeviceIndex(deviceID)),
                Permissions:   "rwm",
            })
        }
        
        // Set environment variables for NVIDIA runtime
        response.Envs = map[string]string{
            "NVIDIA_VISIBLE_DEVICES": strings.Join(req.DevicesIDs, ","),
            "NVIDIA_DRIVER_CAPABILITIES": "compute,utility",
        }
        
        // Mount driver directories
        response.Mounts = append(response.Mounts, &pluginapi.Mount{
            ContainerPath: "/usr/local/nvidia",
            HostPath:      "/usr/local/nvidia",
            Readonly:      true,
        })
        
        responses = append(responses, response)
    }
    
    return &pluginapi.AllocateResponse{
        ContainerResponses: responses,
    }, nil
}

Step 6: Container Runtime Integration

The container runtime (containerd/CRI-O) must be configured with GPU support:

# /etc/containerd/config.toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

Step 7: Final Container Creation

The complete flow results in a container with GPU access:

# Inside the container, GPUs are visible
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    54W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1F.0 Off |                    0 |
| N/A   34C    P0    53W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Advanced GPU Allocation Scenarios

Time-Slicing for GPU Sharing

Modern device plugins support GPU sharing through time-slicing:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Each GPU appears as 4 shareable resources
---
# Select this config per node by labeling it with the ConfigMap data key
# ("any" above), e.g.:
#   kubectl label node gpu-node-1 nvidia.com/device-plugin.config=any
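The effect of `replicas: 4` can be sketched in Go: the plugin advertises each physical GPU several times, so kubelet sees more schedulable `nvidia.com/gpu` devices than there are chips. The `uuid::index` naming below is illustrative, not the plugin's exact format:

```go
package main

import "fmt"

// advertise expands physical GPU UUIDs into the replicated device list
// a time-slicing-enabled plugin reports to kubelet. With replicas=4,
// each physical GPU appears as 4 schedulable devices.
func advertise(uuids []string, replicas int) []string {
	var devices []string
	for _, u := range uuids {
		for i := 0; i < replicas; i++ {
			devices = append(devices, fmt.Sprintf("%s::%d", u, i))
		}
	}
	return devices
}

func main() {
	devs := advertise([]string{"GPU-aaa", "GPU-bbb"}, 4)
	fmt.Println(len(devs)) // 8 schedulable devices from 2 physical GPUs
}
```

Note that time-slicing provides no memory or fault isolation between the pods sharing a GPU; it only multiplies the advertised count.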

Multi-Instance GPU (MIG) Support

For NVIDIA A100 GPUs, MIG enables hardware partitioning:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            1g.5gb: 7  # Create 7 MIG instances per GPU
      all-2g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            2g.10gb: 3  # Create 3 larger MIG instances per GPU
---
# Pod requesting MIG instance
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tensorflow:23.02-tf2-py3
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # Request specific MIG slice

Dynamic Resource Allocation (DRA) – Future Direction

DRA represents the next evolution of GPU allocation:

# DRA enables more flexible resource allocation
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: nvidia-gpu
      selectors:
      - cel:
          expression: 'device.attributes["memory"] >= 16000'  # 16GB minimum
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-workload
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: gpu-claim
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      claims:
      - name: gpu-claim
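The selector in the claim can be approximated in plain Go. This sketch filters a hypothetical device inventory the way the CEL expression would, without a real CEL evaluator:

```go
package main

import "fmt"

// device models the attributes a DRA driver might publish; the filter
// below mimics the claim's CEL expression ("memory >= 16000").
type device struct {
	name     string
	memoryMB int
}

// selectDevices returns the devices satisfying the memory constraint,
// playing the role of the selector evaluation during allocation.
func selectDevices(devs []device, minMemoryMB int) []device {
	var out []device
	for _, d := range devs {
		if d.memoryMB >= minMemoryMB {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	devs := []device{
		{"gpu-0", 16384}, // 16GB card: matches
		{"gpu-1", 8192},  // smaller card: filtered out
	}
	for _, d := range selectDevices(devs, 16000) {
		fmt.Println(d.name)
	}
}
```

DRA was still an alpha API at the time of writing, so field names and the exact attribute schema may change between Kubernetes releases.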

GPU Allocation State Management

Health Monitoring and Recovery

The device plugin continuously monitors GPU health:

func (dp *NvidiaDevicePlugin) healthCheck() {
    for {
        select {
        case <-time.After(30 * time.Second):
            unhealthyDevices := dp.checkDeviceHealth()
            if len(unhealthyDevices) > 0 {
                dp.notifyUnhealthyDevices(unhealthyDevices)
            }
        case <-dp.stop:
            return
        }
    }
}

func (dp *NvidiaDevicePlugin) checkDeviceHealth() []string {
    var unhealthy []string
    
    count, err := nvml.DeviceGetCount()
    if err != nil {
        return unhealthy
    }
    
    for i := 0; i < count; i++ {
        device, err := nvml.DeviceGetHandleByIndex(i)
        if err != nil {
            continue
        }
        
        // Check various health indicators
        if _, err := device.GetMemoryInfo(); err != nil {
            uuid, _ := device.GetUUID()
            unhealthy = append(unhealthy, uuid)
            continue
        }
        
        // Additional health checks...
        temperature, err := device.GetTemperature(nvml.TEMPERATURE_GPU)
        if err != nil || temperature > 90 { // Temperature threshold
            uuid, _ := device.GetUUID()
            unhealthy = append(unhealthy, uuid)
        }
    }
    
    return unhealthy
}

Resource Cleanup and Deallocation

When a pod is deleted, resources are automatically freed:

# Pod deletion triggers cleanup
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  finalizers:
  - gpu-cleanup.example.com/finalizer
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1
---
# The device plugin automatically:
# 1. Receives notification of pod deletion
# 2. Releases GPU from allocation list
# 3. Makes GPU available for new allocations
# 4. Updates node capacity information
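The allocate/release bookkeeping described above can be sketched as a simple device pool. This mirrors the idea behind kubelet's device manager state, not its actual code:

```go
package main

import "fmt"

// pool tracks which device IDs are allocated to which pod.
type pool struct {
	free      []string
	allocated map[string][]string // pod name -> device IDs
}

func newPool(ids []string) *pool {
	return &pool{free: ids, allocated: map[string][]string{}}
}

// allocate hands n devices to pod, or fails if the pool is exhausted.
func (p *pool) allocate(pod string, n int) error {
	if len(p.free) < n {
		return fmt.Errorf("want %d GPUs, only %d free", n, len(p.free))
	}
	ids := append([]string(nil), p.free[:n]...) // copy to avoid aliasing
	p.allocated[pod] = ids
	p.free = p.free[n:]
	return nil
}

// release returns a deleted pod's devices to the free list.
func (p *pool) release(pod string) {
	p.free = append(p.free, p.allocated[pod]...)
	delete(p.allocated, pod)
}

func main() {
	p := newPool([]string{"GPU-a", "GPU-b", "GPU-c", "GPU-d"})
	if err := p.allocate("gpu-workload", 1); err != nil {
		panic(err)
	}
	fmt.Println(len(p.free)) // 3
	p.release("gpu-workload")
	fmt.Println(len(p.free)) // 4
}
```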

Troubleshooting GPU Allocation Issues

Common Allocation Problems

# Check node GPU capacity
kubectl describe node gpu-node-1 | grep nvidia.com/gpu

# Check device plugin status
kubectl get pods -n kube-system | grep nvidia-device-plugin

# Check device plugin logs
kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx

# Verify GPU drivers on node
kubectl exec -it debug-pod -- nvidia-smi

# Check container runtime configuration on the node itself
# (kubectl debug mounts the node's filesystem at /host)
kubectl debug node/gpu-node-1 -it --image=busybox -- cat /host/etc/containerd/config.toml

Debugging Allocation Failures

# Debug pod for GPU troubleshooting
apiVersion: v1
kind: Pod
metadata:
  name: gpu-debug
spec:
  containers:
  - name: debug
    image: nvidia/cuda:11.8.0-devel-ubuntu22.04
    command: ["/bin/bash", "-c", "sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
  nodeSelector:
    accelerator: nvidia
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Performance Optimization

# Optimize GPU allocation with node affinity
apiVersion: v1
kind: Pod
metadata:
  name: optimized-gpu-workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-memory
            operator: Gt
            values: ["32000"]  # Require >32GB GPU memory
          - key: gpu-generation
            operator: In
            values: ["ampere", "hopper"]  # Modern GPU architectures
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        cpu: 32
        memory: 128Gi
    env:
    - name: NCCL_DEBUG
      value: "INFO"
    - name: CUDA_DEVICE_ORDER
      value: "PCI_BUS_ID"

Best Practices for GPU Allocation

1. Resource Planning

# Use resource quotas to manage GPU allocation
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-workloads
spec:
  hard:
    nvidia.com/gpu: "16"  # Limit total GPU usage
    requests.memory: "512Gi"
    requests.cpu: "128"

2. Monitoring and Alerting

# Prometheus rule for GPU utilization
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
  - name: gpu
    rules:
    - alert: LowGPUUtilization
      expr: DCGM_FI_DEV_GPU_UTIL < 20
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "GPU utilization is low"
        description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} has utilization below 20% for 10 minutes"

3. Security Considerations

# Secure GPU workload with Pod Security Standards
apiVersion: v1
kind: Pod
metadata:
  name: secure-gpu-workload
  labels:
    pod-security.kubernetes.io/enforce: restricted
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: secure-training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 8Gi
        cpu: 4

Conclusion

GPU allocation in Kubernetes is a sophisticated process involving multiple components working in harmony:

  1. Device Plugin Framework provides the foundation for GPU discovery and allocation
  2. Registration Process establishes communication between device plugins and kubelet
  3. Resource Advertisement makes GPUs available as schedulable resources
  4. Scheduler Integration matches workloads with appropriate GPU-enabled nodes
  5. Runtime Configuration ensures containers can access allocated GPUs
  6. Health Monitoring maintains resource availability and reliability

Understanding this mechanism is crucial for:

  • Optimizing GPU utilization in Kubernetes clusters
  • Troubleshooting allocation issues effectively
  • Implementing advanced features like GPU sharing and dynamic allocation
  • Ensuring security and reliability of GPU workloads

As Kubernetes continues to evolve, features like Dynamic Resource Allocation (DRA) will provide even more flexibility in GPU resource management, making it easier to efficiently utilize these expensive and powerful resources in cloud-native environments.


Key Takeaways:

  • GPU allocation relies on the device plugin framework for vendor extensibility
  • The process involves kubelet, device plugins, container runtime, and GPU drivers
  • Proper configuration at each layer is essential for successful GPU allocation
  • Modern features like time-slicing and MIG enable better resource utilization
  • Monitoring and security considerations are crucial for production deployments

Have Queries? Join https://launchpass.com/collabnix
