Collabnix Team
The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience across industries and technical domains.

Kubernetes GPU Resource Management Best Practices: Complete Technical Guide for 2025


As artificial intelligence and machine learning workloads continue to dominate modern computing infrastructure, efficiently managing GPU resources in Kubernetes clusters has become critical for organizations looking to maximize performance while controlling costs. With GPU acceleration providing 10-100x performance improvements over CPU-only processing and 48% of organizations now using Kubernetes for AI/ML workloads, implementing proper GPU resource management practices is essential for production-ready infrastructure.

This comprehensive guide covers the latest best practices for managing NVIDIA GPUs in multi-node Kubernetes clusters, including installation, configuration, optimization, and monitoring strategies validated against official Kubernetes documentation and industry implementations.

GPU Operator Installation and Configuration

Best Practice 1: NVIDIA GPU Operator Deployment

The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labeling, and DCGM-based monitoring.

Prerequisites Setup

Before installing the GPU Operator, ensure your cluster meets these requirements:

# Verify Node Feature Discovery (NFD) status
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

# If output is true, NFD is already running
# If false, NFD will be deployed by the GPU Operator

Installation with Helm

# gpu-operator-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    pod-security.kubernetes.io/enforce: privileged

# Install GPU Operator with Helm

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create namespace with proper security policies
kubectl create namespace gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

# Install GPU Operator with latest version
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version=v25.3.0 \
  --wait \
  --create-namespace

Verification

# Verify all GPU Operator components are running
kubectl get pods -n gpu-operator

# Expected output should include:
# - gpu-operator-*
# - gpu-feature-discovery-*
# - nvidia-container-toolkit-daemonset-*
# - nvidia-dcgm-exporter-*
# - nvidia-device-plugin-daemonset-*
# - nvidia-driver-daemonset-*

Best Practice 2: Custom Configuration for Enterprise Environments

For production environments, customize the GPU Operator deployment:

# gpu-operator-custom-values.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-operator-custom-config
  namespace: gpu-operator
data:
  values.yaml: |
    operator:
      defaultRuntime: containerd
      runtimeClass: nvidia
    driver:
      version: "570.86.15"  # Pin to tested driver version
      repository: nvcr.io/nvidia
      usePrecompiled: true
    toolkit:
      version: v1.16.1-ubi8
    devicePlugin:
      version: v0.14.5
      config:
        name: ""  # Will be set for time-slicing later
    dcgmExporter:
      version: 3.3.0-3.1.8
      serviceMonitor:
        enabled: true
    migManager:
      enabled: true
      config:
        name: ""
    nodeStatusExporter:
      enabled: true
    gfd:
      version: v0.8.2

# Install with custom configuration
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version=v25.3.0 \
  -f gpu-operator-custom-values.yaml \
  --wait

Node Labeling and GPU Discovery

Best Practice 3: Automated Node Labeling with NFD

As an administrator, you can automatically discover and label all GPU-enabled nodes by deploying Kubernetes Node Feature Discovery (NFD). NFD detects the hardware features available on each node in a Kubernetes cluster.

NFD Configuration for GPU Nodes

# nfd-gpu-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nfd-gpu-config
  namespace: gpu-operator
data:
  nfd-worker.conf: |
    core:
      labelWhiteList: "^feature.node.kubernetes.io/"
    sources:
      pci:
        deviceClassWhitelist:
          - "03"  # Display controllers (GPUs)
          - "12"  # Processing accelerators
        deviceLabelFields:
          - vendor
          - class
          - subsystem_vendor
          - subsystem_device
      custom:
        - name: "nvidia-gpu"
          matchOn:
            - pciId:
                vendor: "10de"  # NVIDIA vendor ID
          labels:
            nvidia.com/gpu: "present"
            nvidia.com/gpu.family: "{{.PCI_DEVICE_ID}}"

Manual Node Labeling for Specific GPU Types

# Label nodes with specific GPU models for targeted scheduling
kubectl label nodes gpu-node-1 \
  accelerator=nvidia-tesla-v100 \
  gpu-memory=32Gi \
  gpu-compute-capability=7.0 \
  nvidia.com/gpu.family=tesla

kubectl label nodes gpu-node-2 \
  accelerator=nvidia-tesla-a100 \
  gpu-memory=80Gi \
  gpu-compute-capability=8.0 \
  nvidia.com/gpu.family=ampere

kubectl label nodes gpu-node-3 \
  accelerator=nvidia-tesla-h100 \
  gpu-memory=80Gi \
  gpu-compute-capability=9.0 \
  nvidia.com/gpu.family=hopper

Best Practice 4: GPU Node Taints and Tolerations

Implement taints to ensure only GPU workloads are scheduled on expensive GPU nodes:

# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 nvidia.com/gpu:NoSchedule
kubectl taint nodes gpu-node-2 nvidia.com/gpu:NoSchedule  
kubectl taint nodes gpu-node-3 nvidia.com/gpu:NoSchedule

# Alternative: Taint by GPU type
kubectl taint nodes gpu-node-1 accelerator=nvidia-tesla-v100:NoSchedule

GPU Resource Allocation Strategies

Best Practice 5: Proper Resource Specification

GPUs should only be specified in the limits section, which means: you can specify GPU limits without specifying requests, because Kubernetes will use the limit as the request value by default; you can specify GPUs in both limits and requests, but the two values must be equal; and you cannot specify GPU requests without specifying limits.
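The three rules above can be captured in a small validation helper. This is an illustrative sketch, not part of any Kubernetes client library; it simply mimics how the API server treats extended resources such as `nvidia.com/gpu`:

```python
def validate_gpu_resources(limits=None, requests=None, resource="nvidia.com/gpu"):
    """Mimic the API-server rules for extended resources like nvidia.com/gpu."""
    limit = (limits or {}).get(resource)
    request = (requests or {}).get(resource)
    if request is not None and limit is None:
        return False, "GPU requests without limits are not allowed"
    if request is not None and limit is not None and request != limit:
        return False, "GPU requests and limits must be equal"
    # Limits without requests are fine: the limit is used as the request.
    return True, "ok"

print(validate_gpu_resources(limits={"nvidia.com/gpu": 1}))   # valid
print(validate_gpu_resources(limits={"nvidia.com/gpu": 2},
                             requests={"nvidia.com/gpu": 1})) # invalid: unequal
print(validate_gpu_resources(requests={"nvidia.com/gpu": 1})) # invalid: no limit
```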

Basic GPU Resource Request

# basic-gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
  labels:
    app: ml-training
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: training-container
    image: nvcr.io/nvidia/tensorflow:24.01-tf2-py3
    command: ["python", "-c"]
    args: 
    - |
      import tensorflow as tf
      print("GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
      # Your training code here
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 whole GPU
        memory: 16Gi
        cpu: 8
      requests:
        nvidia.com/gpu: 1  # Must match limits for GPUs
        memory: 8Gi
        cpu: 4
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"

Multi-GPU Workload Configuration

# multi-gpu-workload.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distributed-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: distributed-training
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - distributed-training
              topologyKey: kubernetes.io/hostname
      containers:
      - name: training-worker
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs per pod
            memory: 64Gi
            cpu: 32
          requests:
            nvidia.com/gpu: 4
            memory: 32Gi
            cpu: 16
        env:
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi

Multi-Instance GPU (MIG) Configuration

Best Practice 6: MIG Profile Configuration

MIG allows you to partition a GPU into several smaller, predefined instances, each of which looks like a mini-GPU that provides memory and fault isolation at the hardware layer.
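Each MIG profile consumes a fixed number of compute slices, so a layout can be sanity-checked before it is applied. The sketch below is illustrative: it hard-codes the A100 40GB budget of 7 compute slices, and the profile names mirror the ConfigMap that follows.

```python
# Compute slices consumed per MIG profile on an A100 40GB (7 slices total).
SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "4g.20gb": 4, "7g.40gb": 7}

def fits_a100(mig_devices, budget=7):
    """Return True if the requested profile counts fit within the slice budget."""
    used = sum(SLICES[profile] * count for profile, count in mig_devices.items())
    return used <= budget

print(fits_a100({"1g.5gb": 7}))                              # True: 7 x 1 slice
print(fits_a100({"1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1}))  # True: 2 + 2 + 3 = 7
print(fits_a100({"2g.10gb": 4}))                             # False: 8 slices > 7
```

The mixed-config example in the ConfigMap below (two 1g.5gb, one 2g.10gb, one 3g.20gb) exactly fills the 7-slice budget.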

MIG ConfigMap Setup

# mig-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            1g.5gb: 7
      all-2g.10gb:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            2g.10gb: 3
      mixed-config:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            1g.5gb: 2
            2g.10gb: 1
            3g.20gb: 1
      all-disabled:
        - devices: [0]
          mig-enabled: false

Apply MIG Configuration

# Apply MIG configuration
kubectl create -n gpu-operator -f mig-config.yaml

# Update ClusterPolicy to use MIG
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  --patch '{"spec": {"migManager": {"config": {"name": "mig-config", "default": "all-disabled"}}}}'

MIG Workload Example

# mig-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    nvidia.com/mig.config: mixed-config
  containers:
  - name: inference-container
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # Request one 1g.5gb MIG slice
        memory: 8Gi
        cpu: 4

GPU Time-Slicing Implementation

Best Practice 7: Time-Slicing Configuration

Time-slicing enables a system administrator to define a set of replicas for a GPU, each of which can be handed out independently to a pod to run workloads on. Unlike Multi-Instance GPU (MIG), there is no memory or fault isolation between replicas, but for some workloads this is better than not being able to share at all.
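Time-slicing multiplies the advertised capacity, not the physical one. As a quick illustration of the arithmetic (the node names and GPU counts here are hypothetical), with `replicas: 4` each physical GPU is advertised as four schedulable `nvidia.com/gpu` resources:

```python
def advertised_gpus(nodes, replicas):
    """Schedulable nvidia.com/gpu capacity when each physical GPU
    is split into `replicas` time-sliced shares."""
    return {node: physical * replicas for node, physical in nodes.items()}

# Hypothetical cluster: two nodes with 8 and 4 physical GPUs, time-sliced 4 ways.
cluster = {"gpu-node-1": 8, "gpu-node-2": 4}
print(advertised_gpus(cluster, replicas=4))  # {'gpu-node-1': 32, 'gpu-node-2': 16}
```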

Cluster-Wide Time-Slicing

# time-slicing-config-all.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Allow 4 containers per GPU

Node-Specific Time-Slicing

# time-slicing-config-fine.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-fine
  namespace: gpu-operator
data:
  a100-80gb: |-
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 8
        - name: nvidia.com/mig-1g.5gb
          replicas: 2
        - name: nvidia.com/mig-2g.10gb
          replicas: 2
        - name: nvidia.com/mig-3g.20gb
          replicas: 3
        - name: nvidia.com/mig-7g.40gb
          replicas: 7
  tesla-v100: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
  tesla-t4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2

Apply Time-Slicing Configuration

# Create the ConfigMap
kubectl create -n gpu-operator -f time-slicing-config-fine.yaml

# Configure the device plugin to use time-slicing
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  --patch '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-fine"}}}}'

# Label nodes for specific configurations
kubectl label nodes gpu-node-1 nvidia.com/device-plugin.config=a100-80gb
kubectl label nodes gpu-node-2 nvidia.com/device-plugin.config=tesla-v100
kubectl label nodes gpu-node-3 nvidia.com/device-plugin.config=tesla-t4

Time-Sliced Workload

# time-sliced-workload.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 8  # Can exceed physical GPU count due to time-slicing
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: inference-server
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        ports:
        - containerPort: 8000
        - containerPort: 8001
        - containerPort: 8002
        resources:
          limits:
            nvidia.com/gpu: 1  # Each replica receives a time-sliced share of a physical GPU
            memory: 4Gi
            cpu: 2
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: TRITON_MODEL_REPOSITORY
          value: "/models"
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 5

Resource Quotas and Limits

Best Practice 8: GPU Resource Quotas

Taking GPUs as an example: if the resource name is nvidia.com/gpu and you want to cap the total number of GPUs requested in a namespace, you can define a ResourceQuota as follows.
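Conceptually, quota admission is a running sum: a new pod is admitted only if current namespace usage plus its request stays within the hard limit. A minimal sketch of that check (the 8-GPU limit matches the training quota defined below):

```python
def admits(pod_gpu_request, current_usage, hard_limit=8):
    """Mimic ResourceQuota admission for requests.nvidia.com/gpu."""
    return current_usage + pod_gpu_request <= hard_limit

print(admits(2, current_usage=4))  # True: 4 + 2 = 6 <= 8
print(admits(4, current_usage=6))  # False: 6 + 4 = 10 > 8
```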

Namespace Resource Quotas

# gpu-resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"      # Max 8 GPUs total
    limits.nvidia.com/gpu: "8"        # Must match requests
    requests.nvidia.com/mig-1g.5gb: "4"  # Max 4 MIG 1g.5gb slices
    requests.nvidia.com/mig-2g.10gb: "2" # Max 2 MIG 2g.10gb slices
    requests.cpu: "64"                # CPU limits
    requests.memory: "256Gi"          # Memory limits
    persistentvolumeclaims: "10"      # PVC limits
    pods: "20"                        # Pod limits
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-quota
  namespace: ml-inference
spec:
  hard:
    requests.nvidia.com/gpu: "4"      # Smaller quota for inference
    limits.nvidia.com/gpu: "4"
    requests.cpu: "32"
    requests.memory: "128Gi"
    persistentvolumeclaims: "5"
    pods: "50"                        # More pods for inference

LimitRange for GPU Workloads

# gpu-limit-ranges.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-training
spec:
  limits:
  - type: Container
    default:
      nvidia.com/gpu: "1"
      memory: "8Gi"
      cpu: "4"
    defaultRequest:
      nvidia.com/gpu: "1"
      memory: "4Gi"
      cpu: "2"
    max:
      nvidia.com/gpu: "8"      # Max GPUs per container
      memory: "64Gi"
      cpu: "32"
    min:
      nvidia.com/gpu: "1"      # Min GPUs per container
      memory: "1Gi"
      cpu: "1"
  - type: Pod
    max:
      nvidia.com/gpu: "8"      # Max GPUs per pod
      memory: "128Gi"
      cpu: "64"

Node Affinity and Scheduling

Best Practice 9: Advanced GPU Scheduling

Node Affinity for GPU Types

# gpu-node-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-training-a100
spec:
  replicas: 2
  selector:
    matchLabels:
      app: training-a100
  template:
    metadata:
      labels:
        app: training-a100
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-tesla-a100
              - key: gpu-memory
                operator: In
                values:
                - "80Gi"
              - key: gpu-compute-capability  # matches the manual labels above; Gt/Lt accept only integer values
                operator: In
                values:
                - "8.0"
                - "9.0"
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: nvidia.com/gpu.family
                operator: In
                values:
                - ampere
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - training-a100
              topologyKey: kubernetes.io/hostname
      containers:
      - name: training
        image: nvcr.io/nvidia/pytorch:24.01-py3
        resources:
          limits:
            nvidia.com/gpu: 4
            memory: 64Gi
            cpu: 32

Priority Classes for GPU Workloads

# gpu-priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-gpu
value: 1000
globalDefault: false
description: "High priority for critical GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority-gpu
value: 500
globalDefault: false
description: "Medium priority for standard GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-gpu
value: 100
globalDefault: false
description: "Low priority for batch GPU workloads"

High-Priority GPU Workload

# priority-gpu-workload.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: urgent-training-job
spec:
  template:
    spec:
      priorityClassName: high-priority-gpu
      restartPolicy: Never
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: training
        image: nvcr.io/nvidia/tensorflow:24.01-tf2-py3
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: 32Gi
          requests:
            nvidia.com/gpu: 2
            memory: 16Gi
        command: ["python"]
        args: ["/workspace/train.py", "--epochs=100", "--batch-size=64"]

Monitoring and Observability

Best Practice 10: DCGM Monitoring Setup

NVIDIA DCGM is a set of tools for managing and monitoring NVIDIA GPUs in large-scale, Linux-based cluster environments. It is a low-overhead tool that can perform a variety of functions, including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration, and accounting.
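dcgm-exporter publishes these fields in the Prometheus text exposition format on port 9400. The snippet below is an illustrative parser for a couple of sample lines; the label set shown is typical of dcgm-exporter output, but treat the exact labels as an assumption:

```python
import re

def parse_dcgm(text):
    """Extract (metric, gpu, value) tuples from Prometheus text-format lines."""
    out = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        m = re.match(r'(\w+)\{([^}]*)\}\s+([\d.]+)', line)
        if not m:
            continue
        name, labels, value = m.groups()
        gpu = dict(re.findall(r'(\w+)="([^"]*)"', labels)).get("gpu", "?")
        out.append((name, gpu, float(value)))
    return out

sample = '''DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-abc"} 64'''
print(parse_dcgm(sample))
# [('DCGM_FI_DEV_GPU_UTIL', '0', 87.0), ('DCGM_FI_DEV_GPU_TEMP', '0', 64.0)]
```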

DCGM Exporter Configuration

# dcgm-exporter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-metrics
  namespace: gpu-operator
data:
  dcp-metrics-included.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message
    
    # Clocks
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    
    # Temperature
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    
    # Power
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    
    # Utilization
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics/Compute engine activity (in %).
    DCGM_FI_PROF_SM_ACTIVE, gauge, Streaming Multiprocessor activity (in %).
    DCGM_FI_PROF_SM_OCCUPANCY, gauge, Streaming Multiprocessor occupancy (in %).
    
    # Memory
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_FB_TOTAL, gauge, Total framebuffer memory (in MiB).
    
    # XID errors
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    
    # PCIe
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    
    # NVLink
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.

Prometheus ServiceMonitor

# dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      targetLabel: node
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace

GPU Alerts Configuration

# gpu-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    interval: 30s
    rules:
    - alert: HighGPUMemoryUsage
      expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High GPU memory usage on {{ $labels.node }}"
        description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has {{ $value }}% memory usage"
    
    - alert: HighGPUUtilization
      expr: DCGM_FI_DEV_GPU_UTIL > 95
      for: 10m
      labels:
        severity: info
      annotations:
        summary: "High GPU utilization on {{ $labels.node }}"
        description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has {{ $value }}% utilization"
    
    - alert: GPUTemperatureHigh
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "GPU temperature high on {{ $labels.node }}"
        description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} temperature is {{ $value }}°C"
    
    - alert: GPUXIDErrors
      expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "GPU XID errors detected on {{ $labels.node }}"
        description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has XID errors"
    
    - alert: LowGPUUtilization
      expr: DCGM_FI_DEV_GPU_UTIL < 10
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Low GPU utilization on {{ $labels.node }}"
        description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has only {{ $value }}% utilization"

Production Deployment Patterns

Best Practice 11: Multi-Tenant GPU Cluster

# multi-tenant-setup.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-research
  labels:
    gpu-tier: "premium"
    cost-center: "research"
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-development
  labels:
    gpu-tier: "standard"
    cost-center: "engineering"
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-inference
  labels:
    gpu-tier: "shared"
    cost-center: "production"
---
# Research team gets dedicated A100 nodes
apiVersion: v1
kind: ResourceQuota
metadata:
  name: research-gpu-quota
  namespace: team-research
spec:
  hard:
    requests.nvidia.com/gpu: "16"
    limits.nvidia.com/gpu: "16"
    requests.cpu: "128"
    requests.memory: "512Gi"
---
# Development team gets mixed GPU access
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-gpu-quota
  namespace: team-development
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
    requests.cpu: "64"
    requests.memory: "256Gi"
---
# Inference team gets time-sliced GPUs
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: team-inference
spec:
  hard:
    requests.nvidia.com/gpu: "32"  # Higher due to time-slicing
    limits.nvidia.com/gpu: "32"
    requests.cpu: "128"
    requests.memory: "512Gi"

Best Practice 12: AutoScaling GPU Workloads

# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-inference-hpa
  namespace: team-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: gpu_utilization  # custom metric; requires an adapter (e.g. prometheus-adapter) exposing DCGM data
      target:
        type: AverageValue
        averageValue: "75"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

Best Practice 13: Cluster Autoscaling for GPU Nodes

# cluster-autoscaler-config.yaml
# Note: the Cluster Autoscaler reads its tuning options from command-line flags
# on its Deployment (below), not from a ConfigMap. The cluster-autoscaler-status
# ConfigMap is written by the autoscaler itself and is not a place to set options.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-gpu
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler-gpu
  template:
    metadata:
      labels:
        app: cluster-autoscaler-gpu
    spec:
      serviceAccountName: cluster-autoscaler  # assumes the standard autoscaler RBAC manifests are installed
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws  # Adjust for your cloud provider
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/gpu-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=5m
        - --scale-down-gpu-utilization-threshold=0.5
        env:
        - name: AWS_REGION
          value: us-west-2

Troubleshooting and Optimization

Best Practice 14: Common Issues and Solutions

GPU Driver Issues

# Check GPU driver installation
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Verify GPU discovery
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'

# Check device plugin status
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Verify GPU is visible in pod
kubectl exec -it <pod-name> -- nvidia-smi

Resource Allocation Debugging

# debug-gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-debug
spec:
  restartPolicy: Never
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: gpu-debug
    image: nvcr.io/nvidia/cuda:12.3.2-runtime-ubuntu22.04
    command: ["/bin/bash"]
    args: ["-c", "while true; do nvidia-smi; sleep 30; done"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility,graphics"

Performance Optimization Script

#!/bin/bash
# gpu-optimization.sh

echo "=== GPU Cluster Optimization Report ==="

# Check GPU utilization across cluster
echo "GPU Utilization by Node:"
kubectl get pods -n gpu-operator -o wide | grep nvidia-dcgm-exporter | while read line; do
  POD=$(echo $line | awk '{print $1}')
  NODE=$(echo $line | awk '{print $7}')
  echo "Node: $NODE"
  kubectl exec -n gpu-operator $POD -- curl -s localhost:9400/metrics | grep "DCGM_FI_DEV_GPU_UTIL" | head -5
  echo "---"
done

# Check pending GPU pods
echo "Pending GPU Pods:"
kubectl get pods --all-namespaces -o wide | grep Pending | while read line; do
  NAMESPACE=$(echo $line | awk '{print $1}')
  POD=$(echo $line | awk '{print $2}')
  if kubectl describe pod $POD -n $NAMESPACE | grep -q "nvidia.com/gpu"; then
    echo "GPU Pod Pending: $NAMESPACE/$POD"
    kubectl describe pod $POD -n $NAMESPACE | grep -A 5 "Events:"
  fi
done

# Check GPU node capacity
echo "GPU Node Capacity:"
kubectl describe nodes | grep -A 5 -B 5 "nvidia.com/gpu"

# Check time-slicing configuration
echo "Time-Slicing Status:"
kubectl get configmap -n gpu-operator | grep time-slicing

Best Practice 15: Performance Tuning

GPU Memory Optimization

# memory-optimized-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-optimized-training
spec:
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 32Gi
    env:
    - name: PYTORCH_CUDA_ALLOC_CONF
      value: "max_split_size_mb:128"
    - name: CUDA_LAUNCH_BLOCKING
      value: "0"
    - name: CUDA_CACHE_MAXSIZE
      value: "2147483647"
    - name: PYTHONUNBUFFERED
      value: "1"
    command: ["python"]
    args: 
    - "-c"
    - |
      import torch
      import gc
      
      # Enable memory optimization
      torch.backends.cudnn.benchmark = True
      torch.backends.cudnn.deterministic = False
      
      # Use memory efficient attention
      torch.backends.cuda.enable_flash_sdp(True)
      
      # Your training code with memory management
      device = torch.cuda.current_device()
      torch.cuda.set_per_process_memory_fraction(0.95, device)
      
      # Training loop with periodic cleanup
      for epoch in range(100):
          # Training code here
          if epoch % 10 == 0:
              gc.collect()
              torch.cuda.empty_cache()

Conclusion

Implementing proper GPU resource management in Kubernetes requires careful attention to hardware configuration, software setup, resource allocation, and monitoring. The best practices outlined in this guide provide a comprehensive framework for organizations to:

  1. Efficiently provision and manage NVIDIA GPUs using the GPU Operator
  2. Optimize resource utilization through MIG and time-slicing strategies
  3. Implement proper scheduling with node affinity and tolerations
  4. Monitor performance and health using DCGM and Prometheus
  5. Scale workloads effectively while maintaining cost control

As GPU acceleration continues to provide 10-100x performance improvements over CPU-only processing, following these validated best practices ensures your Kubernetes clusters can efficiently support demanding AI/ML workloads while maximizing hardware investments.

For the latest updates and community discussions, refer to the official Kubernetes GPU documentation and the NVIDIA GPU Operator documentation.

Have Queries? Join https://launchpass.com/collabnix
