As artificial intelligence and machine learning workloads continue to dominate modern computing infrastructure, efficiently managing GPU resources in Kubernetes clusters has become critical for organizations looking to maximize performance while controlling costs. With GPU acceleration providing 10-100x performance improvements over CPU-only processing and 48% of organizations now using Kubernetes for AI/ML workloads, implementing proper GPU resource management practices is essential for production-ready infrastructure.
This comprehensive guide covers the latest best practices for managing NVIDIA GPUs in multi-node Kubernetes clusters, including installation, configuration, optimization, and monitoring strategies validated against official Kubernetes documentation and industry implementations.
GPU Operator Installation and Configuration
Best Practice 1: NVIDIA GPU Operator Deployment
The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all the NVIDIA software components needed to provision GPUs. These components include the NVIDIA driver (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling, DCGM-based monitoring, and others.
Prerequisites Setup
Before installing the GPU Operator, ensure your cluster meets these requirements:
```bash
# Verify Node Feature Discovery (NFD) status
kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

# If the output is true, NFD is already running;
# if false, NFD will be deployed by the GPU Operator
```
Installation with Helm
```yaml
# gpu-operator-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    pod-security.kubernetes.io/enforce: privileged
```

```bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create the namespace (if not applied above) with the proper security policy
kubectl create namespace gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version=v25.3.0 \
  --wait
```
Verification
```bash
# Verify all GPU Operator components are running
kubectl get pods -n gpu-operator

# Expected output should include:
# - gpu-operator-*
# - gpu-feature-discovery-*
# - nvidia-container-toolkit-daemonset-*
# - nvidia-dcgm-exporter-*
# - nvidia-device-plugin-daemonset-*
# - nvidia-driver-daemonset-*
```
Best Practice 2: Custom Configuration for Enterprise Environments
For production environments, customize the GPU Operator deployment:
```yaml
# gpu-operator-custom-values.yaml
# A plain Helm values file (passed with -f, so no ConfigMap wrapper)
operator:
  defaultRuntime: containerd
  runtimeClass: nvidia
driver:
  version: "570.86.15"  # Pin to a tested driver version
  repository: nvcr.io/nvidia
  usePrecompiled: true
toolkit:
  version: v1.16.1-ubi8
devicePlugin:
  version: v0.14.5
  config:
    name: ""  # Will be set for time-slicing later
dcgmExporter:
  version: 3.3.0-3.1.8
  serviceMonitor:
    enabled: true
migManager:
  enabled: true
  config:
    name: ""
nodeStatusExporter:
  enabled: true
gfd:
  version: v0.8.2
```

```bash
# Install with the custom configuration
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version=v25.3.0 \
  -f gpu-operator-custom-values.yaml \
  --wait
```
Node Labeling and GPU Discovery
Best Practice 3: Automated Node Labeling with NFD
As an administrator, you can automatically discover and label all of your GPU-enabled nodes by deploying Node Feature Discovery (NFD), which detects the hardware features available on each node in a Kubernetes cluster.
NFD Configuration for GPU Nodes
```yaml
# nfd-gpu-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nfd-gpu-config
  namespace: gpu-operator
data:
  nfd-worker.conf: |
    core:
      labelWhiteList: "^feature.node.kubernetes.io/"
    sources:
      pci:
        deviceClassWhitelist:
          - "03"  # Display controllers (GPUs)
          - "12"  # Processing accelerators
        deviceLabelFields:
          - vendor
          - class
          - subsystem_vendor
          - subsystem_device
      custom:
        - name: "nvidia-gpu"
          matchOn:
            - pciId:
                vendor: ["10de"]  # NVIDIA vendor ID
          labels:
            nvidia.com/gpu: "present"
            nvidia.com/gpu.family: "{{.PCI_DEVICE_ID}}"
```
Manual Node Labeling for Specific GPU Types
```bash
# Label nodes with specific GPU models for targeted scheduling
kubectl label nodes gpu-node-1 \
  accelerator=nvidia-tesla-v100 \
  gpu-memory=32Gi \
  gpu-compute-capability=7.0 \
  nvidia.com/gpu.family=tesla

kubectl label nodes gpu-node-2 \
  accelerator=nvidia-tesla-a100 \
  gpu-memory=80Gi \
  gpu-compute-capability=8.0 \
  nvidia.com/gpu.family=ampere

kubectl label nodes gpu-node-3 \
  accelerator=nvidia-tesla-h100 \
  gpu-memory=80Gi \
  gpu-compute-capability=9.0 \
  nvidia.com/gpu.family=hopper
```
Best Practice 4: GPU Node Taints and Tolerations
Implement taints to ensure only GPU workloads are scheduled on expensive GPU nodes:
```bash
# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 nvidia.com/gpu:NoSchedule
kubectl taint nodes gpu-node-2 nvidia.com/gpu:NoSchedule
kubectl taint nodes gpu-node-3 nvidia.com/gpu:NoSchedule

# Alternative: taint by GPU type
kubectl taint nodes gpu-node-1 accelerator=nvidia-tesla-v100:NoSchedule
```
GPU Resource Allocation Strategies
Best Practice 5: Proper Resource Specification
Kubernetes treats `nvidia.com/gpu` as an extended resource, which constrains how it can be specified: GPUs belong in the limits section. You can specify a GPU limit without a request (Kubernetes uses the limit as the request by default); you can specify GPUs in both limits and requests, but the two values must be equal; and you cannot specify a GPU request without a limit.
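These rules reduce to a small check. The sketch below is illustrative only — a mimic of the admission behavior for extended resources, not real API-server code:

```python
def valid_gpu_spec(requests: dict, limits: dict) -> bool:
    """Sketch of the extended-resource rules for nvidia.com/gpu:
    a request without a limit is rejected, and a request, if given,
    must equal the limit (omitting it defaults it to the limit)."""
    gpu = "nvidia.com/gpu"
    if gpu in requests and gpu not in limits:
        return False  # request without limit: rejected
    if gpu in requests and requests[gpu] != limits[gpu]:
        return False  # request must equal limit
    return True

assert valid_gpu_spec({}, {"nvidia.com/gpu": 1})                     # limit only: OK
assert valid_gpu_spec({"nvidia.com/gpu": 1}, {"nvidia.com/gpu": 1})  # equal: OK
assert not valid_gpu_spec({"nvidia.com/gpu": 1}, {"nvidia.com/gpu": 2})
assert not valid_gpu_spec({"nvidia.com/gpu": 1}, {})                 # request only: rejected
```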
Basic GPU Resource Request
```yaml
# basic-gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
  labels:
    app: ml-training
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: training-container
      image: nvcr.io/nvidia/tensorflow:24.01-tf2-py3
      command: ["python", "-c"]
      args:
        - |
          import tensorflow as tf
          print("GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
          # Your training code here
      resources:
        limits:
          nvidia.com/gpu: 1  # Request 1 whole GPU
          memory: 16Gi
          cpu: 8
        requests:
          nvidia.com/gpu: 1  # Must match limits for GPUs
          memory: 8Gi
          cpu: 4
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
```
Multi-GPU Workload Configuration
```yaml
# multi-gpu-workload.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distributed-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: distributed-training
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - distributed-training
                topologyKey: kubernetes.io/hostname
      containers:
        - name: training-worker
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 4  # Request 4 GPUs per pod
              memory: 64Gi
              cpu: 32
            requests:
              nvidia.com/gpu: 4
              memory: 32Gi
              cpu: 16
          env:
            - name: NCCL_DEBUG
              value: "INFO"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
```
Multi-Instance GPU (MIG) Configuration
Best Practice 6: MIG Profile Configuration
MIG allows you to partition a GPU into several smaller, predefined instances, each of which looks like a mini-GPU that provides memory and fault isolation at the hardware layer.
MIG ConfigMap Setup
```yaml
# mig-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            1g.5gb: 7
      all-2g.10gb:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            2g.10gb: 3
      mixed-config:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            1g.5gb: 2
            2g.10gb: 1
            3g.20gb: 1
      all-disabled:
        - devices: [0]
          mig-enabled: false
```
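As a sanity check before applying a profile such as mixed-config, it helps to verify the requested slices fit the GPU's slice budget. The sketch below hardcodes a simplified A100-40GB geometry (7 compute slices, 8 memory slices); the per-profile costs are illustrative approximations and ignore MIG's positional placement rules:

```python
# Simplified A100-40GB budget: 7 compute slices, 8 x 5GB memory slices.
# Profile costs are (compute slices, memory slices) — illustrative only.
PROFILES = {"1g.5gb": (1, 1), "2g.10gb": (2, 2), "3g.20gb": (3, 4), "7g.40gb": (7, 8)}

def fits_a100_40gb(mig_devices: dict) -> bool:
    """Check a mig-devices map against the slice budget
    (ignores MIG's positional placement constraints)."""
    compute = sum(PROFILES[p][0] * n for p, n in mig_devices.items())
    memory = sum(PROFILES[p][1] * n for p, n in mig_devices.items())
    return compute <= 7 and memory <= 8

assert fits_a100_40gb({"1g.5gb": 7})                              # all-1g.5gb
assert fits_a100_40gb({"1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1})  # mixed-config
assert not fits_a100_40gb({"2g.10gb": 4})                         # over the compute budget
```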
Apply MIG Configuration
```bash
# Apply the MIG configuration
kubectl create -n gpu-operator -f mig-config.yaml

# Update the ClusterPolicy to use MIG
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  --patch '{"spec": {"migManager": {"config": {"name": "mig-config", "default": "all-disabled"}}}}'
```
MIG Workload Example
```yaml
# mig-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    nvidia.com/mig.config: mixed-config
  containers:
    - name: inference-container
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # Request one 1g.5gb MIG slice
          memory: 8Gi
          cpu: 4
      env:
        - name: CUDA_MPS_PIPE_DIRECTORY
          value: "/tmp/nvidia-mps"
        - name: CUDA_MPS_LOG_DIRECTORY
          value: "/tmp/nvidia-log"
```
GPU Time-Slicing Implementation
Best Practice 7: Time-Slicing Configuration
Time-slicing lets a system administrator define a set of replicas for a GPU, each of which can be handed out independently to a pod to run workloads on. Unlike Multi-Instance GPU (MIG), there is no memory or fault isolation between replicas, but for some workloads this is better than not being able to share at all.
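The scheduling arithmetic is simple: the device plugin advertises the physical GPU count multiplied by the configured replica count, and each pod still requests `nvidia.com/gpu: 1`. A back-of-the-envelope sketch (illustrative Python, not part of any NVIDIA tooling):

```python
def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """With time-slicing, the device plugin advertises `replicas`
    shares per physical GPU as schedulable nvidia.com/gpu resources."""
    return physical_gpus * replicas

# A node with 2 physical GPUs and a replicas: 4 config
# advertises 8 schedulable nvidia.com/gpu resources.
assert advertised_gpus(2, 4) == 8
```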
Cluster-Wide Time-Slicing
```yaml
# time-slicing-config-all.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Allow 4 containers per GPU
```
Node-Specific Time-Slicing
```yaml
# time-slicing-config-fine.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-fine
  namespace: gpu-operator
data:
  a100-80gb: |-
    version: v1
    flags:
      migStrategy: mixed
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
          - name: nvidia.com/mig-1g.5gb
            replicas: 2
          - name: nvidia.com/mig-2g.10gb
            replicas: 2
          - name: nvidia.com/mig-3g.20gb
            replicas: 3
          - name: nvidia.com/mig-7g.40gb
            replicas: 7
  tesla-v100: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
  tesla-t4: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```
Apply Time-Slicing Configuration
```bash
# Create the ConfigMap
kubectl create -n gpu-operator -f time-slicing-config-fine.yaml

# Configure the device plugin to use time-slicing
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  --patch '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config-fine"}}}}'

# Label nodes for specific configurations
kubectl label nodes gpu-node-1 nvidia.com/device-plugin.config=a100-80gb
kubectl label nodes gpu-node-2 nvidia.com/device-plugin.config=tesla-v100
kubectl label nodes gpu-node-3 nvidia.com/device-plugin.config=tesla-t4
```
Time-Sliced Workload
```yaml
# time-sliced-workload.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 8  # Can exceed the physical GPU count due to time-slicing
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: inference-server
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
            - containerPort: 8001
            - containerPort: 8002
          resources:
            limits:
              nvidia.com/gpu: 1  # Each replica gets a time-sliced share of a physical GPU
              memory: 4Gi
              cpu: 2
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: TRITON_MODEL_REPOSITORY
              value: "/models"
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 5
```
Resource Quotas and Limits
Best Practice 8: GPU Resource Quotas
ResourceQuota objects work with extended resources as well: to cap the total number of `nvidia.com/gpu` requested in a namespace, define a quota on `requests.nvidia.com/gpu` as follows.
Namespace Resource Quotas
```yaml
# gpu-resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"          # Max 8 GPUs total
    limits.nvidia.com/gpu: "8"            # Must match requests
    requests.nvidia.com/mig-1g.5gb: "4"   # Max 4 MIG 1g.5gb slices
    requests.nvidia.com/mig-2g.10gb: "2"  # Max 2 MIG 2g.10gb slices
    requests.cpu: "64"                    # CPU limits
    requests.memory: "256Gi"              # Memory limits
    persistentvolumeclaims: "10"          # PVC limits
    pods: "20"                            # Pod limits
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-quota
  namespace: ml-inference
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # Smaller quota for inference
    limits.nvidia.com/gpu: "4"
    requests.cpu: "32"
    requests.memory: "128Gi"
    persistentvolumeclaims: "5"
    pods: "50"  # More pods for inference
```
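Quota admission is simple arithmetic: a pod is admitted only when its aggregated requests, added to current namespace usage, stay within every hard limit the quota names. A minimal sketch of that check (illustrative, not the real quota controller):

```python
def fits_quota(used: dict, request: dict, hard: dict) -> bool:
    """A pod fits if, for every resource the quota constrains,
    current usage plus the pod's request stays within the hard limit."""
    return all(used.get(res, 0) + qty <= hard[res]
               for res, qty in request.items() if res in hard)

hard = {"requests.nvidia.com/gpu": 8, "pods": 20}
used = {"requests.nvidia.com/gpu": 6, "pods": 3}
assert fits_quota(used, {"requests.nvidia.com/gpu": 2, "pods": 1}, hard)      # 8/8 GPUs: admitted
assert not fits_quota(used, {"requests.nvidia.com/gpu": 3, "pods": 1}, hard)  # 9/8 GPUs: rejected
```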
LimitRange for GPU Workloads
```yaml
# gpu-limit-ranges.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-training
spec:
  limits:
    - type: Container
      default:
        nvidia.com/gpu: "1"
        memory: "8Gi"
        cpu: "4"
      defaultRequest:
        nvidia.com/gpu: "1"
        memory: "4Gi"
        cpu: "2"
      max:
        nvidia.com/gpu: "8"  # Max GPUs per container
        memory: "64Gi"
        cpu: "32"
      min:
        nvidia.com/gpu: "1"  # Min GPUs per container
        memory: "1Gi"
        cpu: "1"
    - type: Pod
      max:
        nvidia.com/gpu: "8"  # Max GPUs per pod
        memory: "128Gi"
        cpu: "64"
```
Node Affinity and Scheduling
Best Practice 9: Advanced GPU Scheduling
Node Affinity for GPU Types
```yaml
# gpu-node-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-training-a100
spec:
  replicas: 2
  selector:
    matchLabels:
      app: training-a100
  template:
    metadata:
      labels:
        app: training-a100
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator
                    operator: In
                    values:
                      - nvidia-tesla-a100
                  - key: gpu-memory
                    operator: In
                    values:
                      - "80Gi"
                  # Gt only parses integer values, so match exact
                  # compute-capability labels instead
                  - key: gpu-compute-capability
                    operator: In
                    values:
                      - "8.0"
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: nvidia.com/gpu.family
                    operator: In
                    values:
                      - ampere
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - training-a100
                topologyKey: kubernetes.io/hostname
      containers:
        - name: training
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/gpu: 4
              memory: 64Gi
              cpu: 32
```
Priority Classes for GPU Workloads
```yaml
# gpu-priority-classes.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-gpu
value: 1000
globalDefault: false
description: "High priority for critical GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority-gpu
value: 500
globalDefault: false
description: "Medium priority for standard GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-gpu
value: 100
globalDefault: false
description: "Low priority for batch GPU workloads"
```
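When a pending high-priority pod cannot fit on any node, the scheduler considers evicting lower-priority pods first. A minimal sketch of that ordering using the class values defined above (the pod names are hypothetical):

```python
# Values from the PriorityClasses above; pod names are hypothetical.
PRIORITY = {"high-priority-gpu": 1000, "medium-priority-gpu": 500, "low-priority-gpu": 100}

def preemption_candidates(pods):
    """Sort running pods lowest-priority-first: the order in which
    the scheduler considers them as preemption victims."""
    return sorted(pods, key=lambda p: PRIORITY[p[1]])

pods = [("batch-job", "low-priority-gpu"),
        ("serving", "high-priority-gpu"),
        ("etl", "medium-priority-gpu")]
assert [name for name, _ in preemption_candidates(pods)] == ["batch-job", "etl", "serving"]
```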
High-Priority GPU Workload
```yaml
# priority-gpu-workload.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: urgent-training-job
spec:
  template:
    spec:
      priorityClassName: high-priority-gpu
      restartPolicy: Never
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: training
          image: nvcr.io/nvidia/tensorflow:24.01-tf2-py3
          resources:
            limits:
              nvidia.com/gpu: 2
              memory: 32Gi
            requests:
              nvidia.com/gpu: 2
              memory: 16Gi
          command: ["python"]
          args: ["/workspace/train.py", "--epochs=100", "--batch-size=64"]
```
Monitoring and Observability
Best Practice 10: DCGM Monitoring Setup
NVIDIA DCGM is a set of tools for managing and monitoring NVIDIA GPUs in large-scale, Linux-based cluster environments. It's a low-overhead tool that can perform a variety of functions including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration, and accounting.
DCGM Exporter Configuration
```yaml
# dcgm-exporter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-exporter-metrics
  namespace: gpu-operator
data:
  dcp-metrics-included.csv: |
    # Format: DCGM field, Prometheus metric type, help message
    # Lines starting with '#' are treated as comments
    # Clocks
    DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    # Temperature
    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    # Power
    DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    # Utilization
    DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics/Compute engine activity (in %).
    DCGM_FI_PROF_SM_ACTIVE, gauge, Streaming Multiprocessor activity (in %).
    DCGM_FI_PROF_SM_OCCUPANCY, gauge, Streaming Multiprocessor occupancy (in %).
    # Memory
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    DCGM_FI_DEV_FB_TOTAL, gauge, Total framebuffer memory (in MiB).
    # XID errors
    DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
    # PCIe
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    # NVLink
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
```
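The exporter serves these fields in Prometheus exposition format on port 9400. A minimal sketch of deriving memory usage from the raw text (the sample output and label sets are illustrative, and labels are collapsed for brevity, whereas a real scrape would keep per-GPU labels):

```python
def parse_metrics(text: str) -> dict:
    """Parse a Prometheus exposition snippet into {metric_name: value}.
    Collapses label sets — keeps the last sample seen per metric."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        out[name.split("{", 1)[0]] = float(value)
    return out

# Illustrative dcgm-exporter output for a single GPU
sample = """
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="gpu-node-1"} 30000
DCGM_FI_DEV_FB_TOTAL{gpu="0",Hostname="gpu-node-1"} 40000
"""
m = parse_metrics(sample)
usage_pct = 100 * m["DCGM_FI_DEV_FB_USED"] / m["DCGM_FI_DEV_FB_TOTAL"]
assert usage_pct == 75.0  # mirrors the HighGPUMemoryUsage alert expression below
```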
Prometheus ServiceMonitor
```yaml
# dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    app.kubernetes.io/name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
```
GPU Alerts Configuration
```yaml
# gpu-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
    - name: gpu.rules
      interval: 30s
      rules:
        - alert: HighGPUMemoryUsage
          expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) * 100 > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High GPU memory usage on {{ $labels.node }}"
            description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has {{ $value }}% memory usage"
        - alert: HighGPUUtilization
          expr: DCGM_FI_DEV_GPU_UTIL > 95
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "High GPU utilization on {{ $labels.node }}"
            description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has {{ $value }}% utilization"
        - alert: GPUTemperatureHigh
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "GPU temperature high on {{ $labels.node }}"
            description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} temperature is {{ $value }}°C"
        - alert: GPUXIDErrors
          expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "GPU XID errors detected on {{ $labels.node }}"
            description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has XID errors"
        - alert: LowGPUUtilization
          expr: DCGM_FI_DEV_GPU_UTIL < 10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Low GPU utilization on {{ $labels.node }}"
            description: "GPU {{ $labels.gpu }} on node {{ $labels.node }} has only {{ $value }}% utilization"
```
Production Deployment Patterns
Best Practice 11: Multi-Tenant GPU Cluster
```yaml
# multi-tenant-setup.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-research
  labels:
    gpu-tier: "premium"
    cost-center: "research"
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-development
  labels:
    gpu-tier: "standard"
    cost-center: "engineering"
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-inference
  labels:
    gpu-tier: "shared"
    cost-center: "production"
---
# Research team gets dedicated A100 nodes
apiVersion: v1
kind: ResourceQuota
metadata:
  name: research-gpu-quota
  namespace: team-research
spec:
  hard:
    requests.nvidia.com/gpu: "16"
    limits.nvidia.com/gpu: "16"
    requests.cpu: "128"
    requests.memory: "512Gi"
---
# Development team gets mixed GPU access
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-gpu-quota
  namespace: team-development
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
    requests.cpu: "64"
    requests.memory: "256Gi"
---
# Inference team gets time-sliced GPUs
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: team-inference
spec:
  hard:
    requests.nvidia.com/gpu: "32"  # Higher due to time-slicing
    limits.nvidia.com/gpu: "32"
    requests.cpu: "128"
    requests.memory: "512Gi"
```
Best Practice 12: AutoScaling GPU Workloads
```yaml
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-inference-hpa
  namespace: team-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: gpu_utilization  # Requires a custom-metrics adapter exposing DCGM metrics
        target:
          type: AverageValue
          averageValue: "75"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
```
Best Practice 13: Cluster Autoscaling for GPU Nodes
```yaml
# cluster-autoscaler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-status
  namespace: kube-system
data:
  nodes.max: "100"
  nodes.min: "3"
  scale-down-delay-after-add: "10m"
  scale-down-unneeded-time: "5m"
  scale-down-gpu-utilization-threshold: "0.5"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler-gpu
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler-gpu
  template:
    metadata:
      labels:
        app: cluster-autoscaler-gpu
    spec:
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0  # k8s.gcr.io is deprecated
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws  # Adjust for your cloud provider
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/gpu-cluster
            - --balance-similar-node-groups
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=5m
            - --scale-down-gpu-utilization-threshold=0.5
          env:
            - name: AWS_REGION
              value: us-west-2
```
Troubleshooting and Optimization
Best Practice 14: Common Issues and Solutions
GPU Driver Issues
```bash
# Check GPU driver installation
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Verify GPU discovery
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'

# Check device plugin status
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset

# Verify the GPU is visible inside a pod
kubectl exec -it <pod-name> -- nvidia-smi
```
Resource Allocation Debugging
```yaml
# debug-gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-debug
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: gpu-debug
      image: nvcr.io/nvidia/cuda:12.3-runtime-ubuntu22.04
      command: ["/bin/bash"]
      args: ["-c", "while true; do nvidia-smi; sleep 30; done"]
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility,graphics"
```
Performance Optimization Script
```bash
#!/bin/bash
# gpu-optimization.sh
echo "=== GPU Cluster Optimization Report ==="

# Check GPU utilization across the cluster via the DCGM exporter pods
echo "GPU Utilization by Node:"
kubectl get pods -n gpu-operator -o wide | grep nvidia-dcgm-exporter | while read line; do
  POD=$(echo $line | awk '{print $1}')
  NODE=$(echo $line | awk '{print $7}')
  echo "Node: $NODE"
  kubectl exec -n gpu-operator $POD -- curl -s localhost:9400/metrics | grep "DCGM_FI_DEV_GPU_UTIL" | head -5
  echo "---"
done

# Check pending GPU pods
echo "Pending GPU Pods:"
kubectl get pods --all-namespaces -o wide | grep Pending | while read line; do
  NAMESPACE=$(echo $line | awk '{print $1}')
  POD=$(echo $line | awk '{print $2}')
  if kubectl describe pod $POD -n $NAMESPACE | grep -q "nvidia.com/gpu"; then
    echo "GPU Pod Pending: $NAMESPACE/$POD"
    kubectl describe pod $POD -n $NAMESPACE | grep -A 5 "Events:"
  fi
done

# Check GPU node capacity
echo "GPU Node Capacity:"
kubectl describe nodes | grep -A 5 -B 5 "nvidia.com/gpu"

# Check time-slicing configuration
echo "Time-Slicing Status:"
kubectl get configmap -n gpu-operator | grep time-slicing
```
Best Practice 15: Performance Tuning
GPU Memory Optimization
```yaml
# memory-optimized-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-optimized-training
spec:
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: 32Gi
      env:
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: "max_split_size_mb:128"
        - name: CUDA_LAUNCH_BLOCKING
          value: "0"
        - name: CUDA_CACHE_MAXSIZE
          value: "2147483647"
        - name: PYTHONUNBUFFERED
          value: "1"
      command: ["python"]
      args:
        - "-c"
        - |
          import torch
          import gc

          # Enable memory optimization
          torch.backends.cudnn.benchmark = True
          torch.backends.cudnn.deterministic = False

          # Use memory-efficient attention
          torch.backends.cuda.enable_flash_sdp(True)

          # Your training code with memory management
          device = torch.cuda.current_device()
          torch.cuda.set_per_process_memory_fraction(0.95, device)

          # Training loop with periodic cleanup
          for epoch in range(100):
              # Training code here
              if epoch % 10 == 0:
                  gc.collect()
                  torch.cuda.empty_cache()
```
Conclusion
Implementing proper GPU resource management in Kubernetes requires careful attention to hardware configuration, software setup, resource allocation, and monitoring. The best practices outlined in this guide provide a comprehensive framework for organizations to:
- Efficiently provision and manage NVIDIA GPUs using the GPU Operator
- Optimize resource utilization through MIG and time-slicing strategies
- Implement proper scheduling with node affinity and tolerations
- Monitor performance and health using DCGM and Prometheus
- Scale workloads effectively while maintaining cost control
As GPU acceleration continues to provide 10-100x performance improvements over CPU-only processing, following these validated best practices ensures your Kubernetes clusters can efficiently support demanding AI/ML workloads while maximizing hardware investments.
For the latest updates and community discussions, refer to the official Kubernetes GPU scheduling documentation and the NVIDIA GPU Operator documentation.