Understanding GPU Scheduling in Kubernetes
As artificial intelligence and machine learning workloads continue to dominate enterprise computing, Kubernetes has emerged as the de facto platform for orchestrating GPU-accelerated applications. With ‘Kubernetes AI’ experiencing a 300% increase in search volume in 2025 and 48% of organizations now running AI/ML workloads on Kubernetes, understanding GPU scheduling and resource management has become critical for DevOps engineers, platform teams, and ML practitioners.
This comprehensive guide explores the evolution of GPU support in Kubernetes, from basic device plugins to advanced Dynamic Resource Allocation (DRA), covering practical implementations, optimization strategies, and real-world patterns that organizations are using to maximize their GPU infrastructure investment.
1. The GPU Revolution in Kubernetes
1.1 Why GPUs Matter for Modern Workloads
The explosion of AI/ML workloads has fundamentally transformed how organizations approach infrastructure. GPUs provide 10-100x performance improvements over CPU-only processing for specific workloads, making them indispensable for:
- Large Language Model (LLM) training and inference
- Computer vision and image processing pipelines
- Real-time recommendation systems
- Scientific computing and simulations
- Video transcoding and rendering
1.2 Current State of GPU Adoption
According to industry reports, the state of GPU adoption in Kubernetes has reached critical mass:
| Metric | Value (2025) |
|---|---|
| Organizations using K8s for AI/ML | 48% |
| Expected AI workload growth (12 months) | 90% |
| Edge K8s in production | 50% |
| GPU acceleration performance gain | 10-100x |
2. Understanding GPU Architecture in Kubernetes
2.1 The Device Plugin Framework
Kubernetes uses a device plugin framework to expose specialized hardware resources like GPUs to pods. The architecture consists of several key components that work together to enable GPU scheduling.
Core Components
Device Plugin: A gRPC server that runs on each node and advertises GPU resources to the kubelet. NVIDIA’s k8s-device-plugin is the reference implementation.
Kubelet: Manages device allocation at the node level, maintaining a socket connection with device plugins and tracking available resources.
Scheduler: Makes pod placement decisions based on GPU resource requests and node availability.
Container Runtime: Configures containers to access allocated GPU devices through NVIDIA Container Toolkit integration.
2.2 Resource Model
GPUs are exposed as extended resources in Kubernetes using the nvidia.com/gpu resource type. Here’s how resource requests work:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU
```
Key characteristics of the GPU resource model:
- GPUs are non-compressible resources (cannot be overcommitted)
- Requests must equal limits for GPU resources
- GPUs are allocated as whole units by default
- Memory and compute are tied to the allocated GPU
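The second characteristic can be made concrete. A minimal sketch showing both fields set explicitly (pod name and GPU count are illustrative); for extended resources like nvidia.com/gpu, the request must equal the limit, or the request may simply be omitted, in which case it defaults to the limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-explicit
spec:
  containers:
  - name: worker
    image: nvcr.io/nvidia/cuda:12.0-base
    command: ["nvidia-smi"]
    resources:
      requests:
        nvidia.com/gpu: 2  # must match the limit exactly
      limits:
        nvidia.com/gpu: 2
```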
3. NVIDIA GPU Operator Deep Dive
3.1 What is the GPU Operator?
The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs in Kubernetes. Instead of manually installing drivers, container runtime, and device plugins, the operator handles everything through Kubernetes-native resources.
3.2 Architecture Components
| Component | Purpose |
|---|---|
| NVIDIA Driver | Kernel module for GPU hardware communication |
| Container Toolkit | Enables containers to access GPU devices |
| Device Plugin | Advertises GPUs to the Kubernetes scheduler |
| DCGM Exporter | Exports GPU metrics to Prometheus |
| GPU Feature Discovery | Labels nodes with GPU properties |
| MIG Manager | Manages Multi-Instance GPU partitioning |
3.3 Installation Guide
Install the GPU Operator using Helm:
```bash
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set gfd.enabled=true
```
3.4 Verifying Installation
After installation, verify that all components are running:
```bash
# Check GPU Operator pods
kubectl get pods -n gpu-operator

# Verify GPU resources are advertised
kubectl describe nodes | grep nvidia.com/gpu

# Run a test workload (note: recent kubectl releases removed the --limits
# flag from `kubectl run`; apply an equivalent pod manifest if unavailable)
kubectl run gpu-test --image=nvcr.io/nvidia/cuda:12.0-base \
  --restart=Never --rm -it \
  --limits=nvidia.com/gpu=1 -- nvidia-smi
```
4. GPU Scheduling Mechanisms
4.1 Default Scheduling Behavior
By default, Kubernetes schedules GPU workloads based on simple resource availability. The scheduler ensures that the requested nvidia.com/gpu count is available on the target node, but it doesn’t consider GPU topology, memory, or compute capability.
4.2 Topology-Aware Scheduling
For multi-GPU workloads, topology awareness is critical for performance: GPUs connected via NVLink or sharing the same PCIe switch communicate far faster than GPUs on separate root complexes. Topology-aware placement combines the kubelet's Topology Manager with the device plugin's configuration, which controls how GPUs are advertised to the scheduler (including MIG strategy and sharing):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: topology-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Allow 4 time-sliced shares per GPU
    flags:
      migStrategy: mixed
      failOnInitError: true
```
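For NUMA alignment between GPUs, CPUs, and NICs, the kubelet's Topology Manager must also be enabled on GPU nodes. A minimal sketch of the relevant KubeletConfiguration fields (policy values shown are illustrative; `single-numa-node` is the strictest option and will reject pods that cannot be aligned):

```yaml
# KubeletConfiguration snippet for GPU nodes: aligns device and CPU
# allocations to the same NUMA node at pod granularity
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
```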
4.3 Node Affinity and GPU Selection
Use node selectors and affinity rules to target specific GPU types:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: a100-workload
spec:
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.compute.major
            operator: Gt
            values: ["7"]  # compute capability 8.x (Ampere) or newer
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 4
```
4.4 GPU Feature Discovery Labels
GPU Feature Discovery (GFD) automatically labels nodes with GPU properties. These labels enable sophisticated scheduling decisions:
| Label | Example Value |
|---|---|
| nvidia.com/gpu.product | NVIDIA-A100-SXM4-80GB |
| nvidia.com/gpu.memory | 81920 |
| nvidia.com/gpu.compute.major | 8 |
| nvidia.com/mig.capable | true |
5. Dynamic Resource Allocation (DRA)
5.1 Introduction to DRA
Dynamic Resource Allocation (DRA) represents the future of GPU resource management in Kubernetes. Introduced as alpha in Kubernetes 1.26, reworked around structured parameters (the resource.k8s.io/v1alpha3 API used below) in 1.31, and promoted to beta in 1.32, DRA provides a more flexible and powerful way to allocate specialized hardware resources.
5.2 Key Benefits Over Device Plugins
- Structured Parameters: DRA uses CEL expressions for precise device selection
- Claim-Based Model: Resources are claimed explicitly, improving tracking
- Network Preparation: Allows pre-allocation setup for complex resources
- Multiple Claims per Pod: Pods can request different GPU types
- Admin Controls: DeviceClass allows cluster-wide policies
5.3 DRA Implementation Example
Here’s a complete example of using DRA for GPU allocation:
```yaml
# DeviceClass defines available GPU types
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      expression: 'device.driver == "nvidia.com"'
---
# ResourceClaim requests a specific GPU
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: nvidia-gpu
      selectors:
      - cel:
          expression: 'device.attributes["compute.major"] >= 8'
---
# Pod references the claim
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: gpu-claim
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      claims:
      - name: gpu-claim
```
5.4 DRA vs Device Plugin Comparison
| Feature | Device Plugin | DRA |
|---|---|---|
| Device Selection | Count only | CEL expressions |
| Resource Visibility | Node capacity | Claim objects |
| Preparation | None | Network setup supported |
| Maturity | Stable | Beta (since 1.32) |
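A standalone ResourceClaim is shared by whichever pods reference it; for replicated workloads, each pod usually needs its own claim, generated from a ResourceClaimTemplate. A hedged sketch, reusing the nvidia-gpu DeviceClass from the example above (template and pod names are illustrative):

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: nvidia-gpu
---
# Each pod created with this spec gets its own generated ResourceClaim
apiVersion: v1
kind: Pod
metadata:
  name: dra-templated-pod
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: gpu-claim-template
  containers:
  - name: worker
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      claims:
      - name: gpu
```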
6. Fractional GPU Sharing Strategies
6.1 Why Share GPUs?
GPU utilization in many inference workloads averages only 10-30%. Sharing GPUs across multiple workloads can significantly reduce costs while maintaining acceptable performance for non-latency-critical applications.
6.2 Time-Slicing
Time-slicing allows multiple pods to share a GPU by rapidly switching between them. Configure time-slicing through the device plugin ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Each GPU appears as 4 resources
```
After applying this configuration, each physical GPU is advertised as 4 schedulable resources (renamed to nvidia.com/gpu.shared, because renameByDefault is true). A pod requesting one share may run alongside up to 3 other pods on the same physical device; note that time-slicing provides no memory or fault isolation between the sharing pods.
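A pod then requests one of the time-sliced shares by name. A minimal sketch (the image is a placeholder; the nvidia.com/gpu.shared resource name is what the device plugin advertises when renameByDefault is true):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: inference
    image: my-inference-app:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu.shared: 1  # one time-sliced share of a physical GPU
```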
6.3 GPU Memory Limits
For tighter control, CUDA Multi-Process Service (MPS) can cap a client's share of the GPU's streaming multiprocessors (and, via CUDA_MPS_PINNED_DEVICE_MEM_LIMIT, its device memory) through environment variables:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-limited-gpu
spec:
  containers:
  - name: inference
    image: my-inference-app:latest
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0"
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE  # requires a running MPS control daemon
      value: "25"  # Limit to 25% of the GPU's SMs
    resources:
      limits:
        nvidia.com/gpu: 1
```
6.4 vGPU (Virtual GPU)
NVIDIA vGPU provides hardware-level isolation for GPU sharing. It requires a licensed vGPU software stack but offers stronger isolation guarantees than time-slicing.
7. Multi-Instance GPU (MIG)
7.1 Understanding MIG
Multi-Instance GPU (MIG) is a feature available on NVIDIA A100, A30, and H100 GPUs that enables hardware-level partitioning of a single GPU into multiple isolated instances. Each instance has dedicated compute resources, memory bandwidth, and L2 cache.
7.2 MIG Profiles
A100 80GB supports various MIG configurations:
| Profile | Memory | SM Count | Max Instances |
|---|---|---|---|
| 1g.10gb | 10 GB | 14 | 7 |
| 2g.20gb | 20 GB | 28 | 3 |
| 3g.40gb | 40 GB | 42 | 2 |
| 7g.80gb | 80 GB | 98 | 1 |
7.3 Enabling MIG in Kubernetes
```bash
# Label nodes with the desired MIG configuration
kubectl label nodes gpu-node-1 nvidia.com/mig.config=all-1g.10gb
```

```yaml
# MIG ConfigMap consumed by the MIG Manager
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "1g.10gb": 7
      all-3g.40gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          "3g.40gb": 2
```
7.4 Requesting MIG Devices
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:23.10-py3
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # Request a 1g.10gb MIG instance
```
8. GPU Monitoring & Observability
8.1 DCGM Exporter
The NVIDIA Data Center GPU Manager (DCGM) Exporter provides comprehensive GPU metrics for Prometheus. It’s automatically deployed by the GPU Operator.
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
        ports:
        - name: metrics
          containerPort: 9400
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```
8.2 Key Metrics to Monitor
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization percentage |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy engine utilization |
| DCGM_FI_DEV_FB_USED | Framebuffer (GPU memory) used |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature in Celsius |
| DCGM_FI_DEV_POWER_USAGE | Current power draw in Watts |
| DCGM_FI_DEV_SM_CLOCK | Streaming multiprocessor clock speed |
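These metrics also feed alerting. A hedged sketch of alert rules on two of them, assuming the Prometheus Operator's PrometheusRule CRD is installed (rule names and thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu.rules
    rules:
    - alert: GpuHighTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU temperature above 85C for 5 minutes"
    - alert: GpuLowUtilization
      expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 10
      for: 6h
      labels:
        severity: info
      annotations:
        summary: "GPU under 10% utilization for 6 hours; candidate for sharing"
```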
8.3 Grafana Dashboard Example
Create a comprehensive GPU monitoring dashboard with these PromQL queries:
```promql
# GPU utilization per pod
sum by (pod, GPU_I_ID) (
  DCGM_FI_DEV_GPU_UTIL{namespace="$namespace"}
)

# GPU memory usage (% of framebuffer)
sum by (pod, GPU_I_ID) (
  DCGM_FI_DEV_FB_USED{namespace="$namespace"}
) / sum by (pod, GPU_I_ID) (
  DCGM_FI_DEV_FB_FREE{namespace="$namespace"} +
  DCGM_FI_DEV_FB_USED{namespace="$namespace"}
) * 100

# Graphics engine activity per Watt (a proxy for power efficiency;
# DCGM_FI_PROF_GR_ENGINE_ACTIVE is a 0-1 gauge, so average it rather than rate it)
sum by (node) (avg_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[5m])) /
sum by (node) (DCGM_FI_DEV_POWER_USAGE)
```
9. Cost Optimization Strategies
9.1 Right-Sizing GPU Workloads
GPU costs can quickly spiral out of control without proper management. Here are proven strategies for optimization:
- Profile workloads to understand actual GPU utilization patterns
- Use MIG for inference workloads that don’t need full GPU
- Implement time-slicing for batch processing
- Consider spot/preemptible instances for fault-tolerant training
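The last strategy can be sketched as a Job steered onto spot GPU capacity. This is a hedged example: the `karpenter.sh/capacity-type: spot` label assumes Karpenter-provisioned nodes (other provisioners use different labels), and retries via `backoffLimit` only make sense for training code that checkpoints and resumes:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-training
spec:
  backoffLimit: 4  # retry after spot interruptions
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/capacity-type: spot  # provisioner-specific label
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: train
        image: nvcr.io/nvidia/pytorch:23.10-py3
        resources:
          limits:
            nvidia.com/gpu: 1
```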
9.2 Cluster Autoscaling for GPUs
Configure Karpenter for intelligent GPU node provisioning:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
      - key: "karpenter.k8s.aws/instance-category"
        operator: In
        values: ["p", "g"]  # P and G series GPU instances
      - key: "karpenter.k8s.aws/instance-gpu-count"
        operator: Gt
        values: ["0"]
      - key: "kubernetes.io/arch"
        operator: In
        values: ["amd64"]
      nodeClassRef:
        name: gpu-nodes
  limits:
    cpu: 1000
    nvidia.com/gpu: 100
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```
9.3 GPU Cost Attribution with Kubecost
Track GPU costs per namespace, team, or application:
```bash
# Install Kubecost with GPU cost tracking
helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace \
  --set kubecostProductConfigs.gpuCostEnabled=true \
  --set prometheus.server.global.external_labels.cluster_id=prod-gpu
```
10. Production Best Practices
10.1 Resource Quotas for GPU
Implement quotas to prevent GPU resource exhaustion:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
    persistentvolumeclaims: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-training
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: "4"
    default:
      nvidia.com/gpu: "1"
```
10.2 Pod Priority and Preemption
Define priority classes to ensure critical GPU workloads get resources:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-critical
value: 1000000
globalDefault: false
description: "Critical GPU training jobs"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-batch
value: 100000
globalDefault: false
description: "Batch inference workloads"
preemptionPolicy: Never
```
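Workloads opt in to a priority class by name. A minimal sketch (pod name, image, and GPU count are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-training
spec:
  priorityClassName: gpu-critical  # may preempt lower-priority GPU pods
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 2
```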
10.3 Health Checks for GPU Pods
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-app
spec:
  containers:
  - name: app
    image: my-gpu-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    livenessProbe:
      exec:
        command:
        - nvidia-smi
        - --query-gpu=gpu_name
        - --format=csv,noheader
      initialDelaySeconds: 30
      periodSeconds: 60
    readinessProbe:
      exec:
        command:
        - python
        - -c
        - "import torch; assert torch.cuda.is_available()"
      initialDelaySeconds: 10
      periodSeconds: 10
```
10.4 Security Considerations
- Run GPU pods with non-root users where possible
- Use Pod Security Standards to restrict device access
- Implement network policies for GPU workload isolation
- Regularly update GPU drivers and container toolkit
- Monitor for GPU-specific vulnerabilities (e.g., side-channel attacks)
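The first two recommendations can be sketched as a restrictive securityContext. This is a hedged example with a placeholder image; the container toolkit injects the GPU device files, so the application container usually does not need root, though some CUDA tooling may require adjustments:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-gpu-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000  # illustrative non-root UID
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: my-gpu-app:latest  # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    resources:
      limits:
        nvidia.com/gpu: 1
```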
Conclusion
GPU scheduling and resource management in Kubernetes has evolved dramatically, transforming from simple device counting to sophisticated allocation mechanisms like DRA and MIG. As AI/ML workloads continue to dominate enterprise computing, mastering these concepts becomes essential for platform engineers and DevOps teams.
Key takeaways from this guide:
- The NVIDIA GPU Operator simplifies deployment but requires understanding of underlying components
- Dynamic Resource Allocation (DRA) represents the future of GPU scheduling with superior flexibility
- GPU sharing strategies (time-slicing, MIG) can significantly reduce costs for appropriate workloads
- Comprehensive monitoring with DCGM is essential for optimization and troubleshooting
- Production deployments require careful attention to quotas, priorities, and security
Start with the basics—deploy the GPU Operator, verify your workloads, and progressively implement advanced features as your requirements evolve. The convergence of Kubernetes orchestration and GPU acceleration will continue to unlock unprecedented possibilities for machine learning initiatives.