Understanding GPU Allocation in Kubernetes
Understanding how Kubernetes allocates GPUs to workloads is essential for anyone running AI/ML or high-performance computing applications. This guide walks through the mechanics of GPU allocation in Kubernetes, from the device plugin framework to the complete allocation flow.
Overview: The GPU Allocation Challenge
Traditional Kubernetes was designed for CPU and memory resources, which are divisible and can be shared easily. GPUs, however, present unique challenges:
- Indivisible Resources: By default, GPUs are allocated as whole units
- Vendor-Specific Drivers: Each GPU vendor requires specific drivers and runtime configurations
- Specialized Hardware: GPUs need vendor-specific initialization and setup
- Complex Runtime Integration: Container runtimes need special configuration to access GPU hardware
The Device Plugin Framework: Foundation of GPU Allocation
Kubernetes solves GPU allocation through the Device Plugin Framework, which allows vendors to extend Kubernetes without modifying core code.
Architecture Components
The framework involves several cooperating components: the vendor's device plugin running on each node, the kubelet that brokers device registration and allocation, the Kubernetes scheduler that matches pods to nodes, and the container runtime (together with the GPU drivers) that exposes the devices to containers.
Device Plugin Registration Process
When a GPU device plugin starts, it follows a specific registration process:
# Device Plugin DaemonSet Example
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        - name: DEVICE_LIST_STRATEGY
          value: "envvar"
        - name: DEVICE_ID_STRATEGY
          value: "uuid"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      nodeSelector:
        accelerator: nvidia
Step-by-Step GPU Allocation Process
Step 1: Device Discovery and Registration
The device plugin performs the following operations:
// Simplified Device Plugin Registration (Go code)
func (dp *NvidiaDevicePlugin) Register() error {
	// Connect to kubelet's device plugin socket
	conn, err := grpc.Dial(
		"unix:///var/lib/kubelet/device-plugins/kubelet.sock",
		grpc.WithInsecure(),
	)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Register with kubelet
	client := pluginapi.NewRegistrationClient(conn)
	request := &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     "nvidia.sock",
		ResourceName: "nvidia.com/gpu",
	}
	_, err = client.Register(context.Background(), request)
	return err
}
Step 2: Resource Advertisement
The device plugin advertises available GPUs to kubelet:
// ListAndWatch reports available devices
func (dp *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	// Send the initial device list
	response := &pluginapi.ListAndWatchResponse{
		Devices: dp.getDevices(),
	}
	if err := s.Send(response); err != nil {
		return err
	}
	// Continue monitoring for device health changes
	for {
		select {
		case <-dp.health:
			// Send an updated device list on health changes
			response := &pluginapi.ListAndWatchResponse{
				Devices: dp.getDevices(),
			}
			if err := s.Send(response); err != nil {
				return err
			}
		}
	}
}

func (dp *NvidiaDevicePlugin) getDevices() []*pluginapi.Device {
	var devices []*pluginapi.Device
	// Use NVML to discover GPUs
	count, err := nvml.DeviceGetCount()
	if err != nil {
		return devices
	}
	for i := 0; i < count; i++ {
		device, err := nvml.DeviceGetHandleByIndex(i)
		if err != nil {
			continue
		}
		uuid, err := device.GetUUID()
		if err != nil {
			continue
		}
		devices = append(devices, &pluginapi.Device{
			ID:     uuid,
			Health: pluginapi.Healthy,
		})
	}
	return devices
}
Step 3: Pod Specification and Scheduling
When a user creates a pod with GPU requirements:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  labels:
    app: ai-training
spec:
  containers:
  - name: training-container
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 2    # Request 2 GPUs
        memory: 16Gi
        cpu: 8
      requests:
        nvidia.com/gpu: 2    # Must equal limits for GPUs
        memory: 8Gi
        cpu: 4
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "all"
  nodeSelector:
    accelerator: nvidia-tesla-v100
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Important GPU Resource Constraints:
- GPUs must be specified in the limits section
- requests must equal limits if both are specified
- GPU resources are only allocated as integers (no fractional GPUs by default)
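These constraints can be encoded as a small admission-style check. A minimal sketch in Go, where validateGPUSpec is a hypothetical helper (not part of any Kubernetes API) and quantities arrive as plain strings:

```go
package main

import (
	"fmt"
	"strconv"
)

// validateGPUSpec encodes the constraints above: the limit must be a whole
// non-negative number, and the request, if set, must equal the limit.
func validateGPUSpec(request, limit string) error {
	lim, err := strconv.Atoi(limit)
	if err != nil {
		return fmt.Errorf("GPU limit %q is not an integer: fractional GPUs are not allowed by default", limit)
	}
	if lim < 0 {
		return fmt.Errorf("GPU limit %d must be non-negative", lim)
	}
	if request != "" && request != limit {
		return fmt.Errorf("GPU request %q must equal limit %q", request, limit)
	}
	return nil
}

func main() {
	fmt.Println(validateGPUSpec("2", "2") == nil)  // true: valid spec
	fmt.Println(validateGPUSpec("1", "2") == nil)  // false: request != limit
	fmt.Println(validateGPUSpec("", "0.5") == nil) // false: fractional GPU
}
```

In a real cluster this validation is performed by the API server and kubelet; the sketch only illustrates the rules.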
Step 4: Kubernetes Scheduler Decision
The scheduler performs resource matching:
# Scheduler evaluates nodes based on:
# 1. Available nvidia.com/gpu resources
# 2. Node selectors and affinity rules
# 3. Taints and tolerations
# 4. Resource constraints

# Node capacity after device plugin registration
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-tesla-v100
    gpu-count: "8"
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
status:
  capacity:
    nvidia.com/gpu: "8"    # Total GPUs on node
    cpu: "64"
    memory: "256Gi"
  allocatable:
    nvidia.com/gpu: "8"    # Available for scheduling
    cpu: "62"
    memory: "250Gi"
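The scheduler's resource-fit check for an extended resource reduces to comparing the requested count against each node's free count. A simplified sketch, where the map of free GPUs per node is a hypothetical stand-in for the scheduler's node snapshot:

```go
package main

import (
	"fmt"
	"sort"
)

// filterGPUNodes returns the nodes whose free nvidia.com/gpu count can
// satisfy the request, mimicking the scheduler's resource-fit predicate.
func filterGPUNodes(freeGPUs map[string]int, requested int) []string {
	var fits []string
	for node, free := range freeGPUs {
		if free >= requested {
			fits = append(fits, node)
		}
	}
	sort.Strings(fits) // deterministic order for display
	return fits
}

func main() {
	free := map[string]int{"gpu-node-1": 8, "gpu-node-2": 1, "cpu-node-1": 0}
	fmt.Println(filterGPUNodes(free, 2)) // [gpu-node-1]
}
```

The real scheduler also evaluates node selectors, affinity, and taints, as listed above; this sketch covers only the capacity check.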
Step 5: Kubelet Device Allocation
When kubelet needs to create a container with GPU resources:
// Allocate method called by kubelet
func (dp *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	var responses []*pluginapi.ContainerAllocateResponse
	for _, req := range reqs.ContainerRequests {
		response := &pluginapi.ContainerAllocateResponse{}
		// Device-specific preparations
		for _, deviceID := range req.DevicesIDs {
			// Add the device node to the container
			response.Devices = append(response.Devices, &pluginapi.DeviceSpec{
				ContainerPath: fmt.Sprintf("/dev/nvidia%d", getDeviceIndex(deviceID)),
				HostPath:      fmt.Sprintf("/dev/nvidia%d", getDeviceIndex(deviceID)),
				Permissions:   "rwm",
			})
		}
		// Set environment variables for the NVIDIA runtime
		response.Envs = map[string]string{
			"NVIDIA_VISIBLE_DEVICES":     strings.Join(req.DevicesIDs, ","),
			"NVIDIA_DRIVER_CAPABILITIES": "compute,utility",
		}
		// Mount driver directories
		response.Mounts = append(response.Mounts, &pluginapi.Mount{
			ContainerPath: "/usr/local/nvidia",
			HostPath:      "/usr/local/nvidia",
			ReadOnly:      true,
		})
		responses = append(responses, response)
	}
	return &pluginapi.AllocateResponse{
		ContainerResponses: responses,
	}, nil
}
Step 6: Container Runtime Integration
The container runtime (containerd/CRI-O) must be configured with GPU support:
# /etc/containerd/config.toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
Step 7: Final Container Creation
The complete flow results in a container with GPU access:
# Inside the container, GPUs are visible
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    54W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1F.0 Off |                    0 |
| N/A   34C    P0    53W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Advanced GPU Allocation Scenarios
Time-Slicing for GPU Sharing
Modern device plugins support GPU sharing through time-slicing:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4    # Each GPU appears as 4 shareable resources
---
# Apply the configuration to nodes via a label
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    nvidia.com/device-plugin.config: time-slicing
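Conceptually, time-slicing works by having the device plugin advertise each physical GPU multiple times in its device list. A simplified sketch of that expansion, where the "::&lt;n&gt;" suffix is illustrative rather than the plugin's actual ID scheme:

```go
package main

import "fmt"

// replicateDevices advertises each physical GPU `replicas` times under a
// distinct ID, so the scheduler sees N schedulable resources per GPU.
// Workloads then time-share the underlying hardware.
func replicateDevices(uuids []string, replicas int) []string {
	var out []string
	for _, u := range uuids {
		for i := 0; i < replicas; i++ {
			out = append(out, fmt.Sprintf("%s::%d", u, i))
		}
	}
	return out
}

func main() {
	ids := replicateDevices([]string{"GPU-aaaa", "GPU-bbbb"}, 4)
	fmt.Println(len(ids)) // 8: two physical GPUs × 4 replicas
}
```

Note that time-slicing provides no memory or fault isolation between the sharing workloads; it only multiplexes scheduling.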
Multi-Instance GPU (MIG) Support
For NVIDIA A100 GPUs, MIG enables hardware partitioning:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          1g.5gb: 7    # Create 7 MIG instances per GPU
      all-2g.10gb:
      - devices: all
        mig-enabled: true
        mig-devices:
          2g.10gb: 3    # Create 3 larger MIG instances per GPU
---
# Pod requesting a MIG instance
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tensorflow:23.02-tf2-py3
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1    # Request a specific MIG slice
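The capacity arithmetic behind these profiles is straightforward: each profile yields a fixed number of instances per GPU, multiplied by the GPUs on the node. A small sketch (migCapacity is a hypothetical helper for illustration):

```go
package main

import "fmt"

// migCapacity computes how many schedulable MIG resources a node advertises
// for a given profile: instances per GPU × number of GPUs.
func migCapacity(gpus, instancesPerGPU int) int {
	return gpus * instancesPerGPU
}

func main() {
	fmt.Println(migCapacity(8, 7)) // all-1g.5gb on an 8-GPU node: 56 slices
	fmt.Println(migCapacity(8, 3)) // all-2g.10gb on an 8-GPU node: 24 slices
}
```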
Dynamic Resource Allocation (DRA) – Future Direction
DRA represents the next evolution of GPU allocation:
# DRA enables more flexible resource allocation
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: nvidia-gpu
      selectors:
      - cel:
          expression: 'device.attributes["memory"] >= 16000'    # 16GB minimum
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-workload
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: gpu-claim
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      claims:
      - name: gpu-claim
GPU Allocation State Management
Health Monitoring and Recovery
The device plugin continuously monitors GPU health:
func (dp *NvidiaDevicePlugin) healthCheck() {
	for {
		select {
		case <-time.After(30 * time.Second):
			unhealthyDevices := dp.checkDeviceHealth()
			if len(unhealthyDevices) > 0 {
				dp.notifyUnhealthyDevices(unhealthyDevices)
			}
		case <-dp.stop:
			return
		}
	}
}

func (dp *NvidiaDevicePlugin) checkDeviceHealth() []string {
	var unhealthy []string
	count, err := nvml.DeviceGetCount()
	if err != nil {
		return unhealthy
	}
	for i := 0; i < count; i++ {
		device, err := nvml.DeviceGetHandleByIndex(i)
		if err != nil {
			continue
		}
		// Check various health indicators
		if _, err := device.GetMemoryInfo(); err != nil {
			uuid, _ := device.GetUUID()
			unhealthy = append(unhealthy, uuid)
			continue
		}
		// Additional health checks...
		temperature, err := device.GetTemperature(nvml.TEMPERATURE_GPU)
		if err != nil || temperature > 90 { // Temperature threshold
			uuid, _ := device.GetUUID()
			unhealthy = append(unhealthy, uuid)
		}
	}
	return unhealthy
}
Resource Cleanup and Deallocation
When a pod is deleted, resources are automatically freed:
# Pod deletion triggers cleanup
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  finalizers:
  - gpu-cleanup.example.com/finalizer
spec:
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1
---
# On pod deletion, the kubelet and device plugin:
# 1. Receive notification of the pod deletion
# 2. Release the pod's GPUs from the allocation list
# 3. Make those GPUs available for new allocations
# 4. Update the node's allocatable resource information
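The bookkeeping behind this cleanup can be sketched as a free pool plus a per-pod allocation map. This is a hypothetical simplification, not kubelet's actual DeviceManager types:

```go
package main

import "fmt"

// allocator tracks which devices are free and which pod holds which devices.
type allocator struct {
	free      map[string]bool     // device ID -> available
	allocated map[string][]string // pod UID -> device IDs held
}

// release returns a deleted pod's devices to the free pool so they can be
// handed out to new allocations.
func (a *allocator) release(podUID string) {
	for _, id := range a.allocated[podUID] {
		a.free[id] = true
	}
	delete(a.allocated, podUID)
}

func main() {
	a := &allocator{
		free:      map[string]bool{"GPU-cccc": true},
		allocated: map[string][]string{"pod-1": {"GPU-aaaa", "GPU-bbbb"}},
	}
	a.release("pod-1")
	fmt.Println(len(a.free), len(a.allocated)) // 3 0
}
```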
Troubleshooting GPU Allocation Issues
Common Allocation Problems
# Check node GPU capacity
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
# Check device plugin status
kubectl get pods -n kube-system | grep nvidia-device-plugin
# Check device plugin logs
kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx
# Verify GPU drivers on node
kubectl exec -it debug-pod -- nvidia-smi
# Check the container runtime configuration (run this on the node itself;
# the host's runtime config is not visible from inside a pod)
cat /etc/containerd/config.toml
Debugging Allocation Failures
# Debug pod for GPU troubleshooting
apiVersion: v1
kind: Pod
metadata:
  name: gpu-debug
spec:
  containers:
  - name: debug
    image: nvidia/cuda:11.8.0-devel-ubuntu22.04
    command: ["/bin/bash", "-c", "sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: "compute,utility"
  nodeSelector:
    accelerator: nvidia
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Performance Optimization
# Optimize GPU allocation with node affinity
apiVersion: v1
kind: Pod
metadata:
  name: optimized-gpu-workload
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-memory
            operator: Gt
            values: ["32000"]    # Require >32GB GPU memory
          - key: gpu-generation
            operator: In
            values: ["ampere", "hopper"]    # Modern GPU architectures
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 8
        cpu: 32
        memory: 128Gi
    env:
    - name: NCCL_DEBUG
      value: "INFO"
    - name: CUDA_DEVICE_ORDER
      value: "PCI_BUS_ID"
Best Practices for GPU Allocation
1. Resource Planning
# Use resource quotas to manage GPU allocation
# (extended resources are quota'd with the requests. prefix)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-workloads
spec:
  hard:
    requests.nvidia.com/gpu: "16"    # Limit total GPU usage
    requests.memory: "512Gi"
    requests.cpu: "128"
2. Monitoring and Alerting
# Prometheus rule for GPU utilization
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
spec:
  groups:
  - name: gpu
    rules:
    - alert: LowGPUUtilization
      expr: DCGM_FI_DEV_GPU_UTIL < 20
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "GPU utilization is low"
        description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} has utilization below 20% for 10 minutes"
3. Security Considerations
# Secure GPU workload following the restricted Pod Security Standard
# (the pod-security.kubernetes.io/enforce label is applied to the
# namespace, not to individual pods)
apiVersion: v1
kind: Pod
metadata:
  name: secure-gpu-workload
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: secure-training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: 8Gi
        cpu: 4
Conclusion
GPU allocation in Kubernetes is a sophisticated process involving multiple components working in harmony:
- Device Plugin Framework provides the foundation for GPU discovery and allocation
- Registration Process establishes communication between device plugins and kubelet
- Resource Advertisement makes GPUs available as schedulable resources
- Scheduler Integration matches workloads with appropriate GPU-enabled nodes
- Runtime Configuration ensures containers can access allocated GPUs
- Health Monitoring maintains resource availability and reliability
Understanding this mechanism is crucial for:
- Optimizing GPU utilization in Kubernetes clusters
- Troubleshooting allocation issues effectively
- Implementing advanced features like GPU sharing and dynamic allocation
- Ensuring security and reliability of GPU workloads
As Kubernetes continues to evolve, features like Dynamic Resource Allocation (DRA) will provide even more flexibility in GPU resource management, making it easier to efficiently utilize these expensive and powerful resources in cloud-native environments.
Key Takeaways:
- GPU allocation relies on the device plugin framework for vendor extensibility
- The process involves kubelet, device plugins, container runtime, and GPU drivers
- Proper configuration at each layer is essential for successful GPU allocation
- Modern features like time-slicing and MIG enable better resource utilization
- Monitoring and security considerations are crucial for production deployments