Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

GPU Scheduling in Kubernetes: A Complete Guide


Understanding GPU Scheduling in Kubernetes

As artificial intelligence and machine learning workloads continue to dominate enterprise computing, Kubernetes has emerged as the de facto platform for orchestrating GPU-accelerated applications. With ‘Kubernetes AI’ experiencing a 300% increase in search volume in 2025 and 48% of organizations now running AI/ML workloads on Kubernetes, understanding GPU scheduling and resource management has become critical for DevOps engineers, platform teams, and ML practitioners.

This comprehensive guide explores the evolution of GPU support in Kubernetes, from basic device plugins to advanced Dynamic Resource Allocation (DRA), covering practical implementations, optimization strategies, and real-world patterns that organizations are using to maximize their GPU infrastructure investment.

1. The GPU Revolution in Kubernetes

1.1 Why GPUs Matter for Modern Workloads

The explosion of AI/ML workloads has fundamentally transformed how organizations approach infrastructure. GPUs provide 10-100x performance improvements over CPU-only processing for specific workloads, making them indispensable for:

  • Large Language Model (LLM) training and inference
  • Computer vision and image processing pipelines
  • Real-time recommendation systems
  • Scientific computing and simulations
  • Video transcoding and rendering

1.2 Current State of GPU Adoption

According to industry reports, the state of GPU adoption in Kubernetes has reached critical mass:

| Metric | Value (2025) |
| --- | --- |
| Organizations using K8s for AI/ML | 48% |
| Expected AI workload growth (12 months) | 90% |
| Edge K8s in production | 50% |
| GPU acceleration performance gain | 10-100x |

2. Understanding GPU Architecture in Kubernetes

2.1 The Device Plugin Framework

Kubernetes uses a device plugin framework to expose specialized hardware resources like GPUs to pods. The architecture consists of several key components that work together to enable GPU scheduling.

Core Components

Device Plugin: A gRPC server that runs on each node and advertises GPU resources to the kubelet. NVIDIA’s k8s-device-plugin is the reference implementation.

Kubelet: Manages device allocation at the node level, maintaining a socket connection with device plugins and tracking available resources.

Scheduler: Makes pod placement decisions based on GPU resource requests and node availability.

Container Runtime: Configures containers to access allocated GPU devices through NVIDIA Container Toolkit integration.
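The kubelet's device bookkeeping can be pictured with a small sketch. This is illustrative Python, not kubelet source: a device plugin advertises device IDs, and the kubelet hands out whole devices and returns them to the free pool when the pod terminates.

```python
# Illustrative sketch of per-node GPU bookkeeping, NOT kubelet source code:
# devices advertised by the plugin are tracked as a free set and allocated
# as whole IDs -- never as fractions.
class GpuAllocator:
    def __init__(self, device_ids):
        self.free = set(device_ids)   # advertised but unallocated devices
        self.assigned = {}            # pod name -> list of device IDs

    def allocate(self, pod, count):
        if count > len(self.free):
            raise RuntimeError("insufficient nvidia.com/gpu on node")
        devices = sorted(self.free)[:count]
        self.free -= set(devices)
        self.assigned[pod] = devices
        return devices

    def release(self, pod):
        self.free |= set(self.assigned.pop(pod, []))

alloc = GpuAllocator(["GPU-0", "GPU-1", "GPU-2", "GPU-3"])
print(alloc.allocate("trainer", 2))   # two whole devices are handed out
```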

2.2 Resource Model

GPUs are exposed as extended resources in Kubernetes using the nvidia.com/gpu resource type. Here’s how resource requests work:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU
    command: ["nvidia-smi"]

Key characteristics of the GPU resource model:

  • GPUs are non-compressible resources (cannot be overcommitted)
  • Requests must equal limits for GPU resources
  • GPUs are allocated as whole units by default
  • Memory and compute are tied to the allocated GPU
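These rules can be captured in a short validation sketch. This is illustrative only; the real checks live in the Kubernetes API server.

```python
# Illustrative check mirroring the rules above for extended resources such
# as nvidia.com/gpu -- a sketch, not the actual API server validation.
def valid_gpu_spec(resources):
    req = resources.get("requests", {}).get("nvidia.com/gpu")
    lim = resources.get("limits", {}).get("nvidia.com/gpu")
    if lim is None:
        return req is None                       # GPUs must appear via limits
    if req is not None and req != lim:
        return False                             # requests must equal limits
    return isinstance(lim, int) and lim >= 0     # whole units only

valid_gpu_spec({"limits": {"nvidia.com/gpu": 1}})        # valid
valid_gpu_spec({"requests": {"nvidia.com/gpu": 1},
                "limits": {"nvidia.com/gpu": 2}})        # invalid: req != lim
```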

3. NVIDIA GPU Operator Deep Dive

3.1 What is the GPU Operator?

The NVIDIA GPU Operator automates the management of all NVIDIA software components needed to provision GPUs in Kubernetes. Instead of manually installing drivers, container runtime, and device plugins, the operator handles everything through Kubernetes-native resources.

3.2 Architecture Components

| Component | Purpose |
| --- | --- |
| NVIDIA Driver | Kernel module for GPU hardware communication |
| Container Toolkit | Enables containers to access GPU devices |
| Device Plugin | Advertises GPUs to Kubernetes scheduler |
| DCGM Exporter | Exports GPU metrics to Prometheus |
| GPU Feature Discovery | Labels nodes with GPU properties |
| MIG Manager | Manages Multi-Instance GPU partitioning |

3.3 Installation Guide

Install the GPU Operator using Helm:

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
 
# Install the GPU Operator
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set gfd.enabled=true

3.4 Verifying Installation

After installation, verify that all components are running:

# Check GPU Operator pods
kubectl get pods -n gpu-operator
 
# Verify GPU resources are advertised
kubectl describe nodes | grep nvidia.com/gpu
 
# Run a test workload (kubectl run dropped the old --limits flag,
# so pass the GPU limit through --overrides)
kubectl run gpu-test --image=nvcr.io/nvidia/cuda:12.0-base \
  --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvcr.io/nvidia/cuda:12.0-base","args":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
kubectl logs gpu-test && kubectl delete pod gpu-test

4. GPU Scheduling Mechanisms

4.1 Default Scheduling Behavior

By default, Kubernetes schedules GPU workloads based on simple resource availability. The scheduler ensures that the requested nvidia.com/gpu count is available on the target node, but it doesn’t consider GPU topology, memory, or compute capability.
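The filter step amounts to a count check. The sketch below uses made-up node numbers to show why the default behavior can be too coarse:

```python
# Default-scheduler-style feasibility check (sketch): a node passes only if
# its free nvidia.com/gpu count covers the request. Topology, GPU memory,
# and compute capability play no part in this decision.
nodes = {                 # hypothetical nodes: (allocatable, already allocated)
    "node-a": (8, 6),
    "node-b": (4, 1),
}

def fits(allocatable, allocated, requested):
    return requested <= allocatable - allocated

feasible = [n for n, (cap, used) in nodes.items() if fits(cap, used, 3)]
print(feasible)   # only node-b has 3 free GPUs
```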

4.2 Topology-Aware Scheduling

For multi-GPU workloads, topology awareness is critical for performance: GPUs connected via NVLink or within the same PCIe tree communicate far faster than GPUs attached to different sockets. The kubelet’s Topology Manager can align GPU, CPU, and NIC allocations to the same NUMA node:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node  # or best-effort / restricted
topologyManagerScope: pod                # align all containers in the pod

4.3 Node Affinity and GPU Selection

Use node selectors and affinity rules to target specific GPU types:

apiVersion: v1
kind: Pod
metadata:
  name: a100-workload
spec:
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.compute.major
            operator: Gt
            values: ["7"]  # Ampere or newer
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      limits:
        nvidia.com/gpu: 4

4.4 GPU Feature Discovery Labels

GPU Feature Discovery (GFD) automatically labels nodes with GPU properties. These labels enable sophisticated scheduling decisions:

| Label | Example Value |
| --- | --- |
| nvidia.com/gpu.product | NVIDIA-A100-SXM4-80GB |
| nvidia.com/gpu.memory | 81920 |
| nvidia.com/gpu.compute.major | 8 |
| nvidia.com/mig.capable | true |
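The affinity rule in section 4.3 reduces to a filter over these labels. A toy sketch with hypothetical node data:

```python
# Hypothetical GFD label data for two nodes; a nodeSelector/affinity rule
# over GFD labels narrows candidates exactly like this filter does.
nodes = {
    "gpu-node-1": {"nvidia.com/gpu.product": "NVIDIA-A100-SXM4-80GB",
                   "nvidia.com/gpu.compute.major": "8"},
    "gpu-node-2": {"nvidia.com/gpu.product": "Tesla-V100-SXM2-16GB",
                   "nvidia.com/gpu.compute.major": "7"},
}

def ampere_or_newer(labels):
    # Label values are strings, so convert before comparing numerically
    return int(labels["nvidia.com/gpu.compute.major"]) >= 8

candidates = [name for name, labels in nodes.items() if ampere_or_newer(labels)]
print(candidates)   # only the A100 node qualifies
```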

5. Dynamic Resource Allocation (DRA)

5.1 Introduction to DRA

Dynamic Resource Allocation (DRA) represents the future of GPU resource management in Kubernetes. Introduced as alpha in Kubernetes 1.26, redesigned around structured parameters in 1.30, and promoted to beta in 1.32, DRA provides a more flexible and powerful way to allocate specialized hardware resources.

5.2 Key Benefits Over Device Plugins

  1. Structured Parameters: DRA uses CEL expressions for precise device selection
  2. Claim-Based Model: Resources are claimed explicitly, improving tracking
  3. Network Preparation: Allows pre-allocation setup for complex resources
  4. Multiple Claims per Pod: Pods can request different GPU types
  5. Admin Controls: DeviceClass allows cluster-wide policies
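The claim-matching idea behind structured parameters can be shown with a toy stand-in. Real CEL expressions are evaluated by the scheduler and driver, not in Python, and the device attribute names here are illustrative:

```python
# Toy stand-in for DRA's CEL selectors: each device carries attributes and
# a claim matches when every selector predicate holds. Attribute names and
# values are hypothetical, chosen to mirror the example in section 5.3.
devices = [
    {"driver": "nvidia.com", "compute.major": 8, "memoryGiB": 80},
    {"driver": "nvidia.com", "compute.major": 7, "memoryGiB": 16},
]

predicates = [
    lambda d: d["driver"] == "nvidia.com",   # DeviceClass-level selector
    lambda d: d["compute.major"] >= 8,       # claim-level selector
]

matching = [d for d in devices if all(p(d) for p in predicates)]
print(matching)   # only the compute-capability-8 device matches
```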

5.3 DRA Implementation Example


Here’s a complete example of using DRA for GPU allocation:

# DeviceClass defines available GPU types
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      expression: 'device.driver == "nvidia.com"'
---
# ResourceClaim requests a specific GPU
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: nvidia-gpu
      selectors:
      - cel:
          expression: 'device.attributes["compute.major"] >= 8'
---
# Pod references the claim
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: gpu-claim
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      claims:
      - name: gpu-claim

5.4 DRA vs Device Plugin Comparison

| Feature | Device Plugin | DRA |
| --- | --- | --- |
| Device Selection | Count only | CEL expressions |
| Resource Visibility | Node capacity | Claim objects |
| Preparation | None | Network setup supported |
| Maturity | Stable | Beta (Kubernetes 1.32) |

6. Fractional GPU Sharing Strategies

6.1 Why Share GPUs?

GPU utilization in many inference workloads averages only 10-30%. Sharing GPUs across multiple workloads can significantly reduce costs while maintaining acceptable performance for non-latency-critical applications.
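Back-of-envelope consolidation math makes the case concrete. The numbers below are assumed for illustration, not benchmarks:

```python
# Assumed scenario: 32 inference pods averaging 25% GPU utilization,
# currently running one pod per dedicated GPU.
pods = 32
util_per_pod = 0.25
packed_per_gpu = int(1 // util_per_pod)      # 4 latency-tolerant pods per GPU
shared_gpus = -(-pods // packed_per_gpu)     # ceiling division
savings = 1 - shared_gpus / pods             # fraction of GPUs freed up
print(shared_gpus, savings)
```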

6.2 Time-Slicing

Time-slicing allows multiple pods to share a GPU by rapidly switching between them. Configure time-slicing through the device plugin ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Each GPU appears as 4 resources

After applying this configuration, each physical GPU is advertised as 4 nvidia.com/gpu resources. Pods requesting 1 GPU will share the physical device with up to 3 other pods.
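The arithmetic is worth spelling out, because advertised capacity is purely a scheduling construct, with no extra memory or fault isolation behind it:

```python
# Time-sliced capacity: the node advertises physical GPUs x replicas.
# The kubelet still places pods on real devices; only the count changes.
physical_gpus = 8
replicas = 4                              # matches the ConfigMap above
advertised = physical_gpus * replicas     # what the node reports to k8s
neighbors = replicas - 1                  # pods a 1-GPU request may share with
print(advertised, neighbors)
```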

6.3 GPU Memory Limits

For tighter control over a pod’s share of GPU compute, use CUDA MPS (Multi-Process Service), configured through environment variables:

apiVersion: v1
kind: Pod
metadata:
  name: memory-limited-gpu
spec:
  containers:
  - name: inference
    image: my-inference-app:latest
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0"
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "25"  # Limit this client to ~25% of the GPU's SMs (requires MPS)
    resources:
      limits:
        nvidia.com/gpu: 1

6.4 vGPU (Virtual GPU)

NVIDIA vGPU provides hardware-level isolation for GPU sharing. It requires a licensed vGPU software stack but offers stronger isolation guarantees than time-slicing.

7. Multi-Instance GPU (MIG)

7.1 Understanding MIG

Multi-Instance GPU (MIG) is a feature available on NVIDIA A100, A30, and H100 GPUs that enables hardware-level partitioning of a single GPU into multiple isolated instances. Each instance has dedicated compute resources, memory bandwidth, and L2 cache.

7.2 MIG Profiles

A100 80GB supports various MIG configurations:

| Profile | Memory | SM Count | Max Instances |
| --- | --- | --- | --- |
| 1g.10gb | 10 GB | 14 | 7 |
| 2g.20gb | 20 GB | 28 | 3 |
| 3g.40gb | 40 GB | 42 | 2 |
| 7g.80gb | 80 GB | 98 | 1 |
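A quick way to reason about these profiles is to check a requested layout against the GPU's slice and memory budgets. This sketch uses the A100 80GB numbers from the table above:

```python
# Sanity-check a MIG layout against A100 80GB budgets: 7 compute slices
# total and ~80 GB of memory, per the profile table.
PROFILES = {  # profile -> (compute slices, memory in GB)
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "7g.80gb": (7, 80),
}

def layout_fits(layout, max_slices=7, max_mem_gb=80):
    slices = sum(PROFILES[p][0] * n for p, n in layout.items())
    mem = sum(PROFILES[p][1] * n for p, n in layout.items())
    return slices <= max_slices and mem <= max_mem_gb

print(layout_fits({"1g.10gb": 7}))   # fills all seven slices
print(layout_fits({"3g.40gb": 3}))   # needs 9 slices -- does not fit
```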

7.3 Enabling MIG in Kubernetes

# Request a MIG geometry by labeling the node
kubectl label nodes gpu-node-1 nvidia.com/mig.config=all-1g.10gb

The MIG Manager applies the geometry defined in its ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      all-3g.40gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2

7.4 Requesting MIG Devices

apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:23.10-py3
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1  # Request a 1g.10gb MIG instance

8. GPU Monitoring & Observability

8.1 DCGM Exporter

The NVIDIA Data Center GPU Manager (DCGM) Exporter provides comprehensive GPU metrics for Prometheus. It’s automatically deployed by the GPU Operator.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
        ports:
        - name: metrics
          containerPort: 9400
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

8.2 Key Metrics to Monitor

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization percentage |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy engine utilization |
| DCGM_FI_DEV_FB_USED | Framebuffer (GPU memory) used |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature in Celsius |
| DCGM_FI_DEV_POWER_USAGE | Current power draw in Watts |
| DCGM_FI_DEV_SM_CLOCK | Streaming multiprocessor clock speed |

8.3 Grafana Dashboard Example

Create a comprehensive GPU monitoring dashboard with these PromQL queries:

# GPU Utilization per Pod
sum by (pod, GPU_I_ID) (
  DCGM_FI_DEV_GPU_UTIL{namespace="$namespace"}
)
 
# GPU Memory Usage
sum by (pod, GPU_I_ID) (
  DCGM_FI_DEV_FB_USED{namespace="$namespace"}
) / sum by (pod, GPU_I_ID) (
  DCGM_FI_DEV_FB_FREE{namespace="$namespace"} + 
  DCGM_FI_DEV_FB_USED{namespace="$namespace"}
) * 100
 
# Graphics engine activity per Watt (a rough efficiency proxy;
# DCGM_FI_PROF_GR_ENGINE_ACTIVE is a gauge, so no rate() is needed)
sum by (node) (DCGM_FI_PROF_GR_ENGINE_ACTIVE) /
sum by (node) (DCGM_FI_DEV_POWER_USAGE)

9. Cost Optimization Strategies

9.1 Right-Sizing GPU Workloads

GPU costs can quickly spiral out of control without proper management. Here are proven strategies for optimization:

  1. Profile workloads to understand actual GPU utilization patterns
  2. Use MIG for inference workloads that don’t need full GPU
  3. Implement time-slicing for batch processing
  4. Consider spot/preemptible instances for fault-tolerant training

9.2 Cluster Autoscaling for GPUs

Configure Karpenter for intelligent GPU node provisioning:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
      - key: "karpenter.k8s.aws/instance-category"
        operator: In
        values: ["p", "g"]  # P and G series GPU instances
      - key: "karpenter.k8s.aws/instance-gpu-count"
        operator: Gt
        values: ["0"]
      - key: "kubernetes.io/arch"
        operator: In
        values: ["amd64"]
      nodeClassRef:
        name: gpu-nodes
  limits:
    cpu: 1000
    nvidia.com/gpu: 100
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s

9.3 GPU Cost Attribution with Kubecost

Track GPU costs per namespace, team, or application:

# Install Kubecost with GPU cost tracking
helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace \
  --set kubecostProductConfigs.gpuCostEnabled=true \
  --set prometheus.server.global.external_labels.cluster_id=prod-gpu

10. Production Best Practices

10.1 Resource Quotas for GPU

Implement quotas to prevent GPU resource exhaustion:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-training
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
    persistentvolumeclaims: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-training
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: "4"
    default:
      nvidia.com/gpu: "1"

10.2 Pod Priority and Preemption

Define priority classes to ensure critical GPU workloads get resources:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-critical
value: 1000000
globalDefault: false
description: "Critical GPU training jobs"
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-batch
value: 100000
globalDefault: false
description: "Batch inference workloads"
preemptionPolicy: Never

10.3 Health Checks for GPU Pods

apiVersion: v1
kind: Pod
metadata:
  name: gpu-app
spec:
  containers:
  - name: app
    image: my-gpu-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    livenessProbe:
      exec:
        command:
        - nvidia-smi
        - --query-gpu=gpu_name
        - --format=csv,noheader
      initialDelaySeconds: 30
      periodSeconds: 60
    readinessProbe:
      exec:
        command:
        - python
        - -c
        - "import torch; assert torch.cuda.is_available()"
      initialDelaySeconds: 10
      periodSeconds: 10

10.4 Security Considerations

  • Run GPU pods with non-root users where possible
  • Use Pod Security Standards to restrict device access
  • Implement network policies for GPU workload isolation
  • Regularly update GPU drivers and container toolkit
  • Monitor for GPU-specific vulnerabilities (e.g., side-channel attacks)

Conclusion

GPU scheduling and resource management in Kubernetes has evolved dramatically, transforming from simple device counting to sophisticated allocation mechanisms like DRA and MIG. As AI/ML workloads continue to dominate enterprise computing, mastering these concepts becomes essential for platform engineers and DevOps teams.

Key takeaways from this guide:

  1. The NVIDIA GPU Operator simplifies deployment but requires understanding of underlying components
  2. Dynamic Resource Allocation (DRA) represents the future of GPU scheduling with superior flexibility
  3. GPU sharing strategies (time-slicing, MIG) can significantly reduce costs for appropriate workloads
  4. Comprehensive monitoring with DCGM is essential for optimization and troubleshooting
  5. Production deployments require careful attention to quotas, priorities, and security

Start with the basics—deploy the GPU Operator, verify your workloads, and progressively implement advanced features as your requirements evolve. The convergence of Kubernetes orchestration and GPU acceleration will continue to unlock unprecedented possibilities for machine learning initiatives.


Have Queries? Join https://launchpass.com/collabnix
