Join our Discord Server
Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Kubernetes and GPU: The Complete 2025 Guide to AI/ML Acceleration


As we advance through 2025, the convergence of Kubernetes and GPU acceleration has become the cornerstone of modern AI/ML infrastructure. With "Kubernetes AI" among the fastest-growing search terms in the cloud-native space (a reported 300% increase in search volume), organizations are rapidly adopting GPU-enabled Kubernetes clusters to power their machine learning workloads. This guide covers the trending topics, practical implementations, and optimization strategies shaping the future of AI infrastructure.

Why Kubernetes + GPU is Dominating 2025

The explosive growth in AI/ML workloads has created unprecedented demand for GPU resources. According to recent industry reports:

  • 48% of organizations now use Kubernetes for AI/ML workloads
  • GPU acceleration provides 10-100x performance improvements over CPU-only processing
  • Training large language models can require thousands of GPU hours
  • Companies like OpenAI scale from hundreds to thousands of GPUs in weeks using Kubernetes
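To put "thousands of GPU hours" in concrete terms, here is a quick back-of-the-envelope calculation (the per-GPU hourly rate below is an illustrative assumption, not a quoted price):

```python
def training_cost(num_gpus: int, wall_clock_hours: float,
                  hourly_rate_per_gpu: float) -> tuple[float, float]:
    """Return (total GPU-hours, total cost) for a training run."""
    gpu_hours = num_gpus * wall_clock_hours
    return gpu_hours, gpu_hours * hourly_rate_per_gpu

# 64 GPUs for a two-week run at an assumed $2.50/GPU-hour
hours, cost = training_cost(64, 14 * 24, 2.50)
print(f"{hours:.0f} GPU-hours, ${cost:,.0f}")  # 21504 GPU-hours, $53,760
```

Numbers like these are why the GPU sharing and autoscaling strategies later in this guide matter so much.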

1. Understanding GPU Architecture in Kubernetes

The Device Plugin Framework

Kubernetes manages GPUs through the device plugin framework, which lets hardware vendors advertise specialized devices to the kubelet so containers can request them as extended resources:

# Basic GPU resource request
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: ai-training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 whole GPU
      requests:
        nvidia.com/gpu: 1
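Once the device plugin is running, each node advertises its GPUs under `status.allocatable`. A small helper can summarize GPU capacity per node; this is a sketch that expects the JSON output of `kubectl get nodes -o json`:

```python
import json

def gpu_allocatable(nodes_json: str, resource: str = "nvidia.com/gpu") -> dict:
    """Map node name -> allocatable GPU count from `kubectl get nodes -o json` output."""
    nodes = json.loads(nodes_json)["items"]
    return {
        n["metadata"]["name"]: int(n["status"]["allocatable"].get(resource, "0"))
        for n in nodes
    }

# Abbreviated sample of kubectl output for illustration
sample = '''{"items": [
  {"metadata": {"name": "gpu-node-1"},
   "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "4"}}},
  {"metadata": {"name": "cpu-node-1"},
   "status": {"allocatable": {"cpu": "16"}}}
]}'''
print(gpu_allocatable(sample))  # {'gpu-node-1': 4, 'cpu-node-1': 0}
```

A node reporting 0 here usually means the device plugin is not running or the NVIDIA drivers are missing on that node.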

NVIDIA GPU Operator vs Device Plugin

The choice between NVIDIA GPU Operator and Device Plugin represents a fundamental architectural decision:

Device Plugin Approach:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins

GPU Operator Approach (shown here via the Operator Lifecycle Manager on OpenShift; on upstream Kubernetes the operator is usually installed with Helm):

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: gpu-operator-group
  namespace: gpu-operator
spec:
  targetNamespaces:
  - gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: gpu-operator
spec:
  channel: stable
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace

2. GPU Sharing Strategies: Maximizing Resource Utilization

Multi-Instance GPU (MIG)

MIG enables hardware-level partitioning of supported NVIDIA GPUs (A30, A100, H100) into isolated instances with dedicated memory and compute:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            1g.5gb: 7
      all-2g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            2g.10gb: 3
---
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tensorflow:23.02-tf2-py3
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # Request 1/7th of A100

Time-Slicing and NVIDIA Multi-Process Service (MPS)

Time-slicing and MPS both let multiple containers share one physical GPU in software, trading isolation for utilization. The device plugin configuration below enables time-slicing (note the `timeSlicing` key); MPS is configured analogously with an `mps` block in the sharing config, and the MPS pipe/log environment variables in the pod apply only to the MPS case:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config
data:
  mps-config.yaml: |
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Allow 4 containers per GPU
---
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-workload
spec:
  containers:
  - name: model-inference
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: CUDA_MPS_PIPE_DIRECTORY
      value: "/tmp/nvidia-mps"
    - name: CUDA_MPS_LOG_DIRECTORY
      value: "/tmp/nvidia-log"

Dynamic Resource Allocation (DRA)

DRA represents the future of GPU resource management in Kubernetes, replacing opaque counted resources with device claims that can be selected by attributes. The API is still maturing (note the v1alpha3 group below), so field names may shift between releases:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: nvidia-gpu
      selectors:
      - cel:
          expression: 'device.attributes["compute.major"] >= 8'
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: gpu-claim
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:23.10-py3
    resources:
      claims:
      - name: gpu-claim
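The CEL selector above admits only devices whose compute capability major version is at least 8 (Ampere and newer). As a mental model, the filtering behaves roughly like this sketch, with plain Python standing in for CEL evaluation (device names and attribute values are illustrative):

```python
def matching_devices(devices: list[dict], min_compute_major: int = 8) -> list[str]:
    """Names of devices satisfying device.attributes['compute.major'] >= threshold."""
    return [
        d["name"]
        for d in devices
        if d["attributes"].get("compute.major", 0) >= min_compute_major
    ]

inventory = [
    {"name": "a100-0", "attributes": {"compute.major": 8}},
    {"name": "h100-0", "attributes": {"compute.major": 9}},
    {"name": "v100-0", "attributes": {"compute.major": 7}},
]
print(matching_devices(inventory))  # ['a100-0', 'h100-0']
```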

3. AI/ML Workload Patterns and Best Practices

Distributed Training with Kubeflow

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-training
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            resources:
              limits:
                nvidia.com/gpu: 1
            env:
            - name: TF_CONFIG
              valueFrom:
                configMapKeyRef:
                  name: tf-config
                  key: tf-config.json
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:latest-gpu
            resources:
              limits:
                nvidia.com/gpu: 1
            volumeMounts:
            - name: training-data
              mountPath: /data
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-data-pvc

Model Serving with Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:23.10-py3
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # GRPC
        - containerPort: 8002  # Metrics
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 8Gi
          requests:
            memory: 4Gi
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        volumeMounts:
        - name: model-repository
          mountPath: /models
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 5
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: model-repository-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  selector:
    app: triton
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  type: LoadBalancer
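Once the service is up, clients call Triton's KServe-v2 HTTP API at `POST /v2/models/<name>/infer`. A minimal request-body builder, as a sketch (the input name and tensor shape are illustrative and must match your model's config):

```python
import json

def v2_infer_request(input_name: str, shape: list[int], data: list,
                     datatype: str = "FP32") -> str:
    """Build a KServe-v2 inference request body for Triton's HTTP endpoint."""
    body = {
        "inputs": [{
            "name": input_name,
            "shape": shape,
            "datatype": datatype,
            "data": data,
        }]
    }
    return json.dumps(body)

payload = v2_infer_request("input__0", [1, 4], [0.1, 0.2, 0.3, 0.4])
print(payload)
```

POST this payload to `http://<triton-service>:8000/v2/models/<model-name>/infer` with `Content-Type: application/json`.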

4. Advanced Scheduling and Resource Management

GPU Node Affinity and Taints

# Taint GPU nodes so only GPU workloads are scheduled on them
# (in practice applied with `kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule`)
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
---
# Schedule pods with GPU requirements to tainted nodes
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  nodeSelector:
    accelerator: nvidia-tesla-v100
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1

Priority Classes for GPU Workloads

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000
globalDefault: false
description: "High priority class for critical GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-low-priority
value: 100
globalDefault: false
description: "Low priority class for batch GPU workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-training
spec:
  priorityClassName: gpu-high-priority
  containers:
  - name: training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2
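When GPUs are scarce, the scheduler considers pods from higher-`value` PriorityClasses first and may preempt lower-priority pods to make room. The ordering itself is just a descending sort on the class value, which this sketch makes concrete:

```python
# PriorityClass values from the manifests above
PRIORITY = {"gpu-high-priority": 1000, "gpu-low-priority": 100}

def schedule_order(pods: list[tuple[str, str]]) -> list[str]:
    """Pod names ordered by descending PriorityClass value (ties keep submission order)."""
    return [name for name, cls in sorted(pods, key=lambda p: -PRIORITY[p[1]])]

pending = [
    ("batch-job-1", "gpu-low-priority"),
    ("critical-training", "gpu-high-priority"),
    ("batch-job-2", "gpu-low-priority"),
]
print(schedule_order(pending))  # ['critical-training', 'batch-job-1', 'batch-job-2']
```

This is why batch training jobs belong in the low-priority class: they yield GPUs to latency-sensitive inference when the cluster fills up.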

5. Monitoring and Observability

GPU Metrics with NVIDIA DCGM

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu22.04
        ports:
        - name: metrics
          containerPort: 9400
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Custom GPU Monitoring Dashboard

apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "GPU Utilization Dashboard",
        "panels": [
          {
            "title": "GPU Utilization %",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL",
                "legendFormat": "GPU {{gpu}} - {{pod}}"
              }
            ]
          },
          {
            "title": "GPU Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100",
                "legendFormat": "GPU {{gpu}} Memory %"
              }
            ]
          }
        ]
      }
    }
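The second panel's expression divides framebuffer-used by framebuffer-total. The same calculation done client-side on raw DCGM samples (both fields are reported in MiB):

```python
def fb_used_percent(fb_used_mib: float, fb_total_mib: float) -> float:
    """GPU memory usage %, mirroring DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100."""
    if fb_total_mib <= 0:
        raise ValueError("fb_total_mib must be positive")
    return fb_used_mib / fb_total_mib * 100

# A 40 GiB A100 with 30 GiB of framebuffer in use
print(f"{fb_used_percent(30720, 40960):.1f}%")  # 75.0%
```

Sustained values near 100% are a common precursor to CUDA out-of-memory errors and a good alerting threshold.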

6. Cost Optimization Strategies

Cluster Autoscaler with GPU Nodes

# Node-group minimum and maximum sizes are passed to the autoscaler as
# per-node-group flags, not stored in a ConfigMap
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=gce
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=mig:namePrefix=gpu-pool,min=1,max=10  # 'mig' = GCE managed instance group, not NVIDIA MIG
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m

Spot Instance Integration

apiVersion: v1
kind: ConfigMap
metadata:
  name: spot-config
data:
  config.yaml: |
    spotConfig:
      enabled: true
      maxSpotPercentage: 70
      spotInstanceTypes:
        - g4dn.xlarge
        - g4dn.2xlarge
        - p3.2xlarge
      fallbackOnDemand: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-gpu-workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spot-training
  template:
    metadata:
      labels:
        app: spot-training
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot
      tolerations:
      - key: karpenter.sh/disruption
        operator: Exists
        effect: NoSchedule
      containers:
      - name: training
        image: pytorch/pytorch:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: CHECKPOINT_INTERVAL
          value: "300"  # Checkpoint every 5 minutes for spot resilience

7. Security Best Practices

GPU Workload Security

apiVersion: v1
kind: Pod
metadata:
  name: secure-gpu-workload
spec:
  securityContext:        # securityContext is a pod/container field, not a standalone kind
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: training
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]     # GPU access via the device plugin needs no added capabilities
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: gpu-workload-netpol
spec:
  podSelector:
    matchLabels:
      tier: gpu-training
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: data-storage
    ports:
    - protocol: TCP
      port: 443

8. Real-World Implementation Examples

Complete AI Training Pipeline

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ai-training-pipeline
spec:
  entrypoint: training-pipeline
  templates:
  - name: training-pipeline
    dag:
      tasks:
      - name: data-preprocessing
        template: preprocess-data
      - name: model-training
        template: train-model
        dependencies: [data-preprocessing]
      - name: model-validation
        template: validate-model
        dependencies: [model-training]
      - name: model-deployment
        template: deploy-model
        dependencies: [model-validation]
  
  - name: train-model
    container:
      image: nvcr.io/nvidia/pytorch:23.10-py3
      command: [python]
      args: ["/app/train.py", "--epochs", "100", "--batch-size", "32"]
      resources:
        limits:
          nvidia.com/gpu: 4
          memory: 32Gi
        requests:
          nvidia.com/gpu: 4
          memory: 16Gi
      volumeMounts:
      - name: training-data
        mountPath: /data
      - name: model-output
        mountPath: /models
    volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: training-data-pvc
    - name: model-output
      persistentVolumeClaim:
        claimName: model-output-pvc

9. Performance Optimization Tips

Memory and Compute Optimization

# Python code for optimal GPU memory usage
import torch

def optimize_gpu_memory():
    # Allow TF32 on Ampere+ GPUs for faster matmuls at slightly reduced precision
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    # Use gradient checkpointing for large models to trade compute for memory
    # (MyLargeModel and find_optimal_batch_size are placeholders for your own code)
    model = MyLargeModel()
    model.gradient_checkpointing_enable()

    # Probe for the largest batch size that fits in GPU memory
    optimal_batch_size = find_optimal_batch_size(model)

    return model, optimal_batch_size
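The `find_optimal_batch_size` helper above is a placeholder; one common approach is a doubling search that backs off when memory runs out. A framework-free sketch, where the `fits_in_memory` probe stands in for a real forward/backward attempt wrapped in a CUDA OOM try/except:

```python
def find_optimal_batch_size(fits_in_memory, start: int = 1, ceiling: int = 4096) -> int:
    """Largest power-of-two batch size for which fits_in_memory(batch) returns True."""
    best = 0
    batch = start
    while batch <= ceiling and fits_in_memory(batch):
        best = batch
        batch *= 2
    return best

# Simulated probe: pretend the GPU fits at most 96 samples per batch
print(find_optimal_batch_size(lambda b: b <= 96))  # 64
```

In a real training script the probe would run one training step at the candidate batch size and return False on `torch.cuda.OutOfMemoryError`.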

# Kubernetes Job with optimized settings

apiVersion: batch/v1
kind: Job
metadata:
  name: optimized-training
spec:
  template:
    spec:
      containers:
      - name: training
        image: pytorch/pytorch:latest
        env:
        - name: CUDA_DEVICE_ORDER
          value: "PCI_BUS_ID"
        - name: NCCL_IB_DISABLE
          value: "1"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: OMP_NUM_THREADS
          value: "8"
        resources:
          limits:
            nvidia.com/gpu: 8
            memory: 128Gi
            cpu: 32
          requests:
            nvidia.com/gpu: 8
            memory: 64Gi
            cpu: 16
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 32Gi

10. Future Trends and Roadmap

Emerging Technologies in 2025

  1. WebAssembly (WASM) for GPU: Portable GPU computations across different environments
  2. Confidential Computing: Secure GPU workloads with hardware-based encryption
  3. Edge AI: Kubernetes at the edge with specialized GPU hardware
  4. Quantum-GPU Hybrid: Integration of quantum computing with traditional GPU workloads

# Example: Edge AI deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: edge-inference
  template:
    metadata:
      labels:
        app: edge-inference
    spec:
      nodeSelector:
        edge-location: retail-store
        gpu-type: jetson-nano
      containers:
      - name: inference
        image: nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.10-py3
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: INFERENCE_MODE
          value: "edge-optimized"

Conclusion

As we progress through 2025, the combination of Kubernetes and GPU acceleration continues to evolve rapidly. The key trends shaping this space include:

  1. Improved GPU sharing through MIG, MPS, and DRA
  2. Enhanced AI/ML workflow automation with Kubeflow and Argo
  3. Better cost optimization through spot instances and intelligent scheduling
  4. Advanced monitoring with real-time GPU metrics
  5. Security hardening for sensitive AI workloads

Organizations that master these technologies will gain significant competitive advantages in deploying scalable, cost-effective AI/ML infrastructure.

The future belongs to those who can efficiently orchestrate GPU resources at scale, and Kubernetes provides the perfect platform to achieve this goal. Start with the basics, experiment with GPU sharing strategies, and gradually implement advanced features as your requirements evolve.


Ready to accelerate your AI/ML workloads? Begin with the NVIDIA GPU Operator installation and progressively implement the optimization techniques outlined in this guide. The convergence of Kubernetes orchestration and GPU acceleration will unlock unprecedented possibilities for your machine learning initiatives.

Have Queries? Join https://launchpass.com/collabnix
