Join our Discord Server
Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Kubernetes for Generative AI: Complete Guide to Deploying LLMs at Scale

6 min read

The explosion of Generative AI has transformed how we build applications, but deploying Large Language Models (LLMs) at scale presents unique challenges. Kubernetes has emerged as the de facto platform for orchestrating these AI workloads, offering the scalability, resource management, and operational excellence needed for production Generative AI applications.

In this comprehensive guide, we’ll explore how Kubernetes solves the critical challenges of deploying Generative AI workloads, from GPU scheduling to model serving patterns, with real-world examples you can implement today.

Why Kubernetes for Generative AI?

Generative AI workloads are fundamentally different from traditional applications. They require:

  • Expensive GPU resources that need efficient utilization
  • Dynamic scaling based on inference requests
  • Model versioning and A/B testing capabilities
  • Multi-tenancy support for different models and teams
  • Cost optimization given the high compute requirements

Kubernetes addresses these challenges through:

# Resource abstraction and scheduling
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "32Gi"
  requests:
    nvidia.com/gpu: 1
    memory: "32Gi"

The Economics of Kubernetes for AI

Consider a typical LLM deployment scenario (illustrative figures):

  • Without Kubernetes: static GPU allocation commonly leaves GPUs idle 40-60% of the time
  • With Kubernetes: dynamic scheduling can push GPU utilization to 80-90%
  • Cost impact: roughly 2-3x more value from the same GPU infrastructure investment
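The arithmetic behind these figures can be checked with a short script. All numbers here are assumptions for illustration (a $2/hour A100 and the utilization levels above), not measurements:

```python
# Illustrative idle-cost comparison for a single GPU (all figures are assumptions).
HOURLY_RATE = 2.0          # $/hour for one A100, mid-range of the $1-3 figure
HOURS_PER_MONTH = 730

def monthly_waste(utilization: float) -> float:
    """Dollars per month spent on idle GPU time at a given utilization level."""
    return HOURLY_RATE * HOURS_PER_MONTH * (1.0 - utilization)

static_waste = monthly_waste(0.50)   # ~50% idle with static allocation
dynamic_waste = monthly_waste(0.85)  # ~85% utilization with dynamic scheduling

print(f"Idle cost, static allocation:  ${static_waste:,.0f}/month")
print(f"Idle cost, dynamic scheduling: ${dynamic_waste:,.0f}/month")
print(f"Monthly savings per GPU:       ${static_waste - dynamic_waste:,.0f}")
```

At these assumed rates, moving from 50% to 85% utilization saves on the order of $500 per GPU per month, which is where the 2-3x figure comes from once you run fewer, busier GPUs.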

Key Challenges in Deploying Generative AI

Before diving into solutions, let’s understand the unique challenges:

1. Resource Intensity

Large Language Models like GPT, LLaMA, or Mistral require:

  • One or more GPUs (a 7B model typically fits on a single 24-80GB GPU; 70B+ models need 4-8 or more with tensor parallelism)
  • High memory bandwidth (A100s with 80GB VRAM)
  • Fast storage for model weights (100GB+ per model)

2. Latency Requirements

Real-time AI applications demand:

  • Sub-second inference for interactive applications
  • Batch processing optimization for non-interactive workloads
  • Efficient request queuing and batching

3. Model Management

Organizations need to:

  • Deploy multiple model versions simultaneously
  • Implement canary deployments for new models
  • Handle model rollbacks seamlessly
  • Manage model artifacts efficiently

4. Cost Optimization

GPU costs dominate AI infrastructure:

  • A100 GPUs cost $1-3/hour on cloud platforms
  • Idle GPUs waste thousands of dollars monthly
  • Efficient bin-packing of workloads is critical

Kubernetes Components for AI Workloads

Essential Kubernetes Resources

1. GPU Device Plugin

The NVIDIA GPU device plugin enables Kubernetes to discover and allocate GPU resources:

# Deploy NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

# Verify GPU nodes
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
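Once GPUs show up as allocatable, a throwaway pod that requests one GPU and runs nvidia-smi is a quick smoke test. The CUDA image tag below is illustrative; pick one compatible with your node's driver version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.3.1-base-ubuntu22.04  # match your driver version
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If `kubectl logs gpu-smoke-test` prints the familiar nvidia-smi table, scheduling and the device plugin are working.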

2. StatefulSets for Model Serving

StatefulSets provide stable network identities for model servers:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llm-server
spec:
  serviceName: "llm-service"
  replicas: 3
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model=meta-llama/Llama-2-7b-chat-hf"
          - "--tensor-parallel-size=1"
          - "--max-model-len=4096"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
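A StatefulSet's `serviceName` must reference a headless Service for the per-pod stable DNS names to resolve. A matching definition for the manifest above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  clusterIP: None   # headless: gives each pod a stable DNS name
  selector:
    app: llm-server
  ports:
  - port: 8000
    name: http
```

With this in place, individual replicas are reachable at e.g. `llm-server-0.llm-service` for sticky routing or debugging.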

3. Horizontal Pod Autoscaler (HPA)

Scale based on custom metrics like queue depth or GPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"

GPU Scheduling and Management

Efficient GPU scheduling is critical for cost-effective Generative AI deployments.

Multi-Instance GPU (MIG) Support

NVIDIA A100 and H100 GPUs support partitioning:

# Example: Deploy multiple small models on a single A100
apiVersion: v1
kind: Pod
metadata:
  name: llm-small-1
spec:
  containers:
  - name: model-server
    image: my-llm-server:latest
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # 1/7th of A100

Time-Slicing for GPU Sharing

For development environments or smaller models:

# ConfigMap for time-sliced GPUs
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Allow 4 pods per GPU
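The ConfigMap only takes effect once the device plugin is pointed at it. With the NVIDIA GPU Operator, that is done on the ClusterPolicy resource; the fragment below assumes a recent operator version and references the ConfigMap defined above:

```yaml
# ClusterPolicy fragment (NVIDIA GPU Operator): wire up the sharing config
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: gpu-sharing-config   # the ConfigMap defined above
      default: config.yaml       # key inside the ConfigMap to use by default
```

After the device plugin restarts, each physical GPU advertises 4 `nvidia.com/gpu` resources. Note that time-slicing provides no memory isolation between the sharing pods, which is why it suits development rather than production.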

GPU Node Affinity

Ensure models land on appropriate GPU types:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values:
          - NVIDIA-A100-SXM4-80GB
          - NVIDIA-H100-80GB-HBM3

Model Serving Patterns

Pattern 1: KServe for Production ML

KServe provides production-grade model serving:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llama2-7b"
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: 32Gi
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: concurrency
    scaleTarget: 80  # target concurrent requests per replica

Pattern 2: vLLM for High-Throughput Inference

vLLM optimizes LLM inference with PagedAttention:

apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command:
          - python3
          - -m
          - vllm.entrypoints.openai.api_server
          - --model=mistralai/Mistral-7B-Instruct-v0.1
          - --tensor-parallel-size=1
          - --gpu-memory-utilization=0.9
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1

Pattern 3: Ray Serve for Complex AI Applications

Ray Serve enables multi-model deployments and complex inference graphs:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: multi-model-service
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveConfig:
    importPath: multi_model:deployment
    runtimeEnv: |
      env_vars:
        MODEL_PATH: "/models"
    deployments:
      - name: text-generation
        numReplicas: 2
        rayActorOptions:
          numGpus: 1
      - name: embedding
        numReplicas: 3
        rayActorOptions:
          numGpus: 0.5

Scaling Strategies for LLMs

Vertical Pod Autoscaling (VPA)

Optimize CPU and memory requests based on actual usage (note that VPA manages CPU and memory only; it cannot resize GPU allocations):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: model-server
      controlledResources: ["cpu", "memory"]  # VPA cannot resize GPU counts
      minAllowed:
        memory: "16Gi"
      maxAllowed:
        memory: "80Gi"

Cluster Autoscaling

Automatically add GPU nodes during high demand:

# GKE example
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*-a100-.*
    5:
      - .*-t4-.*

Queue-Based Scaling

Implement request queuing with KEDA:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
spec:
  scaleTargetRef:
    name: llm-deployment
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: redis
    metadata:
      address: redis-service:6379
      listName: inference_queue
      listLength: "5"

Best Practices and Security

1. Model Caching Strategy

Reduce cold start times with persistent volumes:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi
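To warm this cache before the model server starts, an initContainer in the serving pod can pre-download the weights into the shared volume. The downloader command and model name below are illustrative (any download tool works; `huggingface-cli download` assumes the `huggingface_hub` package); the volume name matches the `model-cache` mount used in the StatefulSet earlier:

```yaml
# Pod spec fragment: pre-download model weights into the cache volume
initContainers:
- name: model-downloader
  image: python:3.11-slim
  command: ["sh", "-c"]
  args:
    - pip install -q huggingface_hub &&
      huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2
      --local-dir /cache/mistral-7b
  volumeMounts:
  - name: model-cache
    mountPath: /cache
```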

2. Network Policies for Multi-Tenancy

Isolate different model deployments:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-isolation
spec:
  podSelector:
    matchLabels:
      app: llm-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: api-gateway
    ports:
    - protocol: TCP
      port: 8000

3. Resource Quotas

Prevent resource exhaustion:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ai-workloads
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"

4. Model Artifact Security

Use sealed secrets for API keys:

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: model-credentials
spec:
  encryptedData:
    hf_token: AgBvN2...  # Encrypted Hugging Face token

Real-World Implementation

Let’s put it all together with a production-ready deployment of a Generative AI application.

Complete Example: Multi-Model RAG System

# Namespace for AI workloads
apiVersion: v1
kind: Namespace
metadata:
  name: rag-system

---
# Persistent storage for models
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: rag-system
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 1Ti

---
# Embedding model deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
  namespace: rag-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: embedding-service
  template:
    metadata:
      labels:
        app: embedding-service
    spec:
      containers:
      - name: sentence-transformers
        image: sentence-transformers/all-mpnet-base-v2:latest  # placeholder: substitute a real embedding-server image that serves this model
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

---
# LLM service with GPU
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llm-service
  namespace: rag-system
spec:
  serviceName: "llm"
  replicas: 2
  selector:
    matchLabels:
      app: llm-service
  template:
    metadata:
      labels:
        app: llm-service
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.2.6
        command:
          - python3
          - -m
          - vllm.entrypoints.openai.api_server
          - --model=mistralai/Mistral-7B-Instruct-v0.2
          - --tensor-parallel-size=1
          - --max-model-len=8192
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /root/.cache
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-storage

---
# Service exposure
apiVersion: v1
kind: Service
metadata:
  name: llm-service
  namespace: rag-system
spec:
  selector:
    app: llm-service
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  type: ClusterIP

---
# HPA for dynamic scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: embedding-hpa
  namespace: rag-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: embedding-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

---
# Ingress for external access
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rag-ingress
  namespace: rag-system
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - rag.example.com
    secretName: rag-tls
  rules:
  - host: rag.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: llm-service
            port:
              number: 8000

Cost Optimization Strategies

1. Spot Instances for Non-Critical Workloads

nodeSelector:
  kubernetes.io/lifecycle: spot  # label is provider-specific, e.g. eks.amazonaws.com/capacityType: SPOT on EKS
tolerations:
- key: "spot"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"

2. Pod Disruption Budgets

Ensure availability during node maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: llm-service

3. Cluster Autoscaler with Priority

priorityClassName: high-priority-ai
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-ai
value: 1000
globalDefault: false
description: "High priority for critical AI workloads"

Monitoring and Observability

Prometheus Metrics for GPU Utilization

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s

Key Metrics to Track

  1. GPU Utilization: Target 80-90% for cost efficiency
  2. Inference Latency: P50, P95, P99 percentiles
  3. Queue Depth: Request backlog monitoring
  4. Token Throughput: Tokens/second per GPU
  5. Model Load Time: Cold start performance
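Latency percentiles like those above are computed directly from raw per-request timings; a minimal nearest-rank sketch (the sample latencies are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Per-request latencies in milliseconds (illustrative)
latencies_ms = [120, 95, 210, 88, 300, 150, 99, 1100, 130, 105]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

Tracking P95/P99 rather than the mean matters for LLM serving: a single slow request (the 1100 ms outlier above) barely moves the average but dominates tail latency.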

Future Trends and Considerations

1. Multi-Cloud GPU Federation

Kubernetes enables workload portability across cloud providers:

  • AWS with EKS and EC2 P4/P5 instances
  • GCP with GKE and A2/A3 instances
  • Azure with AKS and ND-series VMs

2. Edge AI with K3s

Lightweight Kubernetes for edge deployments:

  • Reduced latency for inference
  • Data privacy compliance
  • Offline operation capability

3. Model Mesh Architectures

Sophisticated routing between multiple models:

  • Load balancing across model versions
  • Fallback strategies for model failures
  • Cost-based routing (expensive vs. cheap models)

Conclusion

Kubernetes has become the backbone of production Generative AI deployments, offering the orchestration, scalability, and operational efficiency needed to run LLMs at scale. The key takeaways:

  1. GPU scheduling is critical for cost optimization
  2. Model serving patterns vary based on workload requirements
  3. Autoscaling strategies must account for both compute and cost
  4. Security and multi-tenancy are essential for enterprise deployments
  5. Observability drives continuous optimization

As Generative AI continues to evolve, Kubernetes provides the flexible foundation needed to adapt to new model architectures, hardware innovations, and deployment patterns.

Getting Started

Ready to deploy your first Generative AI workload on Kubernetes? Start with these steps:

  1. Set up a Kubernetes cluster with GPU nodes
  2. Install the NVIDIA GPU device plugin
  3. Deploy a simple model using vLLM or KServe
  4. Implement monitoring and autoscaling
  5. Optimize costs based on actual usage patterns

The future of AI is distributed, scalable, and orchestrated by Kubernetes.


Have questions about deploying Generative AI on Kubernetes? Join the Collabnix community and connect with thousands of developers building the future of AI infrastructure.

Related Resources:

  • Docker for AI Development
  • GPU Optimization Guide
  • Multi-Agent AI Systems

Have Queries? Join https://launchpass.com/collabnix
