Tanvir Kour is a passionate technical blogger and open source enthusiast. She is a graduate in Computer Science and Engineering and has 4 years of experience providing IT solutions. She is well-versed in Linux, Docker, and cloud-native applications. You can connect with her on Twitter: https://x.com/tanvirkour

Enable GPU Support in Kubernetes: A Complete Guide



Running Large Language Models (LLMs) with Ollama in Kubernetes requires GPU acceleration for optimal performance. This comprehensive guide walks you through enabling NVIDIA and AMD GPU support in Kubernetes clusters, deploying Ollama with GPU resources, and building a sample AI application.

Prerequisites

Before enabling GPU support for Ollama in Kubernetes, ensure you have:

  • A Kubernetes cluster (v1.24+)
  • kubectl configured and connected to your cluster
  • Nodes with NVIDIA or AMD GPUs
  • Administrative access to cluster nodes
  • Docker or containerd runtime

Understanding GPU Support in Kubernetes

Kubernetes treats GPUs as extended resources using the device plugin framework. GPU vendors provide device plugins that:

  1. Discover GPUs on cluster nodes
  2. Advertise GPU resources to the Kubernetes API server
  3. Allocate GPUs to pods requesting them
  4. Monitor GPU health and availability
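
Once a device plugin is running, the GPUs it advertises appear under each node's status.capacity and status.allocatable as extended resources. As a minimal sketch of inspecting that programmatically, the following parses JSON shaped like the output of kubectl get nodes -o json (the sample node data is made up for illustration):

```python
def gpu_capacity(nodes_json, resource="nvidia.com/gpu"):
    """Map node name -> advertised count of an extended GPU resource."""
    result = {}
    for item in nodes_json.get("items", []):
        name = item["metadata"]["name"]
        capacity = item.get("status", {}).get("capacity", {})
        # Extended resources are reported as strings and absent on non-GPU nodes
        result[name] = int(capacity.get(resource, "0"))
    return result

# Hypothetical sample shaped like `kubectl get nodes -o json` output
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"capacity": {"cpu": "16", "nvidia.com/gpu": "2"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"capacity": {"cpu": "8"}}},
    ]
}
print(gpu_capacity(sample))  # {'gpu-node-1': 2, 'cpu-node-1': 0}
```

The same function works for AMD nodes by passing resource="amd.com/gpu".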

Enabling NVIDIA GPU Support

Step 1: Install NVIDIA Container Toolkit on Nodes

First, install the NVIDIA Container Toolkit on all GPU nodes. This toolkit enables containers to access NVIDIA GPUs.

# Add NVIDIA package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Explanation:

  • The distribution variable detects your Linux distribution so the correct packages are fetched
  • The first curl command adds NVIDIA’s GPG key for package verification
  • The second curl pipeline adds NVIDIA’s container toolkit repository to your package manager
  • apt-get install installs the NVIDIA Container Toolkit
  • nvidia-ctk runtime configure points Docker at the NVIDIA runtime for GPU access
  • systemctl restart docker restarts Docker to apply the configuration

Step 2: Deploy NVIDIA Device Plugin

The NVIDIA device plugin daemonset runs on all GPU nodes and advertises GPU resources to Kubernetes.

# nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Explanation:

  • The DaemonSet ensures one plugin pod runs on every GPU node
  • The nvidia.com/gpu toleration allows the plugin to run on nodes tainted for GPUs
  • priorityClassName: system-node-critical keeps the plugin running even under resource pressure
  • The container uses NVIDIA’s official device plugin image
  • --fail-on-init-error=false prevents crashes if GPUs aren’t immediately available
  • The hostPath volume mounts /var/lib/kubelet/device-plugins, the socket directory the plugin uses to communicate with the kubelet

Deploy the plugin:

kubectl apply -f nvidia-device-plugin.yaml

Verify GPU availability:

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu

Explanation: This command lists all nodes and shows how many NVIDIA GPUs each node advertises to Kubernetes.

Enabling AMD GPU Support

Step 1: Install AMD GPU Drivers and ROCm

Install AMD ROCm (Radeon Open Compute) platform on GPU nodes:

# Add AMD ROCm repository
wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo apt-get install ./amdgpu-install_6.0.60000-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm --no-dkms

# Add user to video and render groups
sudo usermod -a -G video,render $USER

Explanation:

  • wget downloads the AMD GPU installer package
  • apt-get install installs the AMD GPU installer helper
  • amdgpu-install installs the ROCm runtime without DKMS (Dynamic Kernel Module Support)
  • usermod adds the current user to the video and render groups needed for GPU access

Step 2: Deploy AMD Device Plugin

# amd-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: amdgpu-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: amdgpu-dp-ds
  template:
    metadata:
      labels:
        name: amdgpu-dp-ds
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - key: amd.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: amdgpu-dp-cntr
        image: rocm/k8s-device-plugin:latest
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: dp
          mountPath: /var/lib/kubelet/device-plugins
        - name: sys
          mountPath: /sys
      volumes:
      - name: dp
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: sys
        hostPath:
          path: /sys

Explanation:

  • The matchLabels selector associates pods with this DaemonSet
  • The amd.com/gpu toleration allows scheduling on nodes with AMD GPU taints
  • The container uses AMD’s official ROCm device plugin image
  • The device-plugins hostPath mount enables communication with the kubelet
  • The /sys hostPath mount exposes host hardware information so the plugin can discover AMD GPUs

Deploy the AMD plugin:

kubectl apply -f amd-device-plugin.yaml

# Verify AMD GPU availability
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.amd\\.com/gpu

Deploying Ollama with GPU Support

Step 3: Create Ollama Namespace and Deployment

First, create a dedicated namespace:

kubectl create namespace ollama

Explanation: Creates an isolated namespace for Ollama resources, providing organizational separation and resource management.

Step 4: Deploy Ollama with GPU Resources

# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0:11434"
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: "1"  # For NVIDIA GPUs
            # amd.com/gpu: "1"    # Uncomment for AMD GPUs
          limits:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
            # amd.com/gpu: "1"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP

Explanation:

  • replicas: 1 ensures only one Ollama instance runs (LLMs are resource-intensive)
  • The official ollama/ollama container image is used
  • containerPort: 11434 exposes the Ollama API
  • The OLLAMA_HOST environment variable tells Ollama to listen on all interfaces
  • Resource requests and limits:
    • requests: Guaranteed resources Kubernetes reserves for the pod
    • limits: Maximum resources the pod can consume
    • nvidia.com/gpu: "1": Requests one NVIDIA GPU
    • GPUs are always requested in whole numbers (no fractional GPUs)
  • The ollama-data volume mounts persistent storage at /root/.ollama for model data, backed by the ollama-pvc PersistentVolumeClaim
  • The Service exposes Ollama internally within the cluster on port 11434
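
One GPU-specific scheduling rule is worth calling out: extended resources such as nvidia.com/gpu must be whole numbers, and when both requests and limits are set they must be equal. As a hedged sketch (the helper name is our own, not a Kubernetes API), a pre-flight check over a container's resources block could look like:

```python
def check_gpu_resources(resources, gpu_keys=("nvidia.com/gpu", "amd.com/gpu")):
    """Return a list of problems with GPU requests/limits in a container's
    resources dict. Extended resources must be whole numbers, and requests
    must equal limits when both are specified."""
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for key in gpu_keys:
        req, lim = requests.get(key), limits.get(key)
        if req is None and lim is None:
            continue  # this GPU type is not requested at all
        for label, val in (("requests", req), ("limits", lim)):
            if val is not None and not str(val).isdigit():
                problems.append(f"{key} {label} must be a whole number, got {val!r}")
        if req != lim:
            problems.append(f"{key}: requests ({req!r}) must equal limits ({lim!r})")
    return problems

ok = {"requests": {"nvidia.com/gpu": "1"}, "limits": {"nvidia.com/gpu": "1"}}
bad = {"requests": {"nvidia.com/gpu": "0.5"}, "limits": {"nvidia.com/gpu": "1"}}
print(check_gpu_resources(ok))   # []
print(check_gpu_resources(bad))  # two problems: fractional GPU, mismatch
```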

Step 5: Create Persistent Storage

# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard  # Adjust based on your cluster

Explanation:

  • ReadWriteOnce means the volume can be mounted read-write by a single node
  • storage: 50Gi requests 50GB of storage (LLM models can be large: Llama 7B ≈ 4GB, Llama 70B ≈ 40GB)
  • storageClassName determines the type of storage provisioned (SSD, HDD, or cloud provider-specific); adjust it for your cluster

Deploy all resources:

kubectl apply -f ollama-pvc.yaml
kubectl apply -f ollama-deployment.yaml

Verify deployment:

# Check pod status
kubectl get pods -n ollama

# Check GPU allocation
kubectl describe pod -n ollama -l app=ollama | grep -A 5 "Limits"

# View logs
kubectl logs -n ollama -l app=ollama

Explanation:

  • kubectl get pods lists all pods in the ollama namespace
  • kubectl describe with the grep filter shows resource limits, verifying the GPU allocation
  • kubectl logs shows Ollama container logs for troubleshooting

Step 6: Load a Model into Ollama

# Port-forward to access Ollama locally
kubectl port-forward -n ollama svc/ollama-service 11434:11434 &

# Pull a model (using Llama 2 as example)
curl http://localhost:11434/api/pull -d '{
  "name": "llama2:7b"
}'

# Verify model is loaded
curl http://localhost:11434/api/tags

Explanation:

  • kubectl port-forward creates a tunnel from your local machine to the Ollama service in Kubernetes
  • The /api/pull request pulls the Llama 2 7B model into Ollama
  • The /api/tags request lists all models available in Ollama
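
The /api/pull endpoint streams its progress as newline-delimited JSON, one object per line with a status field and, for download steps, total/completed byte counts. As a sketch of consuming that stream (the sample lines are fabricated, but the field names follow Ollama's API):

```python
import json

def pull_progress(ndjson_lines):
    """Summarize /api/pull streaming output into human-readable status lines."""
    statuses = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        event = json.loads(line)
        status = event.get("status", "")
        if "total" in event and "completed" in event:
            # Download steps report byte counts we can turn into a percentage
            pct = 100 * event["completed"] / event["total"]
            statuses.append(f"{status}: {pct:.0f}%")
        else:
            statuses.append(status)
    return statuses

# Hypothetical sample of the streamed lines
sample = [
    '{"status": "pulling manifest"}',
    '{"status": "downloading", "total": 4000, "completed": 1000}',
    '{"status": "success"}',
]
print(pull_progress(sample))  # ['pulling manifest', 'downloading: 25%', 'success']
```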

Building a Sample GPU-Powered Application

Let’s build a Python application that uses Ollama for text generation with GPU acceleration.

Application Code

# ollama-client.py
import os
import requests
import json
import time

class OllamaClient:
    def __init__(self, base_url=os.environ.get("OLLAMA_HOST", "http://ollama-service.ollama.svc.cluster.local:11434")):
        """
        Initialize Ollama client
        
        Args:
            base_url: Ollama service URL (using Kubernetes DNS)
        """
        self.base_url = base_url
        self.generate_endpoint = f"{base_url}/api/generate"
        self.chat_endpoint = f"{base_url}/api/chat"
        
    def generate(self, model, prompt, stream=False):
        """
        Generate text using Ollama
        
        Args:
            model: Model name (e.g., "llama2:7b")
            prompt: Input prompt for generation
            stream: Whether to stream the response
            
        Returns:
            Generated text response
        """
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        
        try:
            response = requests.post(
                self.generate_endpoint, 
                json=payload,
                timeout=300  # 5 minutes timeout for large models
            )
            response.raise_for_status()
            
            if stream:
                return self._handle_stream(response)
            else:
                return response.json()['response']
                
        except requests.exceptions.RequestException as e:
            print(f"Error communicating with Ollama: {e}")
            return None
    
    def _handle_stream(self, response):
        """Handle streaming responses from Ollama"""
        full_response = ""
        for line in response.iter_lines():
            if line:
                json_response = json.loads(line)
                if 'response' in json_response:
                    full_response += json_response['response']
                if json_response.get('done', False):
                    break
        return full_response
    
    def chat(self, model, messages):
        """
        Chat with Ollama using conversation history
        
        Args:
            model: Model name
            messages: List of message dictionaries with 'role' and 'content'
            
        Returns:
            Assistant's response
        """
        payload = {
            "model": model,
            "messages": messages,
            "stream": False
        }
        
        try:
            response = requests.post(self.chat_endpoint, json=payload, timeout=300)
            response.raise_for_status()
            return response.json()['message']['content']
        except requests.exceptions.RequestException as e:
            print(f"Error in chat: {e}")
            return None

# Example usage
if __name__ == "__main__":
    client = OllamaClient()
    
    # Simple generation example
    print("=== Simple Generation ===")
    start_time = time.time()
    response = client.generate(
        model="llama2:7b",
        prompt="Explain how GPUs accelerate machine learning in simple terms."
    )
    end_time = time.time()
    
    print(f"Response: {response}")
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    
    # Chat example with conversation history
    print("\n=== Chat Example ===")
    messages = [
        {"role": "user", "content": "What is Kubernetes?"},
        {"role": "assistant", "content": "Kubernetes is an open-source container orchestration platform."},
        {"role": "user", "content": "How does it help with GPU workloads?"}
    ]
    
    chat_response = client.chat(model="llama2:7b", messages=messages)
    print(f"Chat Response: {chat_response}")

Explanation:

  • __init__ initializes the client with Ollama’s Kubernetes service URL
    • Uses the Kubernetes DNS format: service-name.namespace.svc.cluster.local
  • generate() sends prompts to Ollama
    • The 300-second timeout accommodates slow inference on large models
    • Streaming responses are handed to the streaming handler; otherwise the complete generated text is returned
  • _handle_stream() processes streaming responses line by line
    • Concatenates text chunks as they arrive
    • Stops when done: true is received
  • chat() maintains conversation context
    • Accepts message history to preserve context across turns
  • The example usage demonstrates both generation modes
    • Times inference to show GPU acceleration impact
    • Demonstrates conversation history management

Kubernetes Deployment for Client App

# ollama-client-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-client
  namespace: ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama-client
  template:
    metadata:
      labels:
        app: ollama-client
    spec:
      containers:
      - name: ollama-client
        image: python:3.11-slim
        command: ["/bin/sh"]
        args:
        - -c
        - |
          pip install requests
          python /app/ollama-client.py
        volumeMounts:
        - name: app-code
          mountPath: /app
        env:
        - name: OLLAMA_HOST
          value: "http://ollama-service.ollama.svc.cluster.local:11434"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: app-code
        configMap:
          name: ollama-client-code
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-client-code
  namespace: ollama
data:
  ollama-client.py: |
    # Paste the Python code from above here

Explanation:

  • replicas: 2 provides high availability (the client doesn’t need a GPU)
  • The container command:
    • Installs the required Python package
    • Runs the application
  • The app-code volume mounts the application code from a ConfigMap
  • The OLLAMA_HOST environment variable holds the Ollama service URL
  • Resource limits:
    • Modest resources since the client does no GPU-intensive work
    • Requests guarantee minimum resources
    • Limits prevent resource hogging
  • The ConfigMap stores the application source code
    • Decouples code from the container image
    • Allows easy updates without rebuilding images

REST API Service

# api-service.py
from flask import Flask, request, jsonify
import requests
import json

app = Flask(__name__)
OLLAMA_URL = "http://ollama-service.ollama.svc.cluster.local:11434"

@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint"""
    return jsonify({"status": "healthy"}), 200

@app.route('/generate', methods=['POST'])
def generate():
    """
    Generate text from prompt
    
    Request body:
    {
        "model": "llama2:7b",
        "prompt": "Your prompt here",
        "temperature": 0.7,
        "max_tokens": 500
    }
    """
    try:
        data = request.get_json()
        
        # Validate required fields
        if 'model' not in data or 'prompt' not in data:
            return jsonify({"error": "Missing required fields: model and prompt"}), 400
        
        # Prepare Ollama request
        ollama_payload = {
            "model": data['model'],
            "prompt": data['prompt'],
            "stream": False,
            "options": {
                "temperature": data.get('temperature', 0.7),
                "num_predict": data.get('max_tokens', 500)
            }
        }
        
        # Call Ollama
        response = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json=ollama_payload,
            timeout=300
        )
        response.raise_for_status()
        
        result = response.json()
        return jsonify({
            "model": data['model'],
            "response": result['response'],
            "total_duration": result.get('total_duration', 0),
            "load_duration": result.get('load_duration', 0),
            "prompt_eval_count": result.get('prompt_eval_count', 0),
            "eval_count": result.get('eval_count', 0)
        }), 200
        
    except requests.exceptions.RequestException as e:
        return jsonify({"error": f"Ollama service error: {str(e)}"}), 500
    except Exception as e:
        return jsonify({"error": f"Internal server error: {str(e)}"}), 500

@app.route('/models', methods=['GET'])
def list_models():
    """List available models"""
    try:
        response = requests.get(f"{OLLAMA_URL}/api/tags", timeout=30)
        response.raise_for_status()
        return jsonify(response.json()), 200
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Explanation:

  • The /health endpoint serves Kubernetes liveness/readiness probes
  • The /generate endpoint:
    • Validates that model and prompt are present in the request
    • Prepares the Ollama payload with configurable parameters
    • temperature controls randomness (0 = deterministic, 1 = creative)
    • num_predict limits the maximum tokens generated
    • Forwards the request to the Ollama service
    • Returns the response with performance metrics:
      • total_duration: Total inference time
      • load_duration: Model loading time
      • prompt_eval_count: Number of prompt tokens processed
      • eval_count: Number of tokens generated
  • The /models endpoint lists all available models in Ollama
  • app.run(host='0.0.0.0', port=8080) exposes the API on all interfaces for in-cluster access
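
One detail worth knowing when reading those metrics: Ollama reports durations in nanoseconds. The raw response also includes an eval_duration field (not forwarded by the API above), which together with eval_count yields generation speed. A hedged sketch (the sample metrics dict is made up):

```python
def tokens_per_second(metrics):
    """Generation speed from Ollama response metrics.
    Durations are in nanoseconds; eval_count is tokens generated."""
    eval_count = metrics.get("eval_count", 0)
    eval_duration_ns = metrics.get("eval_duration", 0)
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)

# Made-up metrics: 120 tokens generated over 2 seconds of eval time
sample = {"eval_count": 120, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))  # 60.0
```

Tracking this number before and after enabling GPU support is a simple way to confirm the acceleration is working.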

Complete Deployment with Service

# api-service-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-service-code
  namespace: ollama
data:
  api-service.py: |
    # Paste the Flask API code from above
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: ollama
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api-service
        image: python:3.11-slim
        command: ["/bin/sh"]
        args:
        - -c
        - |
          pip install flask requests
          python /app/api-service.py
        ports:
        - containerPort: 8080
          name: http
        volumeMounts:
        - name: code
          mountPath: /app
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: code
        configMap:
          name: api-service-code
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: ollama
spec:
  selector:
    app: api-service
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer  # Change to ClusterIP for internal-only access

Explanation:

  • The ConfigMap stores the Flask application code
  • replicas: 3 provides load balancing and high availability
  • The container command installs dependencies and starts the app at container startup
  • containerPort: 8080 exposes the port for API traffic
  • Health probes:
    • livenessProbe: Kubernetes restarts the container if this fails
    • readinessProbe: Kubernetes routes traffic only when this succeeds
    • initialDelaySeconds: Delay before the first check (allows app startup)
    • periodSeconds: Frequency of health checks
  • Conservative resource limits (the API is lightweight)
  • Service configuration:
    • type: LoadBalancer: Exposes the API externally (cloud providers assign an external IP)
    • Alternative: Use ClusterIP for internal-only access

Deploy and test:

# Deploy the API service
kubectl apply -f api-service-deployment.yaml

# Get service URL (for LoadBalancer)
kubectl get svc -n ollama api-service

# Test the API
export API_URL=$(kubectl get svc -n ollama api-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -X POST http://$API_URL/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Write a haiku about Kubernetes and GPUs",
    "temperature": 0.8,
    "max_tokens": 100
  }'

# List available models
curl http://$API_URL/models

Explanation:

  • kubectl apply deploys the API service to Kubernetes
  • kubectl get svc lists services to find the external IP
  • The jsonpath expression extracts the LoadBalancer IP programmatically
  • The curl POST tests text generation with custom parameters
  • The final curl lists models to verify the Ollama integration

Monitoring GPU Usage

Deploy NVIDIA DCGM Exporter (for NVIDIA GPUs)

# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - port: 9400
    targetPort: 9400
    name: metrics

Explanation:

  • The nodeSelector (nvidia.com/gpu.present: "true") ensures DCGM runs only on GPU nodes
  • NVIDIA’s Data Center GPU Manager (DCGM) exporter image publishes GPU metrics
  • The security context grants the privileges needed to access GPU information
  • The pod-resources mount provides GPU allocation details
  • The DCGM_EXPORTER_* environment variables configure Prometheus metrics export
  • The Service exposes the metrics port for Prometheus scraping

Query GPU metrics:

# Port-forward to access metrics
kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400

# View GPU utilization
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

# View GPU memory usage
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_FB_USED

Explanation:

  • kubectl port-forward creates local access to DCGM metrics
  • DCGM_FI_DEV_GPU_UTIL is the GPU utilization percentage (0-100%)
  • DCGM_FI_DEV_FB_USED is the GPU framebuffer memory usage in MB
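
DCGM exposes these values in the standard Prometheus text exposition format, one sample per line. For a quick look without a full Prometheus stack, a small parser sketch works (the sample output below is illustrative, not captured from a real exporter):

```python
def parse_metric(exposition_text, metric_name):
    """Extract values for one metric from Prometheus exposition format,
    keyed by the raw label string (e.g. the GPU index labels)."""
    values = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip comments (# HELP / # TYPE) and other metrics
        if not line.startswith(metric_name) or line.startswith("#"):
            continue
        # e.g. DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
        name_labels, _, value = line.rpartition(" ")
        labels = ""
        if "{" in name_labels:
            labels = name_labels[name_labels.index("{"):]
        values[labels] = float(value)
    return values

sample = """# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 12
"""
print(parse_metric(sample, "DCGM_FI_DEV_GPU_UTIL"))
```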

Check GPU Usage Directly

# Execute nvidia-smi inside Ollama pod
kubectl exec -n ollama -it $(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi

# Watch GPU usage in real-time
kubectl exec -n ollama -it $(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}') -- watch -n 1 nvidia-smi

Explanation:

  • The first command runs nvidia-smi inside the Ollama container to show GPU status
    • Shows GPU utilization, memory usage, temperature, power draw
  • The watch variant continuously monitors the GPU with a 1-second refresh
    • Useful for observing GPU activity during inference

Troubleshooting Common Issues

Issue 1: GPUs Not Detected

# Check if device plugin is running
kubectl get pods -n kube-system | grep device-plugin

# Check node GPU labels
kubectl describe node <node-name> | grep -i gpu

# Verify container runtime configuration
docker info | grep -i nvidia  # For Docker

Solution: Ensure NVIDIA Container Toolkit is installed and Docker/containerd is configured with GPU support.

Issue 2: Pod Stuck in Pending State

# Check why pod is pending
kubectl describe pod -n ollama <pod-name> | grep -A 10 Events

Common causes:

  • Insufficient GPU resources: No nodes have available GPUs
  • Node selector mismatch: Pod requires GPU node but none match labels
  • Resource limits too high: Requested resources exceed node capacity

Solution: Check GPU availability:

kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, gpus:.status.allocatable}'

Issue 3: Out of Memory Errors

# Check pod memory usage
kubectl top pod -n ollama

# Increase memory limits in deployment
kubectl edit deployment -n ollama ollama

Solution: Large models require significant memory. Adjust resources.limits.memory based on model size:

  • 7B models: 8-12GB RAM
  • 13B models: 16-24GB RAM
  • 70B models: 64GB+ RAM
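
These figures follow a common rule of thumb rather than an official formula: weight memory is roughly the parameter count times bytes per weight, plus overhead for the KV cache and runtime buffers. A hedged sketch of that estimate (the 20% overhead factor is our assumption; real usage varies with context length and quantization):

```python
def estimated_model_memory_gb(params_billion, bits_per_weight=16, overhead=1.2):
    """Rough rule-of-thumb memory estimate for an LLM:
    weights = params * bits/8 bytes, plus ~20% overhead for
    KV cache and runtime buffers. Not an official formula."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at 4-bit quantization vs full fp16
print(round(estimated_model_memory_gb(7, bits_per_weight=4), 1))   # 4.2
print(round(estimated_model_memory_gb(7, bits_per_weight=16), 1))  # 16.8
```

This explains why a quantized 7B model fits comfortably in the 8GB limit above while an fp16 copy of the same model does not.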

Issue 4: Slow Inference Speed

# Verify GPU is actually being used
kubectl exec -n ollama -it <ollama-pod> -- nvidia-smi

# Check if model is loaded in GPU memory
kubectl logs -n ollama <ollama-pod> | grep -i "loaded"

Solution: Ensure:

  • GPU device plugin is running correctly
  • Container has GPU allocated (nvidia.com/gpu: "1" in resources)
  • Model is appropriate size for your GPU VRAM

Best Practices

1. Resource Management

# Use resource quotas to prevent GPU over-allocation
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ollama
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # Maximum 4 GPUs in namespace
    limits.nvidia.com/gpu: "4"

Explanation:

  • Prevents single namespace from consuming all GPU resources
  • Enforces fair sharing across teams/applications
  • Adjust based on cluster capacity

2. Node Affinity for GPU Pods

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - Tesla-V100-SXM2-32GB
            - NVIDIA-A100-SXM4-40GB

Explanation:

  • Targets specific GPU models for optimal performance
  • Useful when cluster has mixed GPU types
  • Ensures workloads run on appropriate hardware

3. Model Caching Strategy

# Use init container to pre-download models
initContainers:
- name: model-loader
  image: ollama/ollama:latest
  command:
  - /bin/sh
  - -c
  - |
    ollama pull llama2:7b
    ollama pull codellama:7b
  volumeMounts:
  - name: ollama-data
    mountPath: /root/.ollama

Explanation:

  • Pre-loads models before main container starts
  • Reduces startup time and API latency
  • Ensures models are ready when service receives requests

4. Auto-scaling Considerations

Note: GPU pods don’t auto-scale well due to GPU allocation granularity. Instead:

# Use Horizontal Pod Autoscaler for CPU-based replicas
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service  # Scale the API layer, not Ollama
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Explanation:

  • Scales stateless API frontend, not GPU backend
  • GPU pods remain static or manually scaled
  • Distributes request load across multiple API replicas
  • Triggers scale-up when CPU exceeds 70% average

Conclusion

You’ve successfully configured GPU support for Ollama in Kubernetes with:

  • NVIDIA and AMD GPU device plugins for hardware discovery
  • Ollama deployment with GPU resource allocation
  • Persistent storage for model caching
  • REST API service for external access
  • Monitoring with DCGM and nvidia-smi
  • Best practices for production deployments

Key Takeaways

  1. GPU resources are discrete: Always request whole GPUs (nvidia.com/gpu: "1")
  2. Memory matters: Match RAM allocation to model size
  3. Persistent storage is critical: Models can be 40GB+
  4. Monitor GPU utilization: Use DCGM or nvidia-smi for observability
  5. Scale the API layer: Keep GPU pods static, scale stateless components

Next Steps

  • Implement GPU time-slicing for better utilization across multiple workloads
  • Add Prometheus and Grafana for comprehensive GPU monitoring dashboards
  • Explore MIG (Multi-Instance GPU) for NVIDIA A100/H100 to partition GPUs
  • Implement model caching strategies to reduce cold start times
  • Set up CI/CD pipelines for automated model deployment

For production deployments, consider:

  • Security: Network policies, pod security standards, secrets management
  • Cost optimization: GPU spot instances, time-based scaling
  • Model versioning: GitOps workflows for model updates
  • Observability: Distributed tracing for request flows

Happy GPU-accelerated AI inference with Ollama on Kubernetes! 🚀



Have Queries? Join https://launchpass.com/collabnix
