How to Enable GPU Support in Kubernetes
Running large language models (LLMs) in Kubernetes with a server such as Ollama requires GPU acceleration for acceptable performance. This guide walks you through enabling NVIDIA and AMD GPU support in Kubernetes clusters, deploying Ollama with GPU resources, and building a sample AI application.
Prerequisites
Before enabling GPU support for Ollama in Kubernetes, ensure you have:
- A Kubernetes cluster (v1.24+)
- kubectl configured and connected to your cluster
- Nodes with NVIDIA or AMD GPUs
- Administrative access to cluster nodes
- Docker or containerd runtime
Understanding GPU Support in Kubernetes
Kubernetes treats GPUs as extended resources using the device plugin framework. GPU vendors provide device plugins that:
- Discover GPUs on cluster nodes
- Advertise GPU resources to the Kubernetes API server
- Allocate GPUs to pods requesting them
- Monitor GPU health and availability
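Once a device plugin is running, pods consume GPUs like any other resource by naming the vendor's extended resource in their resource limits. A minimal illustrative pod spec (the image tag is an example; the resource name `nvidia.com/gpu` is what the NVIDIA plugin advertises):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # the scheduler places this pod only on a node with a free GPU
```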
Enabling NVIDIA GPU Support
Step 1: Install NVIDIA Container Toolkit on Nodes
First, install the NVIDIA Container Toolkit on all GPU nodes. This toolkit enables containers to access NVIDIA GPUs.
```bash
# Add NVIDIA package repositories
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
Explanation:
- Line 2: Detects your Linux distribution to fetch the correct packages
- Line 3-4: Adds NVIDIA’s GPG key for package verification
- Line 5-7: Adds NVIDIA’s container toolkit repository to your package manager
- Line 10: Installs the NVIDIA Container Toolkit
- Line 13: Configures Docker to use the NVIDIA runtime for GPU access
- Line 14: Restarts Docker to apply the configuration
Step 2: Deploy NVIDIA Device Plugin
The NVIDIA device plugin daemonset runs on all GPU nodes and advertises GPU resources to Kubernetes.
```yaml
# nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```
Explanation:
- Lines 2-6: Defines a DaemonSet that ensures one pod runs on every GPU node
- Lines 18-21: Tolerations allow the plugin to run on nodes tainted with `nvidia.com/gpu`
- Line 22: `system-node-critical` keeps the plugin running even under resource pressure
- Line 24: Uses NVIDIA's official device plugin container image
- Line 26: `--fail-on-init-error=false` prevents crashes if GPUs aren't immediately available
- Lines 31-33: Mounts the device plugin socket directory for communication with the kubelet
- Lines 35-37: Exposes the host's device plugin directory to the container
Deploy the plugin:
```bash
kubectl apply -f nvidia-device-plugin.yaml
```
Verify GPU availability:
```bash
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.nvidia\\.com/gpu
```
Explanation: This command lists all nodes and shows how many NVIDIA GPUs each node advertises to Kubernetes.
Enabling AMD GPU Support
Step 1: Install AMD GPU Drivers and ROCm
Install AMD ROCm (Radeon Open Compute) platform on GPU nodes:
```bash
# Add AMD ROCm repository
wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/jammy/amdgpu-install_6.0.60000-1_all.deb
sudo apt-get install ./amdgpu-install_6.0.60000-1_all.deb

# Install ROCm
sudo amdgpu-install --usecase=rocm --no-dkms

# Add user to video and render groups
sudo usermod -a -G video,render $USER
```
Explanation:
- Line 2: Downloads AMD GPU installation package
- Line 3: Installs the AMD GPU package manager
- Line 6: Installs ROCm runtime without DKMS (Dynamic Kernel Module Support)
- Line 9: Adds current user to groups needed for GPU access
Step 2: Deploy AMD Device Plugin
```yaml
# amd-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: amdgpu-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: amdgpu-dp-ds
  template:
    metadata:
      labels:
        name: amdgpu-dp-ds
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - key: amd.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: amdgpu-dp-cntr
        image: rocm/k8s-device-plugin:latest
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: dp
          mountPath: /var/lib/kubelet/device-plugins
        - name: sys
          mountPath: /sys
      volumes:
      - name: dp
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: sys
        hostPath:
          path: /sys
```
Explanation:
- Lines 8-10: The label selector ties pods to this DaemonSet
- Lines 17-20: Tolerations allow scheduling on nodes with AMD GPU taints
- Line 23: Uses AMD's official ROCm device plugin image
- Lines 28-30: Mounts the device plugin directory for kubelet communication
- Lines 31-32: Mounts `/sys` for GPU hardware discovery
- Lines 37-39: Exposes host system information to detect AMD GPUs
Deploy the AMD plugin:
```bash
kubectl apply -f amd-device-plugin.yaml

# Verify AMD GPU availability
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.amd\\.com/gpu
```
Deploying Ollama with GPU Support
Step 3: Create Ollama Namespace and Deployment
First, create a dedicated namespace:
```bash
kubectl create namespace ollama
```
Explanation: Creates an isolated namespace for Ollama resources, providing organizational separation and resource management.
Step 4: Deploy Ollama with GPU Resources
```yaml
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0:11434"
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: "1"  # For NVIDIA GPUs
            # amd.com/gpu: "1"   # Uncomment for AMD GPUs
          limits:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
            # amd.com/gpu: "1"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP
```
Explanation:
- Line 8: A single replica ensures only one Ollama instance runs (LLMs are resource-intensive)
- Line 19: Uses the official Ollama container image
- Lines 20-22: Exposes port 11434 for the Ollama API
- Lines 24-25: The environment variable tells Ollama to listen on all interfaces
- Lines 27-36: Resource requests and limits:
  - `requests`: Guaranteed resources Kubernetes reserves for the pod
  - `limits`: Maximum resources the pod can consume
  - `nvidia.com/gpu: "1"`: Requests one NVIDIA GPU
  - GPUs are always requested in whole numbers (no fractional GPUs)
- Lines 38-40: Mounts persistent storage for model data
- Lines 42-44: References a PersistentVolumeClaim for model storage
- Lines 47-59: Service definition exposing Ollama internally within the cluster
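Note that Kubernetes treats `nvidia.com/gpu` as an extended resource, which cannot be overcommitted: if you specify both a request and a limit they must be equal, and specifying only the limit is sufficient because the request defaults to it. A minimal equivalent resources stanza:

```yaml
# Equivalent GPU declaration: extended resources may be set via limits alone;
# the request is implied and must match the limit anyway.
resources:
  limits:
    memory: "8Gi"
    cpu: "4"
    nvidia.com/gpu: "1"
```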
Step 5: Create Persistent Storage
```yaml
# ollama-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard  # Adjust based on your cluster
```
Explanation:
- Line 9: `ReadWriteOnce` means the volume can be mounted read-write by a single node
- Line 12: Requests 50GB of storage (LLM models can be large: Llama 7B ≈ 4GB, Llama 70B ≈ 40GB)
- Line 13: The storage class determines the type of storage provisioned (SSD, HDD, or cloud provider-specific)
Deploy all resources:
```bash
kubectl apply -f ollama-pvc.yaml
kubectl apply -f ollama-deployment.yaml
```
Verify deployment:
```bash
# Check pod status
kubectl get pods -n ollama

# Check GPU allocation
kubectl describe pod -n ollama -l app=ollama | grep -A 5 "Limits"

# View logs
kubectl logs -n ollama -l app=ollama
```
Explanation:
- Line 2: Lists all pods in the ollama namespace
- Line 5: Describes pod details and filters for resource limits to verify GPU allocation
- Line 8: Shows Ollama container logs for troubleshooting
Step 6: Load a Model into Ollama
```bash
# Port-forward to access Ollama locally
kubectl port-forward -n ollama svc/ollama-service 11434:11434 &

# Pull a model (using Llama 2 as example)
curl http://localhost:11434/api/pull -d '{
  "name": "llama2:7b"
}'

# Verify model is loaded
curl http://localhost:11434/api/tags
```
Explanation:
- Line 2: Creates a tunnel from your local machine to the Ollama service in Kubernetes
- Lines 5-7: Pulls the Llama 2 7B model into Ollama
- Line 10: Lists all available models in Ollama
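The pull endpoint reports its progress as a stream of newline-delimited JSON events rather than a single response. A minimal offline sketch of consuming such a stream (the `status` values below are illustrative samples, not an exhaustive schema of Ollama's pull output):

```python
import json

def parse_ollama_stream(lines):
    """Parse newline-delimited JSON events from an Ollama streaming endpoint.

    Returns the list of decoded events, raising if an error event appears.
    """
    events = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        if "error" in event:
            raise RuntimeError(f"Ollama reported an error: {event['error']}")
        events.append(event)
    return events

# Example with illustrative sample lines:
sample = [
    '{"status": "pulling manifest"}',
    '{"status": "downloading", "completed": 1024, "total": 4096}',
    '{"status": "success"}',
]
events = parse_ollama_stream(sample)
print(events[-1]["status"])  # -> success
```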
Building a Sample GPU-Powered Application
Let’s build a Python application that uses Ollama for text generation with GPU acceleration.
Application Code
```python
# ollama-client.py
import requests
import json
import time

class OllamaClient:
    def __init__(self, base_url="http://ollama-service.ollama.svc.cluster.local:11434"):
        """
        Initialize Ollama client

        Args:
            base_url: Ollama service URL (using Kubernetes DNS)
        """
        self.base_url = base_url
        self.generate_endpoint = f"{base_url}/api/generate"
        self.chat_endpoint = f"{base_url}/api/chat"

    def generate(self, model, prompt, stream=False):
        """
        Generate text using Ollama

        Args:
            model: Model name (e.g., "llama2:7b")
            prompt: Input prompt for generation
            stream: Whether to stream the response

        Returns:
            Generated text response
        """
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        try:
            response = requests.post(
                self.generate_endpoint,
                json=payload,
                timeout=300  # 5 minutes timeout for large models
            )
            response.raise_for_status()

            if stream:
                return self._handle_stream(response)
            else:
                return response.json()['response']
        except requests.exceptions.RequestException as e:
            print(f"Error communicating with Ollama: {e}")
            return None

    def _handle_stream(self, response):
        """Handle streaming responses from Ollama"""
        full_response = ""
        for line in response.iter_lines():
            if line:
                json_response = json.loads(line)
                if 'response' in json_response:
                    full_response += json_response['response']
                if json_response.get('done', False):
                    break
        return full_response

    def chat(self, model, messages):
        """
        Chat with Ollama using conversation history

        Args:
            model: Model name
            messages: List of message dictionaries with 'role' and 'content'

        Returns:
            Assistant's response
        """
        payload = {
            "model": model,
            "messages": messages,
            "stream": False
        }
        try:
            response = requests.post(self.chat_endpoint, json=payload, timeout=300)
            response.raise_for_status()
            return response.json()['message']['content']
        except requests.exceptions.RequestException as e:
            print(f"Error in chat: {e}")
            return None

# Example usage
if __name__ == "__main__":
    client = OllamaClient()

    # Simple generation example
    print("=== Simple Generation ===")
    start_time = time.time()
    response = client.generate(
        model="llama2:7b",
        prompt="Explain how GPUs accelerate machine learning in simple terms."
    )
    end_time = time.time()
    print(f"Response: {response}")
    print(f"Time taken: {end_time - start_time:.2f} seconds")

    # Chat example with conversation history
    print("\n=== Chat Example ===")
    messages = [
        {"role": "user", "content": "What is Kubernetes?"},
        {"role": "assistant", "content": "Kubernetes is an open-source container orchestration platform."},
        {"role": "user", "content": "How does it help with GPU workloads?"}
    ]
    chat_response = client.chat(model="llama2:7b", messages=messages)
    print(f"Chat Response: {chat_response}")
```
Explanation:
- Lines 6-15: The `__init__` method initializes the client with Ollama's Kubernetes service URL, using the Kubernetes DNS format `service-name.namespace.svc.cluster.local`
- Lines 17-48: The `generate()` method sends prompts to Ollama
  - Line 38: The 300-second timeout accommodates slow inference on large models
  - Line 43: Handles streaming responses chunk by chunk
  - Line 45: Returns the complete generated text for non-streaming requests
- Lines 50-61: The `_handle_stream()` method processes streaming responses line by line, concatenating text chunks as they arrive and stopping when `done: true` is received
- Lines 63-83: The `chat()` method maintains conversation context, accepting message history to preserve context across turns
- Lines 86-115: Example usage demonstrating both generation modes, timing inference to show the impact of GPU acceleration and demonstrating conversation history management
Kubernetes Deployment for Client App
```yaml
# ollama-client-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-client
  namespace: ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama-client
  template:
    metadata:
      labels:
        app: ollama-client
    spec:
      containers:
      - name: ollama-client
        image: python:3.11-slim
        command: ["/bin/sh"]
        args:
          - -c
          - |
            pip install requests flask
            python /app/ollama-client.py
        volumeMounts:
        - name: app-code
          mountPath: /app
        env:
        - name: OLLAMA_HOST
          value: "http://ollama-service.ollama.svc.cluster.local:11434"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: app-code
        configMap:
          name: ollama-client-code
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-client-code
  namespace: ollama
data:
  ollama-client.py: |
    # Paste the Python code from above here
```
Explanation:
- Line 8: Two replicas for high availability (the client doesn't need a GPU)
- Lines 20-25: Custom container command that installs the required Python packages and then runs the application
- Lines 26-28: Mounts the application code from a ConfigMap
- Lines 30-31: Environment variable for the Ollama service URL
- Lines 32-38: Resource requests and limits:
  - Modest resources, since the client does no GPU-intensive work
  - Requests guarantee minimum resources
  - Limits prevent resource hogging
- Lines 40-42: ConfigMap volume provides the application code
- Lines 45-52: The ConfigMap stores the application source code, decoupling code from the container image and allowing updates without rebuilding images
REST API Service
```python
# api-service.py
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

OLLAMA_URL = "http://ollama-service.ollama.svc.cluster.local:11434"

@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint"""
    return jsonify({"status": "healthy"}), 200

@app.route('/generate', methods=['POST'])
def generate():
    """
    Generate text from a prompt.

    Request body:
    {
        "model": "llama2:7b",
        "prompt": "Your prompt here",
        "temperature": 0.7,
        "max_tokens": 500
    }
    """
    try:
        data = request.get_json()

        # Validate required fields
        if 'model' not in data or 'prompt' not in data:
            return jsonify({"error": "Missing required fields: model and prompt"}), 400

        # Prepare Ollama request
        ollama_payload = {
            "model": data['model'],
            "prompt": data['prompt'],
            "stream": False,
            "options": {
                "temperature": data.get('temperature', 0.7),
                "num_predict": data.get('max_tokens', 500)
            }
        }

        # Call Ollama
        response = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json=ollama_payload,
            timeout=300
        )
        response.raise_for_status()
        result = response.json()
        return jsonify({
            "model": data['model'],
            "response": result['response'],
            "total_duration": result.get('total_duration', 0),
            "load_duration": result.get('load_duration', 0),
            "prompt_eval_count": result.get('prompt_eval_count', 0),
            "eval_count": result.get('eval_count', 0)
        }), 200

    except requests.exceptions.RequestException as e:
        return jsonify({"error": f"Ollama service error: {str(e)}"}), 500
    except Exception as e:
        return jsonify({"error": f"Internal server error: {str(e)}"}), 500

@app.route('/models', methods=['GET'])
def list_models():
    """List available models"""
    try:
        response = requests.get(f"{OLLAMA_URL}/api/tags", timeout=30)
        response.raise_for_status()
        return jsonify(response.json()), 200
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```
Explanation:
- Lines 9-12: Health check endpoint for Kubernetes liveness/readiness probes
- Lines 14-63: The `/generate` endpoint:
  - Lines 30-32: Validates required fields in the request
  - Lines 35-43: Prepares the payload for Ollama with configurable parameters
  - Line 40: Temperature controls randomness (0 = deterministic, 1 = creative)
  - Line 41: `num_predict` limits the maximum number of tokens generated
  - Lines 45-50: Forwards the request to the Ollama service
  - Lines 53-60: Returns the response with performance metrics:
    - `total_duration`: Total inference time
    - `load_duration`: Model loading time
    - `prompt_eval_count`: Number of prompt tokens processed
    - `eval_count`: Number of tokens generated
- Lines 68-75: Lists all available models in Ollama
- Line 78: Exposes the API on all interfaces for Kubernetes access
Complete Deployment with Service
```yaml
# api-service-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-service-code
  namespace: ollama
data:
  api-service.py: |
    # Paste the Flask API code from above
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: ollama
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
      - name: api-service
        image: python:3.11-slim
        command: ["/bin/sh"]
        args:
          - -c
          - |
            pip install flask requests
            python /app/api-service.py
        ports:
        - containerPort: 8080
          name: http
        volumeMounts:
        - name: code
          mountPath: /app
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: code
        configMap:
          name: api-service-code
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: ollama
spec:
  selector:
    app: api-service
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer  # Change to ClusterIP for internal-only access
```
Explanation:
- Lines 2-9: ConfigMap stores the Flask application code
- Line 17: Three replicas for load balancing and high availability
- Lines 29-34: Installation command runs on container startup
- Lines 35-37: Exposes port 8080 for API traffic
- Lines 41-52: Health probes:
  - `livenessProbe`: Kubernetes restarts the container if this fails
  - `readinessProbe`: Kubernetes routes traffic only when this succeeds
  - `initialDelaySeconds`: Delay before the first check (allows app startup)
  - `periodSeconds`: Frequency of health checks
- Lines 53-59: Conservative resource limits (the API is lightweight)
- Lines 66-76: Service configuration:
  - `type: LoadBalancer`: Exposes the API externally (cloud providers assign an external IP)
  - Alternative: Use `ClusterIP` for internal-only access
Deploy and test:
```bash
# Deploy the API service
kubectl apply -f api-service-deployment.yaml

# Get service URL (for LoadBalancer)
kubectl get svc -n ollama api-service

# Test the API
export API_URL=$(kubectl get svc -n ollama api-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -X POST http://$API_URL/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Write a haiku about Kubernetes and GPUs",
    "temperature": 0.8,
    "max_tokens": 100
  }'

# List available models
curl http://$API_URL/models
```
Explanation:
- Line 2: Deploys API service to Kubernetes
- Line 5: Lists services to find external IP
- Line 8: Extracts LoadBalancer IP programmatically
- Lines 10-17: Tests text generation with custom parameters
- Line 20: Lists models to verify Ollama integration
Monitoring GPU Usage
Deploy NVIDIA DCGM Exporter (for NVIDIA GPUs)
```yaml
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
          capabilities:
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
        env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter
  ports:
  - port: 9400
    targetPort: 9400
    name: metrics
```
Explanation:
- Line 16-17: nodeSelector ensures DCGM runs only on GPU nodes
- Line 20: NVIDIA’s Data Center GPU Manager exports GPU metrics
- Lines 24-28: Security context grants privileges needed to access GPU information
- Lines 29-31: Pod resources directory provides GPU allocation details
- Lines 32-36: Environment variables configure Prometheus metrics export
- Lines 43-55: Service exposes metrics for Prometheus scraping
Query GPU metrics:
```bash
# Port-forward to access metrics
kubectl port-forward -n monitoring svc/dcgm-exporter 9400:9400

# View GPU utilization
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL

# View GPU memory usage
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_FB_USED
```
Explanation:
- Line 2: Creates local access to DCGM metrics
- Line 5: Filters for GPU utilization percentage (0-100%)
- Line 8: Filters for GPU memory usage in MB
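If you want to act on these metrics from a script rather than read them by eye, the Prometheus exposition format is simple to parse. A minimal sketch (the sample payload is illustrative, not captured from a real exporter; a production tool should use a Prometheus client library instead):

```python
def parse_gauges(metrics_text, metric_name):
    """Return {sample_line_name_and_labels: value} for every sample of a metric."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip "# HELP" / "# TYPE" comments and blank lines
        if not line.startswith(metric_name):
            continue
        # Each sample looks like: name{labels} value
        name_and_labels, _, value = line.rpartition(" ")
        samples[name_and_labels] = float(value)
    return samples

# Illustrative sample of dcgm-exporter output:
sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-def"} 12
"""

utils = parse_gauges(sample, "DCGM_FI_DEV_GPU_UTIL")
print(max(utils.values()))  # -> 87.0
```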
Check GPU Usage Directly
```bash
# Execute nvidia-smi inside the Ollama pod
kubectl exec -n ollama -it $(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi

# Watch GPU usage in real time
kubectl exec -n ollama -it $(kubectl get pod -n ollama -l app=ollama -o jsonpath='{.items[0].metadata.name}') -- watch -n 1 nvidia-smi
```
Explanation:
- Line 2: Runs nvidia-smi inside the Ollama container to show GPU status
- Shows GPU utilization, memory usage, temperature, power draw
- Line 5: Continuously monitors GPU with 1-second refresh
- Useful for observing GPU activity during inference
Troubleshooting Common Issues
Issue 1: GPUs Not Detected
```bash
# Check if the device plugin is running
kubectl get pods -n kube-system | grep device-plugin

# Check node GPU labels
kubectl describe node <node-name> | grep -i gpu

# Verify container runtime configuration
docker info | grep -i nvidia  # For Docker
```
Solution: Ensure NVIDIA Container Toolkit is installed and Docker/containerd is configured with GPU support.
Issue 2: Pod Stuck in Pending State
```bash
# Check why the pod is pending
kubectl describe pod -n ollama <pod-name> | grep -A 10 Events
```
Common causes:
- Insufficient GPU resources: No nodes have available GPUs
- Node selector mismatch: Pod requires GPU node but none match labels
- Resource limits too high: Requested resources exceed node capacity
Solution: Check GPU availability:
```bash
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpus: .status.allocatable}'
```
Issue 3: Out of Memory Errors
```bash
# Check pod memory usage
kubectl top pod -n ollama

# Increase memory limits in the deployment
kubectl edit deployment -n ollama ollama
```
Solution: Large models require significant memory. Adjust resources.limits.memory based on model size:
- 7B models: 8-12GB RAM
- 13B models: 16-24GB RAM
- 70B models: 64GB+ RAM
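The sizing guidance above can be folded into a small helper when templating deployments. The thresholds are this guide's rough rules of thumb, not exact requirements; actual usage depends on quantization and context length:

```python
def recommended_memory_limit_gi(params_billions):
    """Rough RAM limit (in Gi) for an Ollama pod, per this guide's rules of thumb."""
    if params_billions <= 7:
        return 12   # 7B-class models: 8-12 Gi; use the upper bound
    if params_billions <= 13:
        return 24   # 13B-class models: 16-24 Gi
    return 64       # 70B-class and larger: 64 Gi or more

print(recommended_memory_limit_gi(7))   # -> 12
print(recommended_memory_limit_gi(70))  # -> 64
```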
Issue 4: Slow Inference Speed
```bash
# Verify the GPU is actually being used
kubectl exec -n ollama -it <ollama-pod> -- nvidia-smi

# Check if the model is loaded in GPU memory
kubectl logs -n ollama <ollama-pod> | grep -i "loaded"
```
Solution: Ensure:
- The GPU device plugin is running correctly
- The container has a GPU allocated (`nvidia.com/gpu: "1"` in its resources)
- The model fits within your GPU's VRAM
Best Practices
1. Resource Management
```yaml
# Use resource quotas to prevent GPU over-allocation
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ollama
spec:
  hard:
    requests.nvidia.com/gpu: "4"  # Maximum of 4 GPUs in the namespace
    limits.nvidia.com/gpu: "4"
```
Explanation:
- Prevents single namespace from consuming all GPU resources
- Enforces fair sharing across teams/applications
- Adjust based on cluster capacity
2. Node Affinity for GPU Pods
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - Tesla-V100-SXM2-32GB
            - NVIDIA-A100-SXM4-40GB
```
Explanation:
- Targets specific GPU models for optimal performance
- Useful when cluster has mixed GPU types
- Ensures workloads run on appropriate hardware
3. Model Caching Strategy
```yaml
# Use an init container to pre-download models
initContainers:
- name: model-loader
  image: ollama/ollama:latest
  command:
    - /bin/sh
    - -c
    - |
      # `ollama pull` talks to a running server, so start one temporarily
      ollama serve &
      sleep 5
      ollama pull llama2:7b
      ollama pull codellama:7b
  volumeMounts:
  - name: ollama-data
    mountPath: /root/.ollama
```
Explanation:
- Pre-loads models before main container starts
- Reduces startup time and API latency
- Ensures models are ready when service receives requests
4. Auto-scaling Considerations
Note: GPU pods don’t auto-scale well due to GPU allocation granularity. Instead:
```yaml
# Use a Horizontal Pod Autoscaler for CPU-based replicas
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service  # Scale the API layer, not Ollama
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
Explanation:
- Scales stateless API frontend, not GPU backend
- GPU pods remain static or manually scaled
- Distributes request load across multiple API replicas
- Triggers scale-up when CPU exceeds 70% average
Conclusion
You’ve successfully configured GPU support for Ollama in Kubernetes with:
✅ NVIDIA and AMD GPU device plugins for hardware discovery
✅ Ollama deployment with GPU resource allocation
✅ Persistent storage for model caching
✅ REST API service for external access
✅ Monitoring with DCGM and nvidia-smi
✅ Best practices for production deployments
Key Takeaways
- GPU resources are discrete: Always request whole GPUs (`nvidia.com/gpu: "1"`)
- Memory matters: Match RAM allocation to model size
- Persistent storage is critical: Models can be 40GB+
- Monitor GPU utilization: Use DCGM or nvidia-smi for observability
- Scale the API layer: Keep GPU pods static and scale stateless components
Next Steps
- Implement GPU time-slicing for better utilization across multiple workloads
- Add Prometheus and Grafana for comprehensive GPU monitoring dashboards
- Explore MIG (Multi-Instance GPU) for NVIDIA A100/H100 to partition GPUs
- Implement model caching strategies to reduce cold start times
- Set up CI/CD pipelines for automated model deployment
For production deployments, consider:
- Security: Network policies, pod security standards, secrets management
- Cost optimization: GPU spot instances, time-based scaling
- Model versioning: GitOps workflows for model updates
- Observability: Distributed tracing for request flows
Happy GPU-accelerated AI inference with Ollama on Kubernetes! 🚀