Ollama has emerged as one of the most popular tools for running large language models (LLMs) locally, providing developers and organizations with a simple way to deploy and interact with models like Llama, Mistral, and CodeLlama without relying on external APIs. By packaging these powerful AI models into an easy-to-use interface, Ollama democratizes access to large language models while maintaining complete control over data privacy and model execution. However, as organizations scale their AI workloads and require high availability, fault tolerance, and resource management, running Ollama on individual machines becomes limiting.
This is where Kubernetes comes into play. By deploying Ollama on Kubernetes, organizations can leverage the platform’s robust orchestration capabilities to scale AI workloads horizontally, ensure high availability through automatic failover and restart mechanisms, and efficiently manage GPU resources across multiple nodes. This blog post explores how to successfully deploy Ollama on Kubernetes, addressing the unique challenges of running GPU-intensive AI workloads in a containerized environment. We’ll cover everything from basic deployment configurations to advanced topics like resource allocation, persistent storage for models, and load balancing strategies that enable you to build a production-ready AI infrastructure that can handle multiple concurrent requests while maintaining optimal performance and cost efficiency.
The rest of this guide walks through the deployment step by step: namespace setup and quotas, persistent storage, RBAC, the Ollama deployment and services, testing, and auto-scaling.
Prerequisites
Before we begin, ensure you have the following tools installed:
# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

# Verify installation
kubectl version --client

# Install Helm (for easier deployments)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Verify cluster access
kubectl cluster-info
Architecture Overview
Our deployment architecture includes:
- Kubernetes Cluster: The orchestration platform
- Ollama Pods: Running LLM inference workloads
- Persistent Storage: For model persistence
- Load Balancer: For traffic distribution
- Monitoring Stack: For observability
Namespace Setup
Create a dedicated namespace for LLM workloads:
apiVersion: v1
kind: Namespace
metadata:
  name: llm-workloads
  labels:
    name: llm-workloads
    purpose: ai-inference
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: llm-quota
  namespace: llm-workloads
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "5"
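A ResourceQuota on `requests` and `limits` causes the namespace to reject any pod whose containers omit them. A LimitRange can supply defaults so such pods are still admitted; this is a sketch, and the default values here are illustrative rather than tuned:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: llm-defaults
  namespace: llm-workloads
spec:
  limits:
  - type: Container
    default:          # applied when a container sets no limits
      cpu: "2"
      memory: 4Gi
    defaultRequest:   # applied when a container sets no requests
      cpu: "1"
      memory: 2Gi
```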
Storage Configuration
Set up persistent storage for model data:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: llm-workloads
  labels:
    app: ollama
    component: storage
spec:
  accessModes:
  # NOTE: a ReadWriteOnce volume can only attach to a single node. If you
  # run multiple replicas spread across nodes, use ReadWriteMany (backed by
  # NFS/EFS or similar) or give each pod its own volume via a StatefulSet.
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-cache-pvc
  namespace: llm-workloads
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 50Gi
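The PVCs above reference a `fast-ssd` StorageClass that may not exist in your cluster. A minimal sketch for AWS EBS gp3 follows; the provisioner name and parameters are assumptions that differ per cloud and require the corresponding CSI driver to be installed:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # assumption: AWS EBS CSI driver installed
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
```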
Security Configuration
Create service account and RBAC:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ollama-sa
  namespace: llm-workloads
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: llm-workloads
  name: ollama-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ollama-rolebinding
  namespace: llm-workloads
subjects:
- kind: ServiceAccount
  name: ollama-sa
  namespace: llm-workloads
roleRef:
  kind: Role
  name: ollama-role
  apiGroup: rbac.authorization.k8s.io
Main Deployment Configuration
Create the Ollama Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: llm-workloads
  labels:
    app: ollama
    component: inference
    version: v1
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
        component: inference
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "11434"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: ollama-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
      - name: ollama
        image: ollama/ollama:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: OLLAMA_MODELS
          value: "/models"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: ollama-models
          mountPath: /models
        - name: ollama-cache
          mountPath: /tmp
        - name: shared-memory
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc
      - name: ollama-cache
        persistentVolumeClaim:
          claimName: ollama-cache-pvc
      - name: shared-memory
        emptyDir:
          medium: Memory
          sizeLimit: 2Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ollama
              topologyKey: kubernetes.io/hostname
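The rolling-update settings above (maxUnavailable: 0) protect against update-induced downtime, but not against voluntary disruptions such as node drains during cluster maintenance. A PodDisruptionBudget covers that case; this sketch assumes the same `app: ollama` label used throughout:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: llm-workloads
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ollama
```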
Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llm-workloads
  labels:
    app: ollama
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
    name: http
  selector:
    app: ollama
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-headless
  namespace: llm-workloads
  labels:
    app: ollama
spec:
  clusterIP: None
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
    name: http
  selector:
    app: ollama
Testing and Validation
Test the deployment with comprehensive checks:
# Check deployment status
kubectl get pods -n llm-workloads -l app=ollama
kubectl get pvc -n llm-workloads
kubectl get svc -n llm-workloads
# Get service endpoint
EXTERNAL_IP=$(kubectl get service ollama-service -n llm-workloads -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
# AWS NLBs publish a hostname instead of an IP
EXTERNAL_HOST=$(kubectl get service ollama-service -n llm-workloads -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Ollama endpoint: ${EXTERNAL_IP:-$EXTERNAL_HOST}"
# Port forward for local testing
kubectl port-forward -n llm-workloads service/ollama-service 11434:11434 &
# Wait for port forward
sleep 5
# Test basic connectivity
curl -f http://localhost:11434/api/tags || echo "Service not ready yet"
# Install a model
echo "📥 Installing Llama2 7B model..."
curl -X POST http://localhost:11434/api/pull \
-H "Content-Type: application/json" \
-d '{"name": "llama2:7b"}'
# Test inference
echo "🧠 Testing inference..."
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama2:7b",
"prompt": "Explain Kubernetes in simple terms",
"stream": false
}' | jq '.response'
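Beyond curl, the same /api/generate endpoint can be called from application code. A minimal Python sketch using only the standard library; the host, model name, and timeout are assumptions to adapt to your setup:

```python
import json
from urllib import request

OLLAMA_HOST = "http://localhost:11434"  # assumption: the port-forward above is active

def build_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize the request body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def generate(prompt: str, model: str = "llama2:7b", timeout: int = 120) -> str:
    """Send a non-streaming generate request and return the response text."""
    req = request.Request(
        OLLAMA_HOST + "/api/generate",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Once the port-forward is running, `generate("Explain Kubernetes in simple terms")` returns the same text the curl command prints.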
Horizontal Pod Autoscaler
Configure auto-scaling based on resource usage:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: llm-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
Troubleshooting Guide
Common issues and solutions:
# Check pod logs
kubectl logs -n llm-workloads deployment/ollama-deployment -f

# Check resource usage
kubectl top pods -n llm-workloads

# Describe problematic pods
kubectl describe pod -n llm-workloads -l app=ollama

# Check PVC status
kubectl describe pvc -n llm-workloads

# Check events
kubectl get events -n llm-workloads --sort-by='.lastTimestamp'

# Debug networking
kubectl exec -it -n llm-workloads deployment/ollama-deployment -- /bin/bash

# Test from within cluster
kubectl run debug --image=curlimages/curl -it --rm -- sh
curl http://ollama-service.llm-workloads.svc.cluster.local:11434/api/tags
Production Best Practices
For production deployments, consider:
- Resource Management: Set appropriate resource requests and limits
- Storage: Use high-performance SSDs with proper backup strategies
- Security: Implement network policies and pod security standards
- Monitoring: Comprehensive observability with Prometheus and Grafana
- Backup: Regular backup of model data and configurations
- Updates: Rolling update strategy for zero-downtime deployments
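As a starting point for the network-policy item above, here is a minimal sketch that admits traffic to the Ollama port only from pods in the same namespace; adapt the `from` selectors to wherever your clients or ingress controller actually run:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-ingress-policy
  namespace: llm-workloads
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}   # any pod in the same namespace
    ports:
    - protocol: TCP
      port: 11434
```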
Cleanup
To remove the deployment:
# Delete all resources
kubectl delete namespace llm-workloads

# Or delete individually
kubectl delete -f ollama-deployment.yaml
kubectl delete -f ollama-service.yaml
kubectl delete -f ollama-storage.yaml
kubectl delete -f ollama-rbac.yaml
Conclusion
This tutorial provided a comprehensive guide to deploying Ollama on Kubernetes with production-ready configurations. The setup includes persistent storage, security, monitoring, and auto-scaling capabilities.