As large language models (LLMs) become increasingly essential for modern applications, running them efficiently at scale presents unique challenges. Ollama has emerged as a powerful tool for running LLMs locally, and when combined with Kubernetes, it provides a robust platform for deploying multiple models simultaneously. This comprehensive guide walks you through the architecture, deployment strategies, and best practices for running multiple Ollama models on Kubernetes.
Why Run Ollama on Kubernetes?
Before diving into implementation details, let’s understand why Kubernetes is an ideal platform for Ollama deployments:
- Resource Management: Kubernetes provides fine-grained control over CPU, memory, and GPU allocation across multiple models
- Scalability: Horizontal pod autoscaling enables dynamic scaling based on demand
- High Availability: Built-in health checks and automatic restarts ensure model availability
- Multi-tenancy: Run different models with isolated resources and access controls
- Cost Optimization: Efficient resource utilization through node affinity and scheduling policies
Architecture Overview
A typical multi-model Ollama deployment on Kubernetes consists of several key components:
- Persistent Volumes: Store model files that can be shared across pods
- StatefulSets or Deployments: Manage Ollama containers with specific model configurations
- Services: Expose models via ClusterIP or LoadBalancer
- ConfigMaps: Store model configurations and startup scripts
- Resource Quotas: Ensure fair resource distribution across models (a sample quota follows this list)
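Most of these components appear as manifests later in this guide; resource quotas are the exception, so here is a minimal sketch of one for the ollama namespace created in the next section. The hard limits are purely illustrative and should be sized to your cluster:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota
  namespace: ollama
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi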
Prerequisites
Before proceeding, ensure you have:
- A running Kubernetes cluster (v1.24+)
- kubectl configured with cluster access
- Sufficient storage (50GB+ recommended per model)
- GPU support configured (optional but recommended for production)
- Basic understanding of Kubernetes concepts
Setting Up Persistent Storage
First, create a namespace and a PersistentVolumeClaim to store Ollama models. Sharing one claim across pods lets them reuse the same model files and reduces storage overhead. The Namespace is declared first so kubectl apply creates it before the PVC that references it. Note that ReadWriteMany access requires a storage class that supports it (for example NFS or CephFS); adjust storageClassName to match your cluster.
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: standard
  resources:
    requests:
      storage: 100Gi
Apply the configuration:
kubectl apply -f ollama-storage.yaml
kubectl get pvc -n ollama
Deploying Your First Ollama Model
Let’s deploy Llama 2 as our first model. Create a deployment configuration that pulls the model on initialization:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama2
  namespace: ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
      model: llama2
  template:
    metadata:
      labels:
        app: ollama
        model: llama2
    spec:
      initContainers:
        - name: pull-model
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              ollama serve &
              sleep 10;
              ollama pull llama2;
              pkill ollama
          volumeMounts:
            - name: ollama-storage
              mountPath: /root/.ollama
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
              name: http
          env:
            - name: OLLAMA_HOST
              value: "0.0.0.0"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: ollama-storage
              mountPath: /root/.ollama
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 15
            periodSeconds: 5
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-models-pvc
Creating Services for Model Access
Expose each model through a dedicated service for clean separation and routing:
apiVersion: v1
kind: Service
metadata:
  name: ollama-llama2-service
  namespace: ollama
spec:
  selector:
    app: ollama
    model: llama2
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: ClusterIP
Deploying Multiple Models Simultaneously
Now let’s deploy a second model, Mistral; the same pattern extends to other models such as CodeLlama. Create the configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-mistral
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
      model: mistral
  template:
    metadata:
      labels:
        app: ollama
        model: mistral
    spec:
      initContainers:
        - name: pull-model
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              ollama serve &
              sleep 10;
              ollama pull mistral;
              pkill ollama
          volumeMounts:
            - name: ollama-storage
              mountPath: /root/.ollama
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: ollama-storage
              mountPath: /root/.ollama
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-mistral-service
  namespace: ollama
spec:
  selector:
    app: ollama
    model: mistral
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
  type: ClusterIP
Implementing GPU Support
For production workloads, GPU acceleration is crucial. Here’s how to configure GPU support:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama2-gpu
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
      model: llama2-gpu
  template:
    metadata:
      labels:
        app: ollama
        model: llama2-gpu
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
      initContainers:
        - name: pull-model
          image: ollama/ollama:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              ollama serve &
              sleep 10;
              ollama pull llama2;
              pkill ollama
          volumeMounts:
            - name: ollama-storage
              mountPath: /root/.ollama
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
            limits:
              memory: "16Gi"
              cpu: "8"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: ollama-storage
              mountPath: /root/.ollama
      volumes:
        - name: ollama-storage
          persistentVolumeClaim:
            claimName: ollama-models-pvc
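The nodeSelector above assumes that GPU nodes carry an accelerator=nvidia-gpu label and that the NVIDIA device plugin is installed so nvidia.com/gpu appears as an allocatable resource. A quick way to label a node and confirm the plugin is working (replace <gpu-node-name> with your node):

# Label GPU nodes so the nodeSelector can schedule onto them
kubectl label nodes <gpu-node-name> accelerator=nvidia-gpu

# Confirm the NVIDIA device plugin is advertising GPUs on the node
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu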
Testing Your Deployment
Verify that all models are running correctly:
# Check pod status
kubectl get pods -n ollama
# Check services
kubectl get svc -n ollama
# Port forward to test locally
kubectl port-forward -n ollama svc/ollama-llama2-service 11434:11434
Test the model with a curl command:
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?",
"stream": false
}'
Implementing an Ingress Controller
For external access, configure an Ingress resource with path-based routing. The capture-group rewrite below strips the model prefix, so a request to /llama2/api/generate reaches the Ollama backend as /api/generate:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.yourdomain.com
      http:
        paths:
          - path: /llama2(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: ollama-llama2-service
                port:
                  number: 11434
          - path: /mistral(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: ollama-mistral-service
                port:
                  number: 11434
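With the rewrite rule in place, a request to a model prefix reaches the corresponding Ollama API. For example, assuming DNS for ollama.yourdomain.com points at your ingress controller:

curl http://ollama.yourdomain.com/llama2/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'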
Implementing Horizontal Pod Autoscaling
Scale your models based on CPU and memory utilization:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-llama2-hpa
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-llama2
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
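The HPA relies on the metrics-server (or another metrics API provider) being installed in the cluster to read CPU and memory usage. Confirm the autoscaler is seeing live metrics:

# Check current and target utilization
kubectl get hpa -n ollama
kubectl describe hpa ollama-llama2-hpa -n ollama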
Monitoring and Observability
Ollama does not expose Prometheus metrics out of the box, so full integration typically pairs an exporter sidecar with a ServiceMonitor. As a lightweight starting point, create a ConfigMap with a script that periodically polls the Ollama API:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-exporter
  namespace: ollama
data:
  monitor.sh: |
    #!/bin/bash
    while true; do
      curl -s http://localhost:11434/api/tags | jq .
      sleep 30
    done
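To run this script alongside a model, you could mount the ConfigMap into a sidecar container in the Ollama pod template. A minimal sketch, assuming an Alpine image with bash, curl, and jq installed at startup; because the sidecar shares the pod's network namespace, localhost:11434 reaches Ollama directly:

      containers:
        # ... existing ollama container spec stays as-is ...
        - name: ollama-monitor
          image: alpine:3.19
          command: ["/bin/sh", "-c"]
          args:
            - apk add --no-cache bash curl jq && bash /scripts/monitor.sh
          volumeMounts:
            - name: monitor-script
              mountPath: /scripts
      volumes:
        # ... existing ollama-storage volume stays as-is ...
        - name: monitor-script
          configMap:
            name: ollama-exporter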
Troubleshooting Common Issues
Model Download Failures
If models fail to download during initialization:
# Check init container logs
kubectl logs -n ollama <pod-name> -c pull-model
# Increase init container timeout
kubectl patch deployment ollama-llama2 -n ollama -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"pull-model","command":["/bin/sh","-c","timeout 600 ..."]}]}}}}'
Out of Memory Errors
Adjust resource limits if pods are being OOMKilled:
# Check resource usage
kubectl top pods -n ollama
# Increase memory limits
kubectl set resources deployment ollama-llama2 -n ollama --limits=memory=16Gi
Slow Response Times
Investigate performance issues:
# Check if GPU is being utilized
kubectl exec -n ollama <pod-name> -- nvidia-smi
# Monitor request latency
kubectl logs -n ollama <pod-name> --tail=100
Best Practices for Production
- Resource Isolation: Use separate namespaces for different model categories
- Model Versioning: Tag deployments with model versions for easy rollbacks
- Caching Strategy: Share model storage across pods using ReadWriteMany PVCs
- Security: Implement NetworkPolicies to restrict model access (see the example after this list)
- Cost Management: Use node affinity to schedule models on appropriate instance types
- Backup Strategy: Regularly backup model configurations and fine-tuned weights
- Rate Limiting: Implement request throttling at the Ingress level
- Health Checks: Configure appropriate liveness and readiness probes
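To illustrate the NetworkPolicy item above, the following sketch allows traffic to the Ollama pods only from within the ollama namespace and from the ingress controller's namespace. The ingress-nginx namespace name is an assumption; adjust the selector to match your cluster:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-restrict-access
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Pods in the same namespace (for example, a model router)
        - podSelector: {}
        # The ingress controller namespace (name assumed to be ingress-nginx)
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 11434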
Advanced Configuration: Model Router
Create a simple Python-based router that inspects the model field of each request and forwards it to the matching Ollama service:
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

# In-cluster DNS names of the per-model Services
MODEL_ENDPOINTS = {
    "llama2": "http://ollama-llama2-service.ollama.svc.cluster.local:11434",
    "mistral": "http://ollama-mistral-service.ollama.svc.cluster.local:11434",
}

@app.route('/api/generate', methods=['POST'])
def generate():
    data = request.json
    model = data.get('model', 'llama2')
    if model not in MODEL_ENDPOINTS:
        return jsonify({"error": "Model not found"}), 404
    # Forward the request to the matching Ollama service and relay the response
    response = requests.post(
        f"{MODEL_ENDPOINTS[model]}/api/generate",
        json=data,
        timeout=300,
    )
    return jsonify(response.json()), response.status_code

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
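Once the router is packaged into an image and exposed behind its own Service (not shown here), clients send every request to a single endpoint and pick the model per request; the hostname below is a placeholder:

curl http://ollama-router.yourdomain.com/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'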
Conclusion
Running multiple Ollama models on Kubernetes provides a scalable, production-ready platform for deploying LLMs. By leveraging Kubernetes’ orchestration capabilities, you can efficiently manage resources, ensure high availability, and scale based on demand. The configurations and practices outlined in this guide provide a solid foundation for building robust AI inference infrastructure.
As you scale your deployment, consider implementing advanced features like model caching, request queuing, and multi-region deployments. Regular monitoring and optimization will ensure your Ollama deployment continues to meet performance and cost objectives.
Start with a single model deployment, validate your setup, and gradually expand to multiple models as your requirements grow. The flexibility of Kubernetes combined with Ollama’s simplicity creates a powerful platform for modern AI applications.