Getting Started with Ollama Kubernetes Setup
With the rapid adoption of Large Language Models (LLMs) in enterprise applications, running models locally has become crucial for data privacy, cost control, and reduced latency. Ollama simplifies running LLMs locally, while Kubernetes provides the orchestration needed for production deployments.
In this comprehensive guide, we’ll explore how to deploy Ollama on Kubernetes, enabling you to run powerful AI models like Llama 2, CodeLlama, and Mistral in your own infrastructure.
Why Ollama on Kubernetes?
Benefits of This Architecture
- Data Privacy: Keep sensitive data within your infrastructure
- Cost Efficiency: Eliminate API costs for high-volume applications
- Low Latency: Local inference without external API calls
- Scalability: Kubernetes auto-scaling for variable workloads
- Resource Management: Efficient GPU/CPU allocation across the cluster
Prerequisites
Before we begin, ensure you have:
- Kubernetes cluster (1.19+) with GPU support (optional but recommended)
- kubectl configured and connected to your cluster
- Docker registry access for custom images
- Basic understanding of Kubernetes concepts (Pods, Services, Deployments)
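To sanity-check these prerequisites from your workstation, a few optional commands (assuming kubectl is installed and, for GPU clusters, the NVIDIA device plugin is running):

```bash
# Confirm kubectl can reach the cluster and check versions
kubectl version
kubectl cluster-info

# List nodes; if you plan to use GPUs, check that the GPU resource is advertised
kubectl get nodes -o wide
kubectl describe nodes | grep "nvidia.com/gpu"
```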
Step 1: Creating the Ollama Deployment
Basic Ollama Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
```
GPU-Enabled Deployment
For better performance with larger models:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-gpu
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-gpu
  template:
    metadata:
      labels:
        app: ollama-gpu
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2000m"
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8000m"
      nodeSelector:
        accelerator: nvidia-tesla-k80
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
```
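A note on this manifest: requesting nvidia.com/gpu only works if the NVIDIA device plugin is running on your GPU nodes, and the accelerator: nvidia-tesla-k80 node selector is just an example label, so adjust it to whatever labels your nodes actually carry. A quick way to confirm the GPU resource is being advertised:

```bash
# Verify that nodes advertise the nvidia.com/gpu resource
# (requires the NVIDIA device plugin to be running on GPU nodes)
kubectl describe nodes | grep "nvidia.com/gpu"
```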
Step 2: Persistent Storage Configuration
Creating Persistent Volume Claim
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: fast-ssd
```
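The fast-ssd storage class referenced here is a placeholder; use whatever SSD-backed class your cluster provides. Purely as an illustration, on a cluster using the AWS EBS CSI driver such a class might look like the sketch below (provisioner and parameters will differ on other platforms):

```yaml
# Illustrative only: an SSD-backed StorageClass using the AWS EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```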
Why Persistent Storage Matters
- Model Persistence: Downloaded models persist across pod restarts
- Performance: Avoid re-downloading large models (7GB+ for Llama 2)
- Cost Efficiency: Reduce egress costs from model repositories
Step 3: Service and Ingress Configuration
ClusterIP Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama-system
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP
```
Ingress for External Access
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama-system
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  ingressClassName: nginx  # assumes your ingress-nginx class is named "nginx"
  rules:
  - host: ollama.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434
```
Step 4: Deploying to Kubernetes
Create Namespace and Deploy
```bash
# Create dedicated namespace
kubectl create namespace ollama-system

# Apply all configurations
kubectl apply -f ollama-pvc.yaml
kubectl apply -f ollama-deployment.yaml
kubectl apply -f ollama-service.yaml
kubectl apply -f ollama-ingress.yaml

# Verify deployment
kubectl get pods -n ollama-system
kubectl logs -f deployment/ollama -n ollama-system
```
Step 5: Model Management
Pulling Models via Job
Create a Job to pre-pull popular models:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-model-loader
  namespace: ollama-system
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: model-loader
        image: curlimages/curl:latest
        command:
        - sh
        - -c
        - |
          # Wait for Ollama service
          until curl -f http://ollama-service:11434/api/version; do
            echo "Waiting for Ollama..."
            sleep 5
          done

          # Pull essential models
          curl -X POST http://ollama-service:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "llama2:7b"}'

          curl -X POST http://ollama-service:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "codellama:7b"}'
```
Available Models for Different Use Cases
- General Purpose: `llama2:7b`, `llama2:13b`
- Code Generation: `codellama:7b`, `codellama:13b`
- Lightweight: `mistral:7b`, `neural-chat:7b`
- Specialized: `vicuna:7b`, `wizard-coder:7b`
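After the loader Job completes, you can confirm which models are actually present by querying Ollama's /api/tags endpoint, for example through a temporary port-forward:

```bash
# Forward the service locally and list the installed models
kubectl port-forward service/ollama-service 11434:11434 -n ollama-system &
curl http://localhost:11434/api/tags
```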
Step 6: Scaling and Load Balancing
Horizontal Pod Autoscaler
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ollama-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
```
Keep in mind that the ReadWriteOnce PVC defined earlier can only be mounted on a single node, so scaling beyond one replica across nodes requires ReadWriteMany storage or a separate volume per replica.
Step 7: Testing Your Deployment
Basic API Test
```bash
# Port forward for testing
kubectl port-forward service/ollama-service 11434:11434 -n ollama-system

# Test API endpoint
curl http://localhost:11434/api/version

# Generate text with Llama 2
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Explain Kubernetes in simple terms",
    "stream": false
  }'
```
Python Client Example
```python
import requests

def query_ollama(prompt, model="llama2:7b"):
    url = "http://ollama.yourdomain.com/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()  # surface HTTP errors instead of failing silently
    return response.json()["response"]

# Example usage
result = query_ollama("Write a Python function to calculate fibonacci numbers")
print(result)
```
Production Considerations
Security Best Practices
1. Network Policies: Restrict pod-to-pod communication (see the sketch after this list)
2. Resource Quotas: Prevent resource exhaustion
3. RBAC: Limit access to Ollama namespaces
4. TLS Termination: Enable HTTPS for external access
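As a starting point for the first item, here is a minimal NetworkPolicy sketch that only allows traffic to Ollama from pods carrying a client label; the app=ollama-client label is a hypothetical example, so adapt the selectors to your own workloads (and remember to also allow your ingress controller's namespace if you expose Ollama through the Ingress above):

```yaml
# Sketch: only pods labeled app=ollama-client may reach Ollama on port 11434
# (the client label is hypothetical; adjust selectors to your environment)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-clients
  namespace: ollama-system
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ollama-client
    ports:
    - protocol: TCP
      port: 11434
```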
Monitoring and Observability
- Prometheus Metrics: Monitor resource usage and request latency
- Grafana Dashboards: Visualize model performance
- Logging: Centralized logging for debugging
- Health Checks: Implement readiness and liveness probes (a probe sketch follows below)
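For the health checks, a reasonable starting point is to probe Ollama's /api/version endpoint. The snippet below is a sketch of what you could add under the ollama container in the Deployment spec; the timings are examples and should be tuned, since loading large models can be slow:

```yaml
# Sketch: probes against Ollama's /api/version endpoint, added under the
# ollama container in the Deployment; delays/periods are example values
readinessProbe:
  httpGet:
    path: /api/version
    port: 11434
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /api/version
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 30
```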
Performance Optimization
- Node Affinity: Schedule pods on high-performance nodes (see the affinity sketch after this list)
- Resource Requests: Right-size CPU/memory allocations
- Model Caching: Use shared storage for model persistence
- Load Balancing: Distribute requests across multiple replicas
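For node affinity, a minimal sketch assuming your high-performance nodes carry a node-class=high-perf label (a hypothetical label; use whatever labels exist in your cluster), added under the pod template's spec:

```yaml
# Sketch: prefer nodes labeled node-class=high-perf (hypothetical label),
# added under the pod template's spec in the Deployment
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: node-class
          operator: In
          values:
          - high-perf
```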
Troubleshooting Common Issues
Pod Stuck in Pending State
- Check node resources and scheduling constraints (diagnostic commands below)
- Verify GPU availability for GPU-enabled deployments
- Ensure the PVC can be bound and mounted
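A few commands that usually reveal why a pod is stuck in Pending:

```bash
# Show scheduling events and reasons for the pending pod
kubectl describe pod -l app=ollama -n ollama-system

# Check recent events and PVC binding status in the namespace
kubectl get events -n ollama-system --sort-by=.lastTimestamp
kubectl get pvc -n ollama-system
```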
Out of Memory Errors
- Increase memory limits for larger models
- Use models appropriate for your hardware
- Consider CPU-only variants for memory-constrained environments
Slow Model Loading
- Pre-pull models using Jobs
- Use faster storage classes (SSD)
- Implement model warmup strategies (see the warmup example below)
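One simple warmup approach relies on Ollama's documented behavior that a generate request without a prompt loads the model into memory without generating anything. For example, with the port-forward from Step 7 active:

```bash
# Warm up llama2:7b: a generate request with no prompt loads the model
# into memory so the first real request doesn't pay the load cost
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2:7b"}'
```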
Conclusion
Deploying Ollama on Kubernetes provides a robust, scalable foundation for running LLMs in production. This setup enables:
- Enterprise-grade deployment with high availability
- Cost-effective scaling based on demand
- Secure, private AI within your infrastructure
- Easy model management and updates
With this foundation, you can build sophisticated AI applications while maintaining full control over your data and infrastructure costs.
Next Steps
1. Implement monitoring with Prometheus and Grafana
2. Add CI/CD pipelines for automated deployments
3. Explore advanced models like Llama 2 70B for complex tasks
4. Build applications using your Kubernetes-hosted LLM API
Start with smaller models and scale up as you validate performance and resource requirements in your environment.