
Running Ollama on Kubernetes: A Complete Guide


Getting Started with Ollama Kubernetes Setup

With the rapid adoption of Large Language Models (LLMs) in enterprise applications, running models locally has become crucial for data privacy, cost control, and reduced latency. Ollama simplifies running LLMs locally, while Kubernetes provides the orchestration needed for production deployments.

In this comprehensive guide, we’ll explore how to deploy Ollama on Kubernetes, enabling you to run powerful AI models like Llama 2, CodeLlama, and Mistral in your own infrastructure.

Why Ollama on Kubernetes?

Benefits of This Architecture

    • Data Privacy: Keep sensitive data within your infrastructure
    • Cost Efficiency: Eliminate API costs for high-volume applications
    • Low Latency: Local inference without external API calls
    • Scalability: Kubernetes auto-scaling for variable workloads
    • Resource Management: Efficient GPU/CPU allocation across the cluster

Prerequisites

Before we begin, ensure you have:

    • Kubernetes cluster (1.19+) with GPU support (optional but recommended)
    • kubectl configured and connected to your cluster
    • Docker registry access for custom images
    • Basic understanding of Kubernetes concepts (Pods, Services, Deployments)

Step 1: Creating the Ollama Deployment

Basic Ollama Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc

GPU-Enabled Deployment

For better performance with larger models:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-gpu
  namespace: ollama-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-gpu
  template:
    metadata:
      labels:
        app: ollama-gpu
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "2000m"
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8000m"
      nodeSelector:
        accelerator: nvidia-tesla-k80
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc

Step 2: Persistent Storage Configuration

Creating Persistent Volume Claim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: fast-ssd

Why Persistent Storage Matters

    • Model Persistence: Downloaded models persist across pod restarts
    • Performance: Avoid re-downloading large models (7GB+ for Llama 2)
    • Cost Efficiency: Reduce egress costs from model repositories
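
The PVC above assumes a StorageClass named fast-ssd exists in your cluster; check with kubectl get storageclass and adjust the name if needed. As a rough sketch only, a GKE-style SSD class might look like this (the provisioner and parameters are assumptions that vary by cloud provider):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
# Provisioner and parameters are cloud-specific; this example assumes GKE persistent disks
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
volumeBindingMode: WaitForFirstConsumer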

Step 3: Service and Ingress Configuration

ClusterIP Service

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama-system
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP

Ingress for External Access

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama-system
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
spec:
  rules:
  - host: ollama.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434

Step 4: Deploying to Kubernetes

Create Namespace and Deploy

# Create dedicated namespace
kubectl create namespace ollama-system

# Apply all configurations
kubectl apply -f ollama-pvc.yaml
kubectl apply -f ollama-deployment.yaml
kubectl apply -f ollama-service.yaml
kubectl apply -f ollama-ingress.yaml

# Verify deployment
kubectl get pods -n ollama-system
kubectl logs -f deployment/ollama -n ollama-system

Step 5: Model Management

Pulling Models via Job

Create a Job to pre-pull popular models:

apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-model-loader
  namespace: ollama-system
spec:
  template:
    spec:
      containers:
      - name: model-loader
        image: curlimages/curl:latest
        command:
        - sh
        - -c
        - |
          # Wait for Ollama service
          until curl -f http://ollama-service:11434/api/version; do
            echo "Waiting for Ollama..."
            sleep 5
          done
          
          # Pull essential models
          curl -X POST http://ollama-service:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "llama2:7b"}'
          
          curl -X POST http://ollama-service:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "codellama:7b"}'
      restartPolicy: OnFailure
  backoffLimit: 3

Available Models for Different Use Cases

    • General Purpose: llama2:7b, llama2:13b
    • Code Generation: codellama:7b, codellama:13b
    • Lightweight: mistral:7b, neural-chat:7b
    • Specialized: vicuna:7b, wizard-coder:7b

Step 6: Scaling and Load Balancing

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ollama-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Step 7: Testing Your Deployment

Basic API Test

# Port forward for testing
kubectl port-forward service/ollama-service 11434:11434 -n ollama-system

# Test API endpoint
curl http://localhost:11434/api/version

# Generate text with Llama 2
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Explain Kubernetes in simple terms",
    "stream": false
  }'

Python Client Example

import requests

def query_ollama(prompt, model="llama2:7b"):
    url = "http://ollama.yourdomain.com/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }

    response = requests.post(url, json=payload)
    return response.json()['response']

# Example usage
result = query_ollama("Write a Python function to calculate fibonacci numbers")
print(result)

Production Considerations

Security Best Practices

1. Network Policies: Restrict pod-to-pod communication (see the sketch after this list)
2. Resource Quotas: Prevent resource exhaustion
3. RBAC: Limit access to Ollama namespaces
4. TLS Termination: Enable HTTPS for external access
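
As a starting point for the first item, a NetworkPolicy can limit traffic so that only the ingress controller reaches Ollama. This is a minimal sketch that assumes your CNI enforces NetworkPolicies and that the controller runs in a namespace named ingress-nginx:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-restrict-ingress
  namespace: ollama-system
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Assumes the NGINX ingress controller runs in the ingress-nginx namespace
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 11434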

Monitoring and Observability

    • Prometheus Metrics: Monitor resource usage and request latency
    • Grafana Dashboards: Visualize model performance
    • Logging: Centralized logging for debugging
    • Health Checks: Implement readiness and liveness probes
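
For the health checks, Ollama's /api/version endpoint (already used above to wait for the service) works as a lightweight probe target. A sketch to add under the ollama container in the Deployment, with illustrative timings:

        readinessProbe:
          httpGet:
            path: /api/version
            port: 11434
          initialDelaySeconds: 10
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /api/version
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 30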

Performance Optimization

    • Node Affinity: Schedule pods on high-performance nodes (see the sketch after this list)
    • Resource Requests: Right-size CPU/memory allocations
    • Model Caching: Use shared storage for model persistence
    • Load Balancing: Distribute requests across multiple replicas
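
For node affinity, a preferred rule keeps Ollama on designated nodes without blocking scheduling when none are available. A sketch for the pod spec, assuming a hypothetical node-class=high-performance label:

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              # Hypothetical label; use whatever labels your nodes actually carry
              - key: node-class
                operator: In
                values:
                - high-performance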

Troubleshooting Common Issues

Pod Stuck in Pending State

    • Check node resources and scheduling constraints
    • Verify GPU availability for GPU-enabled deployments
    • Ensure PVC can be mounted

Out of Memory Errors

    • Increase memory limits for larger models (see the sketch after this list)
    • Use models appropriate for your hardware
    • Consider CPU-only variants for memory-constrained environments
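
As an illustrative sketch for the first point, a 13B-parameter model generally needs noticeably more memory than the 7B defaults shown earlier; the exact numbers depend on the model and quantization:

        resources:
          requests:
            memory: "8Gi"
            cpu: "2000m"
          limits:
            memory: "24Gi"   # Illustrative; size to the model you actually run
            cpu: "8000m"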

Slow Model Loading

    • Pre-pull models using Jobs
    • Use faster storage classes (SSD)
    • Implement model warmup strategies

Conclusion

Deploying Ollama on Kubernetes provides a robust, scalable foundation for running LLMs in production. This setup enables:

    • Enterprise-grade deployment with high availability
    • Cost-effective scaling based on demand
    • Secure, private AI within your infrastructure
    • Easy model management and updates

With this foundation, you can build sophisticated AI applications while maintaining full control over your data and infrastructure costs.

Next Steps

1. Implement monitoring with Prometheus and Grafana
2. Add CI/CD pipelines for automated deployments
3. Explore advanced models like Llama 2 70B for complex tasks
4. Build applications using your Kubernetes-hosted LLM API

Start with smaller models and scale up as you validate performance and resource requirements in your environment.
