Running Multiple Ollama Models on Kubernetes: Complete Guide

As large language models (LLMs) become increasingly essential for modern applications, running them efficiently at scale presents unique challenges. Ollama has emerged as a powerful tool for running LLMs locally, and when combined with Kubernetes, it provides a robust platform for deploying multiple models simultaneously. This comprehensive guide walks you through the architecture, deployment strategies, and best practices for running multiple Ollama models on Kubernetes.

Why Run Ollama on Kubernetes?

Before diving into implementation details, let’s understand why Kubernetes is an ideal platform for Ollama deployments:

  • Resource Management: Kubernetes provides fine-grained control over CPU, memory, and GPU allocation across multiple models
  • Scalability: Horizontal pod autoscaling enables dynamic scaling based on demand
  • High Availability: Built-in health checks and automatic restarts ensure model availability
  • Multi-tenancy: Run different models with isolated resources and access controls
  • Cost Optimization: Efficient resource utilization through node affinity and scheduling policies

Architecture Overview

A typical multi-model Ollama deployment on Kubernetes consists of several key components:

  • Persistent Volumes: Store model files that can be shared across pods
  • StatefulSets or Deployments: Manage Ollama containers with specific model configurations
  • Services: Expose models via ClusterIP or LoadBalancer
  • ConfigMaps: Store model configurations and startup scripts
  • Resource Quotas: Ensure fair resource distribution across models (a minimal example follows this list)
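
Resource quotas are not shown in the walkthrough below, so here is a minimal sketch of a ResourceQuota for the ollama namespace; the numbers are placeholders to adapt to your cluster's capacity:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota
  namespace: ollama
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    limits.cpu: "32"
    limits.memory: 128Gi
    requests.nvidia.com/gpu: "2"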

Prerequisites

Before proceeding, ensure you have the following (a quick verification snippet follows the list):

  • A running Kubernetes cluster (v1.24+)
  • kubectl configured with cluster access
  • Sufficient storage (50GB+ recommended per model)
  • GPU support configured (optional but recommended for production)
  • Basic understanding of Kubernetes concepts
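
You can sanity-check these prerequisites with a few standard commands; the GPU check only returns output if the NVIDIA device plugin is installed on your nodes:

# Check cluster and client versions
kubectl version

# Confirm a StorageClass that supports ReadWriteMany is available
kubectl get storageclass

# If using GPUs, confirm nodes advertise nvidia.com/gpu capacity
kubectl describe nodes | grep nvidia.com/gpu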

Setting Up Persistent Storage

First, create a namespace and a PersistentVolumeClaim to store Ollama models. A ReadWriteMany volume lets multiple pods share the same model files, reducing storage overhead; note that this requires a storage class that supports RWX access (for example NFS, CephFS, or a managed file store), so adjust storageClassName to match your cluster.

apiVersion: v1
kind: Namespace
metadata:
  name: ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ollama
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: standard
  resources:
    requests:
      storage: 100Gi
Apply the configuration:

kubectl apply -f ollama-storage.yaml
kubectl get pvc -n ollama

Deploying Your First Ollama Model

Let’s deploy Llama 2 as our first model. Create a deployment configuration that pulls the model on initialization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama2
  namespace: ollama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ollama
      model: llama2
  template:
    metadata:
      labels:
        app: ollama
        model: llama2
    spec:
      initContainers:
      - name: pull-model
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
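            # Start a temporary Ollama server, pull the model into the shared volume, then stop it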
            ollama serve &
            sleep 10;
            ollama pull llama2;
            pkill ollama
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
        livenessProbe:
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 15
          periodSeconds: 5
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
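
Apply the manifest and wait for the rollout to complete. The first start can take several minutes while the init container downloads the model into the shared volume (the file name below is arbitrary):

kubectl apply -f ollama-llama2.yaml
kubectl rollout status deployment/ollama-llama2 -n ollama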

Creating Services for Model Access

Expose each model through a dedicated service for clean separation and routing:

apiVersion: v1
kind: Service
metadata:
  name: ollama-llama2-service
  namespace: ollama
spec:
  selector:
    app: ollama
    model: llama2
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP
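
Inside the cluster, the model is now reachable through the service's DNS name. For example, another pod could call:

curl http://ollama-llama2-service.ollama.svc.cluster.local:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Hello",
  "stream": false
}'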

Deploying Multiple Models Simultaneously

Now let’s deploy additional models. The example below adds Mistral; the same pattern applies to CodeLlama or any other model in the Ollama library:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-mistral
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
      model: mistral
  template:
    metadata:
      labels:
        app: ollama
        model: mistral
    spec:
      initContainers:
      - name: pull-model
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            ollama serve &
            sleep 10;
            ollama pull mistral;
            pkill ollama
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-mistral-service
  namespace: ollama
spec:
  selector:
    app: ollama
    model: mistral
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP

Implementing GPU Support

For production workloads, GPU acceleration is crucial. Here’s how to configure GPU support:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-llama2-gpu
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
      model: llama2-gpu
  template:
    metadata:
      labels:
        app: ollama
        model: llama2-gpu
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
      initContainers:
      - name: pull-model
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
          - |
            ollama serve &
            sleep 10;
            ollama pull llama2;
            pkill ollama
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
          limits:
            memory: "16Gi"
            cpu: "8"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: ollama-storage
          mountPath: /root/.ollama
      volumes:
      - name: ollama-storage
        persistentVolumeClaim:
          claimName: ollama-models-pvc
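
This manifest assumes the NVIDIA device plugin is installed and that GPU nodes carry the accelerator: nvidia-gpu label used in the nodeSelector (label keys vary between environments, so adjust to match your cluster):

# Label a GPU node if your provisioner has not done so already
kubectl label node <gpu-node-name> accelerator=nvidia-gpu

# Verify the device plugin is running and GPUs are advertised
kubectl get pods -n kube-system | grep nvidia
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu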

Testing Your Deployment

Verify that all models are running correctly:

# Check pod status
kubectl get pods -n ollama

# Check services
kubectl get svc -n ollama

# Port forward to test locally
kubectl port-forward -n ollama svc/ollama-llama2-service 11434:11434

Test the model with a curl command:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
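
You can also list the models currently available to the pod you are port-forwarded to:

curl http://localhost:11434/api/tags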

Implementing an Ingress Controller

For external access, configure an Ingress resource with path-based routing. With the NGINX ingress controller, use a regex capture group in the rewrite annotation so that a request to /llama2/api/generate reaches the backend as /api/generate:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
spec:
  ingressClassName: nginx
  rules:
  - host: ollama.yourdomain.com
    http:
      paths:
      - path: /llama2(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: ollama-llama2-service
            port:
              number: 11434
      - path: /mistral(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: ollama-mistral-service
            port:
              number: 11434
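
With this rewrite in place, a request to /llama2/api/generate on the ingress host is forwarded to the Llama 2 service as /api/generate (ollama.yourdomain.com is a placeholder for your own domain):

curl http://ollama.yourdomain.com/llama2/api/generate -d '{
  "model": "llama2",
  "prompt": "Hello",
  "stream": false
}'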

Implementing Horizontal Pod Autoscaling

Scale your models based on CPU and memory utilization. The HorizontalPodAutoscaler below relies on the metrics-server being installed in the cluster:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-llama2-hpa
  namespace: ollama
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-llama2
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
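
Apply the autoscaler and watch it react to load (the file name below is arbitrary):

kubectl apply -f ollama-hpa.yaml
kubectl get hpa -n ollama -w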

Monitoring and Observability

Ollama does not expose Prometheus metrics natively, so a full ServiceMonitor setup would require an external exporter. As a lightweight starting point, store a polling script in a ConfigMap and run it alongside the model pods:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-exporter
  namespace: ollama
data:
  monitor.sh: |
    #!/bin/bash
    while true; do
      curl -s http://localhost:11434/api/tags | jq .
      sleep 30
    done
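
On its own the ConfigMap does nothing; one option is to mount it into the Ollama pod template as a small sidecar that periodically logs the loaded models. A sketch of the extra container and volume follows (the image is a placeholder for any image that provides curl and jq):

      containers:
      # ...existing ollama container...
      - name: model-monitor
        image: your-registry/curl-jq:latest
        command: ["/bin/sh", "/scripts/monitor.sh"]
        volumeMounts:
        - name: monitor-script
          mountPath: /scripts
      volumes:
      # ...existing ollama-storage volume...
      - name: monitor-script
        configMap:
          name: ollama-exporter
          defaultMode: 0755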

Troubleshooting Common Issues

Model Download Failures

If models fail to download during initialization:

# Check init container logs
kubectl logs -n ollama <pod-name> -c pull-model

# Increase init container timeout
kubectl patch deployment ollama-llama2 -n ollama -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"pull-model","command":["/bin/sh","-c","timeout 600 ..."]}]}}}}'

Out of Memory Errors

Adjust resource limits if pods are being OOMKilled:

# Check resource usage
kubectl top pods -n ollama

# Increase memory limits
kubectl set resources deployment ollama-llama2 -n ollama --limits=memory=16Gi

Slow Response Times

Investigate performance issues:

# Check if GPU is being utilized
kubectl exec -n ollama <pod-name> -- nvidia-smi

# Monitor request latency
kubectl logs -n ollama <pod-name> --tail=100

Best Practices for Production

  • Resource Isolation: Use separate namespaces for different model categories
  • Model Versioning: Tag deployments with model versions for easy rollbacks
  • Caching Strategy: Share model storage across pods using ReadWriteMany PVCs
  • Security: Implement NetworkPolicies to restrict model access (a sketch follows this list)
  • Cost Management: Use node affinity to schedule models on appropriate instance types
  • Backup Strategy: Regularly backup model configurations and fine-tuned weights
  • Rate Limiting: Implement request throttling at the Ingress level
  • Health Checks: Configure appropriate liveness and readiness probes
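
As an example of the security point above, here is a minimal NetworkPolicy sketch that only allows traffic to the Ollama pods from within the ollama namespace and from the ingress controller namespace (the ingress-nginx namespace label is an assumption; match it to your controller):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-restrict-access
  namespace: ollama
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 11434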

Advanced Configuration: Model Router

Create a simple Python-based router to distribute requests across models:

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

# In-cluster DNS names of the per-model services
MODEL_ENDPOINTS = {
    "llama2": "http://ollama-llama2-service.ollama.svc.cluster.local:11434",
    "mistral": "http://ollama-mistral-service.ollama.svc.cluster.local:11434"
}

@app.route('/api/generate', methods=['POST'])
def generate():
    data = request.get_json(force=True)
    model = data.get('model', 'llama2')

    if model not in MODEL_ENDPOINTS:
        return jsonify({"error": f"Model '{model}' not found"}), 404

    # Forward the request to the matching Ollama service and relay its response.
    # This simple router assumes "stream": false; streaming responses would need
    # to be proxied chunk by chunk.
    upstream = requests.post(
        f"{MODEL_ENDPOINTS[model]}/api/generate",
        json=data,
        timeout=300
    )
    return jsonify(upstream.json()), upstream.status_code

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
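
To use the router in-cluster, package it in a container image and expose it behind its own Service; clients then send standard Ollama payloads to the router and select a backend via the model field (the service name below is a placeholder):

curl http://<router-service>:8080/api/generate -d '{
  "model": "mistral",
  "prompt": "Summarize Kubernetes in one sentence",
  "stream": false
}'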

Conclusion

Running multiple Ollama models on Kubernetes provides a scalable, production-ready platform for deploying LLMs. By leveraging Kubernetes’ orchestration capabilities, you can efficiently manage resources, ensure high availability, and scale based on demand. The configurations and practices outlined in this guide provide a solid foundation for building robust AI inference infrastructure.

As you scale your deployment, consider implementing advanced features like model caching, request queuing, and multi-region deployments. Regular monitoring and optimization will ensure your Ollama deployment continues to meet performance and cost objectives.

Start with a single model deployment, validate your setup, and gradually expand to multiple models as your requirements grow. The flexibility of Kubernetes combined with Ollama’s simplicity creates a powerful platform for modern AI applications.

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.