Join our Discord Server
Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Getting Started with Ollama on Kubernetes

4 min read

Ollama has emerged as one of the most popular tools for running large language models (LLMs) locally, providing developers and organizations with a simple way to deploy and interact with models like Llama, Mistral, and CodeLlama without relying on external APIs. By packaging these powerful AI models into an easy-to-use interface, Ollama democratizes access to large language models while maintaining complete control over data privacy and model execution. However, as organizations scale their AI workloads and require high availability, fault tolerance, and resource management, running Ollama on individual machines becomes limiting.

This is where Kubernetes comes into play. By deploying Ollama on Kubernetes, organizations can leverage the platform’s robust orchestration capabilities to scale AI workloads horizontally, ensure high availability through automatic failover and restart mechanisms, and efficiently manage GPU resources across multiple nodes. This post walks through deploying Ollama on Kubernetes and addresses the particular challenges of running GPU-intensive AI workloads in a containerized environment. We’ll cover everything from basic deployment configurations to advanced topics: resource allocation, persistent storage for models, and load balancing strategies. Together, these let you build a production-ready AI infrastructure that can handle multiple concurrent requests while maintaining performance and cost efficiency.

This comprehensive guide demonstrates how to deploy Ollama on Kubernetes clusters with production-ready configurations, persistent storage, and auto-scaling capabilities.

Prerequisites

Before we begin, ensure you have the following tools installed:

# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/

# Verify installation
kubectl version --client

# Install Helm (for easier deployments)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# Verify cluster access
kubectl cluster-info

Architecture Overview

Our deployment architecture includes:

  • Kubernetes Cluster: The orchestration platform
  • Ollama Pods: Running LLM inference workloads
  • Persistent Storage: For model persistence
  • Load Balancer: For traffic distribution
  • Monitoring Stack: For observability

Namespace Setup

Create a dedicated namespace and resource quota for LLM workloads:

apiVersion: v1
kind: Namespace
metadata:
  name: llm-workloads
  labels:
    name: llm-workloads
    purpose: ai-inference
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: llm-quota
  namespace: llm-workloads
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "5"

Storage Configuration

Set up persistent storage for model data. Note that both claims below use ReadWriteOnce, which can only be mounted by pods on a single node. Because the Deployment later runs two replicas with pod anti-affinity spreading them across nodes, production use needs an RWX-capable storage class (ReadWriteMany) or per-replica volumes via a StatefulSet:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: llm-workloads
  labels:
    app: ollama
    component: storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-cache-pvc
  namespace: llm-workloads
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard
  resources:
    requests:
      storage: 50Gi
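
The fast-ssd storage class referenced above is not a Kubernetes built-in; it must exist in your cluster. A sketch assuming the AWS EBS CSI driver is installed (substitute your own CSI provisioner and parameters):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # assumption: AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true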

Security Configuration

Create a service account and minimal RBAC permissions:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ollama-sa
  namespace: llm-workloads
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: llm-workloads
  name: ollama-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ollama-rolebinding
  namespace: llm-workloads
subjects:
- kind: ServiceAccount
  name: ollama-sa
  namespace: llm-workloads
roleRef:
  kind: Role
  name: ollama-role
  apiGroup: rbac.authorization.k8s.io
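
RBAC governs access to the Kubernetes API, not pod-to-pod traffic. To restrict who can reach the Ollama API inside the cluster, a NetworkPolicy can be layered on top. A sketch that only admits traffic from pods in this namespace (it requires a CNI that enforces NetworkPolicy, such as Calico or Cilium, and note that it also blocks traffic arriving via the external LoadBalancer defined later, so add an ipBlock rule if you expose the service externally):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-allow-namespace
  namespace: llm-workloads
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}   # any pod in llm-workloads
    ports:
    - protocol: TCP
      port: 11434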

Main Deployment Configuration

Create the Ollama Deployment with health probes, pod anti-affinity, and shared memory for inference:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: llm-workloads
  labels:
    app: ollama
    component: inference
    version: v1
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
        component: inference
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "11434"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: ollama-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
      - name: ollama
        image: ollama/ollama:latest  # pin a specific version tag in production
        imagePullPolicy: Always
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: OLLAMA_MODELS
          value: "/models"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: ollama-models
          mountPath: /models
        - name: ollama-cache
          mountPath: /tmp
        - name: shared-memory
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc
      - name: ollama-cache
        persistentVolumeClaim:
          claimName: ollama-cache-pvc
      - name: shared-memory
        emptyDir:
          medium: Memory
          sizeLimit: 2Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ollama
              topologyKey: kubernetes.io/hostname
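
The deployment above is CPU-only. For the GPU inference the introduction alludes to, you would extend the container resources and steer pods to GPU nodes. A hedged fragment assuming the NVIDIA device plugin is installed on the cluster; the node label is an example, use whatever labels your GPU nodes carry:

# Merge into the ollama container spec above
resources:
  requests:
    memory: "8Gi"
    cpu: "4"
  limits:
    memory: "16Gi"
    cpu: "8"
    nvidia.com/gpu: 1          # requires the NVIDIA device plugin
---
# Pod-level fields, alongside the existing affinity block
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
nodeSelector:
  gpu-type: nvidia             # example label; adjust to your cluster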

Service Configuration

Expose Ollama externally through a LoadBalancer Service and internally through a headless Service for direct pod addressing:

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: llm-workloads
  labels:
    app: ollama
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
    name: http
  selector:
    app: ollama
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-headless
  namespace: llm-workloads
  labels:
    app: ollama
spec:
  clusterIP: None
  ports:
  - port: 11434
    targetPort: 11434
    protocol: TCP
    name: http
  selector:
    app: ollama

Testing and Validation

Test the deployment with comprehensive checks:

# Check deployment status
kubectl get pods -n llm-workloads -l app=ollama
kubectl get pvc -n llm-workloads
kubectl get svc -n llm-workloads

# Get service endpoint (AWS NLBs report a hostname rather than an IP)
EXTERNAL_IP=$(kubectl get service ollama-service -n llm-workloads \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}')
echo "Ollama external endpoint: $EXTERNAL_IP"

# Port forward for local testing
kubectl port-forward -n llm-workloads service/ollama-service 11434:11434 &

# Wait for port forward
sleep 5

# Test basic connectivity
curl -f http://localhost:11434/api/tags || echo "Service not ready yet"

# Install a model
echo "📥 Installing Llama2 7B model..."
curl -X POST http://localhost:11434/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "llama2:7b"}'

# Test inference
echo "🧠 Testing inference..."
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2:7b",
    "prompt": "Explain Kubernetes in simple terms",
    "stream": false
  }' | jq '.response'
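
The manual pull above only loads the model into whichever pod happened to serve the request. To automate it, a one-shot Job can run the Ollama CLI as a client against the in-cluster Service; this sketch assumes the CLI honors OLLAMA_HOST for client commands, which current releases do:

apiVersion: batch/v1
kind: Job
metadata:
  name: ollama-pull-llama2
  namespace: llm-workloads
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pull
        image: ollama/ollama:latest
        command: ["ollama", "pull", "llama2:7b"]
        env:
        - name: OLLAMA_HOST
          value: "http://ollama-service.llm-workloads.svc.cluster.local:11434"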

Horizontal Pod Autoscaler

Configure auto-scaling based on resource usage:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: llm-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
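
With the HPA scaling between 2 and 10 replicas, a PodDisruptionBudget keeps voluntary disruptions such as node drains and cluster upgrades from taking all replicas down at once. A minimal sketch:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: llm-workloads
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ollama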

Troubleshooting Guide

Common issues and solutions:

# Check pod logs
kubectl logs -n llm-workloads deployment/ollama-deployment -f

# Check resource usage
kubectl top pods -n llm-workloads

# Describe problematic pods
kubectl describe pod -n llm-workloads -l app=ollama

# Check PVC status
kubectl describe pvc -n llm-workloads

# Check events
kubectl get events -n llm-workloads --sort-by='.lastTimestamp'

# Debug networking
kubectl exec -it -n llm-workloads deployment/ollama-deployment -- /bin/bash

# Test from within the cluster using a one-shot debug pod
kubectl run debug -n llm-workloads --image=curlimages/curl -it --rm --restart=Never --command -- \
  curl http://ollama-service.llm-workloads.svc.cluster.local:11434/api/tags

Production Best Practices

For production deployments, consider:

  1. Resource Management: Set appropriate resource requests and limits
  2. Storage: Use high-performance SSDs with proper backup strategies
  3. Security: Implement network policies and pod security standards
  4. Monitoring: Comprehensive observability with Prometheus and Grafana
  5. Backup: Regular backup of model data and configurations
  6. Updates: Rolling update strategy for zero-downtime deployments

Cleanup

To remove the deployment:

# Delete all resources
kubectl delete namespace llm-workloads

# Or delete individually
kubectl delete -f ollama-deployment.yaml
kubectl delete -f ollama-service.yaml
kubectl delete -f ollama-storage.yaml
kubectl delete -f ollama-rbac.yaml

Conclusion

This tutorial provided a comprehensive guide to deploying Ollama on Kubernetes with production-ready configurations. The setup includes persistent storage, security, monitoring, and auto-scaling capabilities.

Have Queries? Join https://launchpass.com/collabnix
