Scaling Ollama Deployments: Load Balancing Strategies for Production

As organizations increasingly adopt local large language models (LLMs) for AI workloads, Ollama has emerged as a popular solution for running models like Llama 2, Mistral, and CodeLlama. However, scaling Ollama deployments to handle production traffic requires sophisticated load balancing strategies. This comprehensive guide explores proven approaches to distribute inference workloads efficiently across multiple Ollama instances.

Understanding Ollama’s Architecture for Scale

Before implementing load balancing, it's crucial to understand Ollama's operational characteristics. Each Ollama instance is stateless with respect to client requests, which makes it a good candidate for horizontal scaling. However, loaded models consume significant GPU/CPU resources and memory, which presents unique challenges (a quick way to inspect this on a running instance is shown after the list):

  • Model loading time: Initial model loads can take 5-30 seconds depending on size
  • Memory footprint: Models remain in memory for faster subsequent requests
  • GPU affinity: Each instance typically binds to specific GPU resources
  • Concurrent request handling: Limited by available VRAM and compute capacity
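You can observe these characteristics directly against a running instance. A quick check, assuming a default instance on port 11434 (the /api/ps endpoint, available on recent Ollama releases, reports which models are currently loaded in memory):

# List models available on this instance
curl -s http://localhost:11434/api/tags

# Show which models are currently loaded in memory and when they expire
curl -s http://localhost:11434/api/ps

# Time a cold request to gauge model loading overhead (repeat it to see the warm-cache difference)
time curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "hello", "stream": false}' > /dev/null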

Kubernetes-Native Load Balancing with Services

The most straightforward approach for Kubernetes deployments leverages native Service resources with multiple Ollama pod replicas. This method provides automatic service discovery and basic round-robin load distribution.

Deploying Multiple Ollama Replicas

Start by creating a Deployment (or StatefulSet) with multiple replicas. Here's a baseline configuration; for production, replace the emptyDir volume with a PersistentVolumeClaim so models survive pod restarts (as noted in the best practices below):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: ai-workloads
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http
        resources:
          requests:
            memory: "8Gi"
            cpu: "2000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "16Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        emptyDir: {}
      nodeSelector:
        accelerator: nvidia-tesla-t4
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-workloads
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300

The sessionAffinity: ClientIP setting routes requests from the same client to the same pod for 5 minutes, so repeat requests are more likely to hit a pod that already has the relevant model loaded.

Testing the Load Balancing Setup

Verify your deployment with these commands:

# Check pod distribution across nodes
kubectl get pods -n ai-workloads -o wide

# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://ollama-service.ai-workloads:11434/api/tags

# Monitor resource utilization
kubectl top pods -n ai-workloads

Advanced Load Balancing with NGINX Ingress

For more sophisticated traffic management, NGINX Ingress Controller offers weighted routing, health checks, and custom balancing algorithms. This approach is particularly valuable when handling diverse model workloads.

Implementing Least Connections Algorithm

The least connections algorithm routes requests to the backend with the fewest active connections, which suits long-running inference tasks. Don't combine it with a consistent-hash annotation such as upstream-hash-by (hashing overrides the load-balance setting), and note that recent ingress-nginx releases document only round_robin and ewma for this annotation, ewma being the closest load-aware alternative if least_conn is rejected:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ai-workloads
  annotations:
    nginx.ingress.kubernetes.io/load-balance: "least_conn"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  rules:
  - host: ollama.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434

Custom Health Checks for Model Availability

Implement health checks that verify model readiness, not just pod liveness. The script below issues a minimal generation request and succeeds only when the model actually responds; store it in a ConfigMap so it can be mounted into the pods and called from a probe (see the sketch after the ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: ai-workloads
data:
  health-check.sh: |
    #!/bin/bash
    RESPONSE=$(curl -s -X POST http://localhost:11434/api/generate \
      -d '{"model": "llama2", "prompt": "test", "stream": false}' \
      -H "Content-Type: application/json")
    
    if echo "$RESPONSE" | grep -q "response"; then
      exit 0
    else
      exit 1
    fi
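To put this script to work, mount the ConfigMap into the Ollama pods and call it from an exec readiness probe. A sketch of the relevant pod-spec excerpt (the mount path and timings are illustrative; it reuses the nginx-config ConfigMap and llama2 model from above):

# Pod template excerpt: mount the health-check script and use it as a readiness probe
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        volumeMounts:
        - name: health-check
          mountPath: /opt/health
        readinessProbe:
          exec:
            command: ["/bin/bash", "/opt/health/health-check.sh"]
          initialDelaySeconds: 60
          periodSeconds: 15
          timeoutSeconds: 30
      volumes:
      - name: health-check
        configMap:
          name: nginx-config
          defaultMode: 0755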

HAProxy for High-Performance Load Balancing

For maximum throughput and fine-grained control, HAProxy provides exceptional performance with advanced routing capabilities. This solution works well for bare-metal or VM-based deployments.

HAProxy Configuration for Ollama

global
    log /dev/log local0
    maxconn 4096
    tune.ssl.default-dh-param 2048

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 10s
    timeout client 300s
    timeout server 300s
    timeout http-keep-alive 10s

frontend ollama_frontend
    bind *:80
    default_backend ollama_backend
    
    # Track request rates
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src
    
    # Rate limiting: 100 requests per 10 seconds
    acl too_many_requests sc_http_req_rate(0) gt 100
    http-request deny if too_many_requests

backend ollama_backend
    balance leastconn
    option httpchk GET /api/tags
    http-check expect status 200
    
    # Backend servers
    server ollama1 10.0.1.10:11434 check inter 5s fall 3 rise 2 maxconn 10
    server ollama2 10.0.1.11:11434 check inter 5s fall 3 rise 2 maxconn 10
    server ollama3 10.0.1.12:11434 check inter 5s fall 3 rise 2 maxconn 10
    
    # Stick table for session persistence
    stick-table type string len 32 size 100k expire 30m
    stick on hdr(X-Session-ID)

# Stats endpoint used by the monitoring command below
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s

Deploy HAProxy in a containerized environment:

# Create HAProxy container
docker run -d --name haproxy-ollama \
  -p 80:80 \
  -p 8404:8404 \
  -v /path/to/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro \
  --restart unless-stopped \
  haproxy:2.8

# Monitor HAProxy stats
curl http://localhost:8404/stats

Implementing Intelligent Request Routing

Different models have varying resource requirements. Implement model-aware routing to optimize resource utilization across your cluster.

Python-Based Smart Router

Create a custom routing layer that directs requests based on model characteristics:

from flask import Flask, request, Response
import requests
from collections import defaultdict
import threading
import time

app = Flask(__name__)

# Backend pools organized by model size
BACKENDS = {
    'small': ['http://ollama-small-1:11434', 'http://ollama-small-2:11434'],
    'medium': ['http://ollama-medium-1:11434', 'http://ollama-medium-2:11434'],
    'large': ['http://ollama-large-1:11434']
}

# Model to pool mapping
MODEL_POOLS = {
    'llama2:7b': 'small',
    'mistral': 'small',
    'llama2:13b': 'medium',
    'codellama:34b': 'large'
}

# Track backend health and load
backend_stats = defaultdict(lambda: {'requests': 0, 'healthy': True})
stats_lock = threading.Lock()

def get_backend_for_model(model_name):
    """Select optimal backend based on model requirements"""
    pool = MODEL_POOLS.get(model_name, 'medium')
    available_backends = [
        b for b in BACKENDS[pool] 
        if backend_stats[b]['healthy']
    ]
    
    if not available_backends:
        available_backends = BACKENDS[pool]
    
    # Least connections selection
    return min(available_backends, 
               key=lambda b: backend_stats[b]['requests'])

def health_check():
    """Periodic health check for all backends"""
    while True:
        for pool in BACKENDS.values():
            for backend in pool:
                try:
                    response = requests.get(f"{backend}/api/tags", timeout=5)
                    with stats_lock:
                        backend_stats[backend]['healthy'] = response.status_code == 200
                except requests.RequestException:
                    with stats_lock:
                        backend_stats[backend]['healthy'] = False
        time.sleep(10)

@app.route('/api/generate', methods=['POST'])
def generate():
    data = request.get_json()
    model = data.get('model', 'llama2:7b')
    
    backend = get_backend_for_model(model)
    
    with stats_lock:
        backend_stats[backend]['requests'] += 1
    
    try:
        response = requests.post(
            f"{backend}/api/generate",
            json=data,
            stream=True,
            timeout=300
        )
    except requests.RequestException:
        # Backend unreachable or timed out: release the slot and report failure
        with stats_lock:
            backend_stats[backend]['requests'] -= 1
        return Response('{"error": "backend unavailable"}',
                        status=502, content_type='application/json')

    def stream_and_release():
        """Stream the backend response, releasing the connection slot when done."""
        try:
            for chunk in response.iter_content(chunk_size=1024):
                yield chunk
        finally:
            # Decrement only after streaming completes so least-connections stays accurate
            with stats_lock:
                backend_stats[backend]['requests'] -= 1

    return Response(
        stream_and_release(),
        content_type=response.headers.get('content-type', 'application/json')
    )

if __name__ == '__main__':
    # Start health check thread
    health_thread = threading.Thread(target=health_check, daemon=True)
    health_thread.start()
    
    app.run(host='0.0.0.0', port=8080, threaded=True)
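Clients then send requests to the router rather than to individual instances, and the router selects the backend pool from the model name. For example (the router hostname is assumed):

# Route through the smart router; it forwards codellama:34b to the 'large' pool
curl http://smart-router:8080/api/generate \
  -d '{"model": "codellama:34b", "prompt": "Write a haiku about load balancing", "stream": false}'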

Autoscaling Strategies with HPA and KEDA

Implement dynamic scaling based on actual workload metrics using Horizontal Pod Autoscaler (HPA) or KEDA for event-driven autoscaling.

Custom Metrics-Based Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: ollama_active_requests
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
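Note that the ollama_active_requests pod metric above requires a custom metrics pipeline (for example Prometheus plus prometheus-adapter). If you would rather let KEDA drive scaling from Prometheus directly, a minimal ScaledObject sketch looks like this (the Prometheus address, metric name, and threshold are assumptions for illustration):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaledobject
  namespace: ai-workloads
spec:
  scaleTargetRef:
    name: ollama-deployment
  minReplicaCount: 2
  maxReplicaCount: 10
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(ollama_active_requests)
      threshold: "15"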

Troubleshooting Common Load Balancing Issues

Problem: Uneven Load Distribution

Symptoms: Some Ollama instances receive disproportionate traffic while others remain idle.

Solution: Verify service endpoint distribution and check for session affinity misconfigurations:

# Check endpoint distribution
kubectl get endpoints ollama-service -n ai-workloads -o yaml

# Disable session affinity if not needed
kubectl patch service ollama-service -n ai-workloads \
  -p '{"spec":{"sessionAffinity":"None"}}'

# Monitor request distribution
for pod in $(kubectl get pods -n ai-workloads -l app=ollama -o name); do
  echo "$pod:"
  kubectl logs $pod -n ai-workloads | grep "/api/generate" | wc -l
done

Problem: Slow Model Loading on New Pods

Symptoms: Initial requests to scaled pods timeout or experience high latency.

Solution: Implement model pre-warming with init containers or readiness probes:

spec:
  containers:
  - name: ollama
    image: ollama/ollama:latest
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - |
          curl -s -X POST http://localhost:11434/api/generate \
            -d '{"model": "llama2", "prompt": "ready", "stream": false}' | \
            grep -q response
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 30
      failureThreshold: 3
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            sleep 10
            ollama pull llama2
            ollama run llama2 "warmup" --verbose
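Alternatively, an init container can pre-pull the model into the shared volume before the main container starts, so the first real request never waits on a download. A sketch reusing the ollama-data volume from the earlier Deployment (running a temporary server inside the init container is a pragmatic workaround, since ollama pull needs a server to talk to):

      initContainers:
      - name: model-prefetch
        image: ollama/ollama:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Start a temporary server, give it a moment, pull the model, then exit;
          # the downloaded blobs remain on the shared volume for the main container
          ollama serve &
          sleep 5
          ollama pull llama2
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama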

Problem: Memory Exhaustion Under Load

Symptoms: Pods crash or restart frequently during high traffic periods.

Solution: Bound per-instance concurrency and protect the cluster with quotas and disruption budgets (a per-pod concurrency sketch follows the commands below):

# Add resource quotas to namespace
kubectl create quota ai-workload-quota -n ai-workloads \
  --hard=requests.nvidia.com/gpu=10,limits.nvidia.com/gpu=10

# Configure pod disruption budget
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: ai-workloads
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ollama
EOF
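To actually bound queuing and concurrency at the Ollama level, recent releases expose environment variables that cap parallel and queued requests per instance. A Deployment excerpt sketch (variable availability and sensible values depend on your Ollama version and available VRAM, so treat the numbers as placeholders):

        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: OLLAMA_NUM_PARALLEL       # concurrent requests served per loaded model
          value: "2"
        - name: OLLAMA_MAX_QUEUE          # requests queued before Ollama starts returning 503s
          value: "64"
        - name: OLLAMA_MAX_LOADED_MODELS  # models kept in memory at once
          value: "1"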

Best Practices for Production Deployments

  • Model caching strategy: Use persistent volumes for model storage to reduce startup times
  • GPU affinity: Pin pods to specific GPU-enabled nodes using node selectors and taints
  • Request timeouts: Set appropriate timeouts (60-300s) based on model size and expected inference time
  • Monitoring: Implement comprehensive metrics collection using Prometheus and Grafana
  • Circuit breakers: Implement request circuit breakers to prevent cascade failures (a service-mesh sketch follows this list)
  • Cost optimization: Use spot instances for non-critical workloads with appropriate scaling policies
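For the circuit-breaker item above, a service mesh provides this without touching application code. A minimal Istio DestinationRule sketch, assuming Istio is installed and the ollama-service created earlier is mesh-enabled (the thresholds are illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ollama-circuit-breaker
  namespace: ai-workloads
spec:
  host: ollama-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 20    # queued requests before new ones are rejected
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 3          # eject a pod after three consecutive 5xx responses
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50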

Monitoring and Observability

Deploy a monitoring stack to track load balancing effectiveness. Note that Ollama does not expose a Prometheus /metrics endpoint out of the box, so the ServiceMonitor below assumes a sidecar exporter or proxy-level instrumentation publishing the listed metrics:

# Deploy Prometheus ServiceMonitor
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama-monitor
  namespace: ai-workloads
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
EOF

# Key metrics to monitor
# - ollama_request_duration_seconds
# - ollama_active_requests
# - ollama_model_load_duration_seconds
# - ollama_gpu_utilization_percent
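Once those series exist, you can turn them into alerts with a PrometheusRule. A sketch assuming the metric names above (including a _bucket histogram for request duration) are actually exported:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts
  namespace: ai-workloads
spec:
  groups:
  - name: ollama.rules
    rules:
    - alert: OllamaHighInferenceLatency
      expr: histogram_quantile(0.95, sum(rate(ollama_request_duration_seconds_bucket[5m])) by (le)) > 60
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "p95 inference latency is above 60 seconds"
    - alert: OllamaUnevenLoad
      expr: max(ollama_active_requests) - min(ollama_active_requests) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Active requests are unevenly distributed across Ollama pods"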

Conclusion

Scaling Ollama deployments requires a thoughtful approach to load balancing that considers the unique characteristics of LLM inference workloads. Whether you choose Kubernetes-native services, NGINX Ingress, HAProxy, or custom routing solutions, the key is matching your load balancing strategy to your specific workload patterns and infrastructure constraints.

Start with simple round-robin distribution for uniform workloads, then progress to sophisticated strategies like least-connections or model-aware routing as your requirements evolve. Always implement comprehensive monitoring, health checks, and autoscaling to ensure reliable performance under varying load conditions.

By following the patterns and configurations outlined in this guide, you’ll be well-equipped to build robust, scalable Ollama deployments that can handle production AI workloads efficiently.

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.