As organizations increasingly adopt local large language models (LLMs) for AI workloads, Ollama has emerged as a popular solution for running models like Llama 2, Mistral, and CodeLlama. However, scaling Ollama deployments to handle production traffic requires sophisticated load balancing strategies. This comprehensive guide explores proven approaches to distribute inference workloads efficiently across multiple Ollama instances.
Understanding Ollama’s Architecture for Scale
Before implementing load balancing, it’s crucial to understand Ollama’s operational characteristics. Each Ollama instance is stateless at the API level, which makes it a natural candidate for horizontal scaling. However, loaded models consume significant GPU/CPU resources and memory, and that cached state creates unique challenges (illustrated by the quick check after this list):
- Model loading time: Initial model loads can take 5-30 seconds depending on size
- Memory footprint: Models remain in memory for faster subsequent requests
- GPU affinity: Each instance typically binds to specific GPU resources
- Concurrent request handling: Limited by available VRAM and compute capacity
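A quick way to see the first two characteristics on a running instance is to time a cold request against a warm one. A minimal check, assuming a local instance on the default port with the llama2 model pulled; exact latencies depend on your hardware:

# First request triggers a model load from disk (expect several seconds or more)
time curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "hello", "stream": false}' > /dev/null

# Repeat immediately: the model is already resident in memory, so this returns much faster
time curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "hello", "stream": false}' > /dev/null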
Kubernetes-Native Load Balancing with Services
The most straightforward approach for Kubernetes deployments leverages native Service resources with multiple Ollama pod replicas. This method provides automatic service discovery and basic round-robin load distribution.
Deploying Multiple Ollama Replicas
Start by creating a StatefulSet or Deployment with multiple replicas. Here’s a baseline configuration; for production, pin the image tag and swap the emptyDir volume for persistent storage (see the best practices section below):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-deployment
  namespace: ai-workloads
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http
        resources:
          requests:
            memory: "8Gi"
            cpu: "2000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "16Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        emptyDir: {}
      nodeSelector:
        accelerator: nvidia-tesla-t4
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-workloads
  labels:
    app: ollama
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434
  type: ClusterIP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300
The sessionAffinity: ClientIP configuration routes requests from the same client IP to the same pod for 5 minutes, which preserves model-cache locality, at the cost of less even distribution when a small number of clients generate most of the traffic.
Testing the Load Balancing Setup
Verify your deployment with these commands:
# Check pod distribution across nodes
kubectl get pods -n ai-workloads -o wide

# Test service connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://ollama-service.ai-workloads:11434/api/tags

# Monitor resource utilization
kubectl top pods -n ai-workloads
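To confirm that the ClientIP session affinity configured above is actually pinning a client to one pod, send several requests from a single source and compare per-pod request logs; this assumes your Ollama version logs incoming API requests, which recent releases do:

# Issue five requests from the same client pod
kubectl run -it --rm affinity-test --image=curlimages/curl --restart=Never -- \
  sh -c 'for i in 1 2 3 4 5; do curl -s http://ollama-service.ai-workloads:11434/api/tags > /dev/null; done'

# Compare how many requests each replica logged
for pod in $(kubectl get pods -n ai-workloads -l app=ollama -o name); do
  echo "$pod: $(kubectl logs $pod -n ai-workloads | grep -c '/api/tags')"
done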
Advanced Load Balancing with NGINX Ingress
For more sophisticated traffic management, NGINX Ingress Controller offers weighted routing, health checks, and custom balancing algorithms. This approach is particularly valuable when handling diverse model workloads.
Implementing Least Connections Algorithm
The least-connections algorithm routes each request to the backend with the fewest active connections, which suits long-running inference tasks. Note that support for this value depends on your controller version: recent ingress-nginx releases implement balancing in Lua and only accept round_robin and ewma for the load-balance setting, in which case ewma is the closest least-loaded option:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ai-workloads
  annotations:
    nginx.ingress.kubernetes.io/load-balance: "least_conn"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  rules:
  - host: ollama.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama-service
            port:
              number: 11434
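Once the Ingress is admitted, a quick end-to-end check can go through the controller. A sketch, assuming ollama.example.com is not yet in DNS and is instead passed as an explicit Host header against the controller's external IP:

# Resolve the ingress controller's external address
INGRESS_IP=$(kubectl get ingress ollama-ingress -n ai-workloads \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# List available models through the ingress
curl -s -H "Host: ollama.example.com" "http://$INGRESS_IP/api/tags"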
Custom Health Checks for Model Availability
Implement health checks that verify model readiness, not just pod liveness. The script below can be stored in a ConfigMap, mounted into the Ollama pods, and invoked from an exec readiness probe (or run by an external checker):
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-health-check
  namespace: ai-workloads
data:
  health-check.sh: |
    #!/bin/bash
    RESPONSE=$(curl -s -X POST http://localhost:11434/api/generate \
      -d '{"model": "llama2", "prompt": "test", "stream": false}' \
      -H "Content-Type: application/json")
    if echo "$RESPONSE" | grep -q "response"; then
      exit 0
    else
      exit 1
    fi
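Before wiring the script into a probe, you can run the same check by hand against a single replica from a throwaway curl pod. A sketch, assuming the llama2 model is pulled on that replica:

# Grab one Ollama pod's IP
POD_IP=$(kubectl get pods -n ai-workloads -l app=ollama \
  -o jsonpath='{.items[0].status.podIP}')

# Exercise the same generate-based check the script performs
kubectl run -it --rm ollama-healthcheck --image=curlimages/curl --restart=Never -- \
  curl -s -X POST http://$POD_IP:11434/api/generate \
  -d '{"model": "llama2", "prompt": "test", "stream": false}'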
HAProxy for High-Performance Load Balancing
For maximum throughput and fine-grained control, HAProxy provides exceptional performance with advanced routing capabilities. This solution works well for bare-metal or VM-based deployments.
HAProxy Configuration for Ollama
global
    log /dev/log local0
    maxconn 4096
    tune.ssl.default-dh-param 2048

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 10s
    timeout client 300s
    timeout server 300s
    timeout http-keep-alive 10s

frontend ollama_frontend
    bind *:80
    default_backend ollama_backend

    # Track request rates
    stick-table type ip size 100k expire 30s store http_req_rate(10s)
    http-request track-sc0 src

    # Rate limiting: 100 requests per 10 seconds
    acl too_many_requests sc_http_req_rate(0) gt 100
    http-request deny if too_many_requests

backend ollama_backend
    balance leastconn
    option httpchk GET /api/tags
    http-check expect status 200

    # Backend servers
    server ollama1 10.0.1.10:11434 check inter 5s fall 3 rise 2 maxconn 10
    server ollama2 10.0.1.11:11434 check inter 5s fall 3 rise 2 maxconn 10
    server ollama3 10.0.1.12:11434 check inter 5s fall 3 rise 2 maxconn 10

    # Stick table for session persistence
    stick-table type string len 32 size 100k expire 30m
    stick on hdr(X-Session-ID)

# Stats endpoint used by the monitoring command below
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
Deploy HAProxy in a containerized environment:
# Create HAProxy container
docker run -d --name haproxy-ollama \
  -p 80:80 \
  -p 8404:8404 \
  -v /path/to/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro \
  --restart unless-stopped \
  haproxy:2.8

# Monitor HAProxy stats
curl http://localhost:8404/stats
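The stats page can also be scraped programmatically for dashboards or quick scripts. A sketch, assuming the listen stats section shown above; the first two CSV columns are the proxy and server names, and column 18 is the health status:

# Show backend server status from the stats CSV export
curl -s "http://localhost:8404/stats;csv" | cut -d, -f1,2,18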
Implementing Intelligent Request Routing
Different models have varying resource requirements. Implement model-aware routing to optimize resource utilization across your cluster.
Python-Based Smart Router
Create a custom routing layer that directs requests based on model characteristics:
from flask import Flask, request, Response
import requests
from collections import defaultdict
import threading
import time

app = Flask(__name__)

# Backend pools organized by model size
BACKENDS = {
    'small': ['http://ollama-small-1:11434', 'http://ollama-small-2:11434'],
    'medium': ['http://ollama-medium-1:11434', 'http://ollama-medium-2:11434'],
    'large': ['http://ollama-large-1:11434']
}

# Model to pool mapping
MODEL_POOLS = {
    'llama2:7b': 'small',
    'mistral': 'small',
    'llama2:13b': 'medium',
    'codellama:34b': 'large'
}

# Track backend health and load
backend_stats = defaultdict(lambda: {'requests': 0, 'healthy': True})
stats_lock = threading.Lock()

def get_backend_for_model(model_name):
    """Select optimal backend based on model requirements"""
    pool = MODEL_POOLS.get(model_name, 'medium')
    available_backends = [
        b for b in BACKENDS[pool]
        if backend_stats[b]['healthy']
    ]
    if not available_backends:
        available_backends = BACKENDS[pool]
    # Least connections selection
    return min(available_backends,
               key=lambda b: backend_stats[b]['requests'])

def health_check():
    """Periodic health check for all backends"""
    while True:
        for pool in BACKENDS.values():
            for backend in pool:
                try:
                    response = requests.get(f"{backend}/api/tags", timeout=5)
                    with stats_lock:
                        backend_stats[backend]['healthy'] = response.status_code == 200
                except requests.RequestException:
                    with stats_lock:
                        backend_stats[backend]['healthy'] = False
        time.sleep(10)

@app.route('/api/generate', methods=['POST'])
def generate():
    data = request.get_json()
    model = data.get('model', 'llama2:7b')
    backend = get_backend_for_model(model)

    with stats_lock:
        backend_stats[backend]['requests'] += 1

    try:
        upstream = requests.post(
            f"{backend}/api/generate",
            json=data,
            stream=True,
            timeout=300
        )
    except requests.RequestException:
        with stats_lock:
            backend_stats[backend]['requests'] -= 1
        return Response('{"error": "backend unavailable"}',
                        status=502, content_type='application/json')

    def stream_and_release():
        # Decrement the connection counter only after the full response
        # has been streamed back to the client.
        try:
            for chunk in upstream.iter_content(chunk_size=1024):
                yield chunk
        finally:
            with stats_lock:
                backend_stats[backend]['requests'] -= 1

    return Response(
        stream_and_release(),
        content_type=upstream.headers.get('content-type', 'application/json')
    )

if __name__ == '__main__':
    # Start health check thread
    health_thread = threading.Thread(target=health_check, daemon=True)
    health_thread.start()
    app.run(host='0.0.0.0', port=8080, threaded=True)
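To try the router, run it with Flask's built-in server (put it behind gunicorn or similar in production) and send a request through it. The filename is hypothetical, and the ollama-* backend hostnames in BACKENDS must resolve to reachable instances:

# Start the router (filename is hypothetical) and give it a moment to come up
python3 smart_router.py &
sleep 2

# The router reads the model field and forwards to the matching backend pool
curl -s http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "codellama:34b", "prompt": "Write a haiku about GPUs", "stream": false}'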
Autoscaling Strategies with HPA and KEDA
Implement dynamic scaling based on actual workload metrics using the Horizontal Pod Autoscaler (HPA) or KEDA for event-driven autoscaling. Note that the ollama_active_requests Pods metric in the example below is a custom metric: it requires a metrics pipeline such as Prometheus plus prometheus-adapter (or KEDA's Prometheus scaler) to make it visible to the autoscaler.
Custom Metrics-Based Autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: ollama_active_requests
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
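If you prefer KEDA for event-driven scaling, the same signal can be pulled straight from Prometheus. A minimal sketch, assuming KEDA and Prometheus are installed, that ollama_active_requests is actually exported by your metrics pipeline, and that the Prometheus address below matches your cluster:

kubectl apply -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ollama-scaledobject
  namespace: ai-workloads
spec:
  scaleTargetRef:
    name: ollama-deployment
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(ollama_active_requests)
      threshold: "15"
EOF

KEDA creates and manages its own HPA for the target, so use either this ScaledObject or the HorizontalPodAutoscaler above for a given Deployment, not both.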
Troubleshooting Common Load Balancing Issues
Problem: Uneven Load Distribution
Symptoms: Some Ollama instances receive disproportionate traffic while others remain idle.
Solution: Verify service endpoint distribution and check for session affinity misconfigurations:
# Check endpoint distribution
kubectl get endpoints ollama-service -n ai-workloads -o yaml

# Disable session affinity if not needed
kubectl patch service ollama-service -n ai-workloads \
  -p '{"spec":{"sessionAffinity":"None"}}'

# Monitor request distribution (count generate calls in each pod's request log)
for pod in $(kubectl get pods -n ai-workloads -l app=ollama -o name); do
  echo "$pod:"
  kubectl logs $pod -n ai-workloads | grep -c "/api/generate"
done
Problem: Slow Model Loading on New Pods
Symptoms: Initial requests to scaled pods timeout or experience high latency.
Solution: Implement model pre-warming with init containers or readiness probes:
spec:
  containers:
  - name: ollama
    image: ollama/ollama:latest
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - |
          curl -s -X POST http://localhost:11434/api/generate \
            -d '{"model": "llama2", "prompt": "ready", "stream": false}' | \
            grep -q response
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 30
      failureThreshold: 3
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            sleep 10
            ollama pull llama2
            ollama run llama2 "warmup" --verbose
Problem: Memory Exhaustion Under Load
Symptoms: Pods crash or restart frequently during high traffic periods.
Solution: Bound the namespace's GPU consumption with a resource quota, keep a minimum number of replicas available with a PodDisruptionBudget, and cap per-instance concurrency either at the load balancer (for example HAProxy's maxconn) or in Ollama itself, as sketched after the commands below:
# Add resource quotas to namespace
kubectl create quota ai-workload-quota -n ai-workloads \
  --hard=requests.nvidia.com/gpu=10,limits.nvidia.com/gpu=10

# Configure pod disruption budget
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: ai-workloads
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ollama
EOF
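Recent Ollama releases also expose their own concurrency controls as environment variables. A hedged sketch, since the exact variable names and defaults depend on your Ollama version (verify against its documentation):

# Cap parallel decoding, queue depth, and loaded models per instance
# (variable names come from recent Ollama releases; check your version's docs)
kubectl set env deployment/ollama-deployment -n ai-workloads \
  OLLAMA_NUM_PARALLEL=2 \
  OLLAMA_MAX_QUEUE=64 \
  OLLAMA_MAX_LOADED_MODELS=1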
Best Practices for Production Deployments
- Model caching strategy: Use persistent volumes for model storage to reduce startup times (see the sketch after this list)
- GPU affinity: Pin pods to specific GPU-enabled nodes using node selectors and taints
- Request timeouts: Set appropriate timeouts (60-300s) based on model size and expected inference time
- Monitoring: Implement comprehensive metrics collection using Prometheus and Grafana
- Circuit breakers: Implement request circuit breakers to prevent cascade failures
- Cost optimization: Use spot instances for non-critical workloads with appropriate scaling policies
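As a concrete example of the first item, a shared read-write-many volume lets every replica reuse already-downloaded models instead of re-pulling them after each scale-up. A minimal sketch, assuming your cluster offers an RWX-capable storage class (the class name below is a placeholder):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: ai-workloads
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: nfs-client  # placeholder: use an RWX storage class from your cluster
  resources:
    requests:
      storage: 100Gi
EOF

Reference this claim from the Deployment's ollama-data volume (replacing the emptyDir shown earlier) so new pods start with the model files already on disk.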
Monitoring and Observability
Deploy a comprehensive monitoring stack to track load-balancing effectiveness. Ollama does not currently expose a Prometheus /metrics endpoint out of the box, so the ServiceMonitor below assumes an exporter sidecar (or your routing layer) publishes metrics through a port named metrics on the Service; the metric names listed afterwards are illustrative rather than built-in:
# Deploy Prometheus ServiceMonitor
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama-monitor
  namespace: ai-workloads
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
EOF

# Key metrics to monitor
# - ollama_request_duration_seconds
# - ollama_active_requests
# - ollama_model_load_duration_seconds
# - ollama_gpu_utilization_percent
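Once those metrics are flowing, they can be queried directly through Prometheus' HTTP API. A sketch, assuming Prometheus is reachable at the address below and the histogram metric above is exported with that name:

# p95 inference latency over the last 5 minutes
curl -sG http://prometheus.monitoring.svc:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(ollama_request_duration_seconds_bucket[5m])) by (le))'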
Conclusion
Scaling Ollama deployments requires a thoughtful approach to load balancing that considers the unique characteristics of LLM inference workloads. Whether you choose Kubernetes-native services, NGINX Ingress, HAProxy, or custom routing solutions, the key is matching your load balancing strategy to your specific workload patterns and infrastructure constraints.
Start with simple round-robin distribution for uniform workloads, then progress to sophisticated strategies like least-connections or model-aware routing as your requirements evolve. Always implement comprehensive monitoring, health checks, and autoscaling to ensure reliable performance under varying load conditions.
By following the patterns and configurations outlined in this guide, you’ll be well-equipped to build robust, scalable Ollama deployments that can handle production AI workloads efficiently.