Large Language Model (LLM) inference workloads present unique challenges for Kubernetes autoscaling. Unlike traditional microservices, LLM deployments require GPU resources, have unpredictable latency patterns, and consume significant memory. This comprehensive guide explores production-ready autoscaling strategies specifically designed for LLM inference on Kubernetes.
Understanding LLM Inference Characteristics
Before implementing autoscaling, it’s crucial to understand what makes LLM workloads different:
- GPU Dependency: Most LLMs require GPU acceleration, making pod scheduling more complex
- Long Startup Times: Loading multi-gigabyte models can take 30-120 seconds
- Variable Request Duration: Inference time varies dramatically based on prompt length and generation parameters
- Memory Intensive: Models like Llama-2-70B require 140GB+ of GPU memory for the weights alone at 16-bit precision
- Batch Processing Benefits: Throughput improves significantly with request batching
Autoscaling Strategies for LLM Workloads
1. Horizontal Pod Autoscaler (HPA) with Custom Metrics
The standard HPA can work for LLM inference when configured with appropriate metrics. CPU-based scaling is insufficient; instead, focus on GPU utilization and queue depth.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "75"
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max
Key configuration considerations:
- stabilizationWindowSeconds: Set higher for scale-down (300s) to prevent thrashing during traffic fluctuations
- scaleUp policies: Aggressive scaling up (100% or 2 pods) to handle sudden traffic spikes
- Custom metrics: GPU utilization and queue depth provide better signals than CPU/memory alone (see the adapter rule sketch below)
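The Pods-type metrics above (gpu_utilization, inference_queue_depth) are not built into Kubernetes; they have to be served through the custom metrics API, most commonly by prometheus-adapter. A minimal adapter rule sketch, assuming Prometheus already scrapes the llm_inference_queue_depth gauge defined later in this guide (gpu_utilization would need a similar rule backed by a GPU exporter such as NVIDIA DCGM):

rules:
- seriesQuery: 'llm_inference_queue_depth{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "llm_inference_queue_depth"
    as: "inference_queue_depth"
  metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)

This rule goes into the adapter's configuration (typically the config.yaml key of its ConfigMap); after a restart, the metric appears under /apis/custom.metrics.k8s.io/v1beta1 and the HPA can consume it.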
2. KEDA (Kubernetes Event-Driven Autoscaling)
KEDA excels at scaling LLM workloads based on external metrics like message queues, which is ideal for asynchronous inference patterns.
# Install KEDA
kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.12.0/keda-2.12.0.yaml
# Verify installation
kubectl get pods -n keda
Here’s a KEDA ScaledObject configuration for Redis-backed inference queues:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
  namespace: ml-workloads
spec:
  scaleTargetRef:
    name: llm-inference
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 1
  maxReplicaCount: 20
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 25
            periodSeconds: 60
  triggers:
  - type: redis
    metadata:
      addressFromEnv: REDIS_HOST
      listName: inference_queue
      listLength: "5"
      activationListLength: "1"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: gpu_memory_utilization
      threshold: "80"
      query: avg(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100)
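The Redis trigger assumes an asynchronous pattern: clients push requests onto inference_queue and the inference pods pop them. A minimal worker-loop sketch in Python, assuming redis-py, a REDIS_HOST value of the form host:port (matching addressFromEnv above), and a hypothetical run_inference() standing in for the model call:

import json
import os

import redis

# Connect to the same Redis instance KEDA watches.
host, port = os.environ["REDIS_HOST"].split(":")
client = redis.Redis(host=host, port=int(port))

def run_inference(prompt: str) -> str:
    """Hypothetical placeholder for the actual model call."""
    raise NotImplementedError

while True:
    # BLPOP blocks until a request arrives; KEDA scales replicas based on
    # the length of this same list.
    _, raw = client.blpop("inference_queue")
    request = json.loads(raw)
    output = run_inference(request["prompt"])
    # Push the result onto a per-request reply list that the client polls.
    client.rpush(f"results:{request['id']}", json.dumps({"output": output}))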
3. Vertical Pod Autoscaler (VPA) for Resource Optimization
VPA helps right-size resource requests for LLM pods, which is particularly important when running different model sizes. Note that VPA only manages CPU and memory; the GPU count stays at whatever the Deployment requests.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-inference-vpa
  namespace: ml-workloads
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: inference-server
      minAllowed:
        memory: "16Gi"
      maxAllowed:
        memory: "80Gi"
      controlledResources:
      - memory
      mode: Auto
Keep in mind that updateMode: "Auto" evicts pods to apply new recommendations, which is disruptive given long model-load times; "Initial" or "Off" (applying recommendations manually) is often safer for LLM workloads.
Production Deployment Configuration
LLM Inference Deployment with Autoscaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-workloads
  labels:
    app: llm-inference
    model: llama-2-13b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8001"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: inference-server
        image: your-registry/llm-inference:v1.0
        resources:
          requests:
            memory: "24Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: "1"
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-2-13b-chat-hf"
        - name: MAX_BATCH_SIZE
          value: "8"
        - name: MAX_CONCURRENT_REQUESTS
          value: "32"
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: metrics
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 30"]
Monitoring and Custom Metrics
Exposing LLM-Specific Metrics
Implement a custom metrics exporter in your inference server:
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
inference_requests_total = Counter(
    'llm_inference_requests_total',
    'Total number of inference requests',
    ['model', 'status']
)

inference_duration_seconds = Histogram(
    'llm_inference_duration_seconds',
    'Inference request duration in seconds',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

queue_depth = Gauge(
    'llm_inference_queue_depth',
    'Number of requests waiting in queue'
)

gpu_memory_used = Gauge(
    'llm_gpu_memory_used_bytes',
    'GPU memory used by model'
)

active_requests = Gauge(
    'llm_active_requests',
    'Number of currently processing requests'
)

def track_inference(model_name):
    """Decorator to track inference metrics"""
    def decorator(func):
        def wrapper(*args, **kwargs):
            start_time = time.time()
            active_requests.inc()
            try:
                result = func(*args, **kwargs)
                inference_requests_total.labels(
                    model=model_name,
                    status='success'
                ).inc()
                return result
            except Exception:
                inference_requests_total.labels(
                    model=model_name,
                    status='error'
                ).inc()
                raise
            finally:
                duration = time.time() - start_time
                inference_duration_seconds.labels(
                    model=model_name
                ).observe(duration)
                active_requests.dec()
        return wrapper
    return decorator

# Start metrics server
start_http_server(8001)
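A usage sketch for the decorator, continuing the module above; the generate() function is a hypothetical stand-in for the real model call:

@track_inference("llama-2-13b")
def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # Stand-in for the real model call (e.g. a vLLM or Transformers pipeline).
    time.sleep(0.5)
    return prompt + " ..."

# Each call increments llm_inference_requests_total{model="llama-2-13b"}
# and records its latency in the llm_inference_duration_seconds histogram.
print(generate("Explain Kubernetes autoscaling in one sentence."))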
Prometheus ServiceMonitor Configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-inference-metrics
  namespace: ml-workloads
spec:
  selector:
    matchLabels:
      app: llm-inference
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
Advanced Scaling Patterns
Multi-Model Serving with Selective Scaling
For environments serving multiple models, implement model-specific autoscaling:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-13b-scaler
spec:
  scaleTargetRef:
    name: llama-13b-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: model_request_rate
      threshold: "10"
      query: |
        sum(rate(llm_inference_requests_total{
          model="llama-13b",
          status="success"
        }[2m]))
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-70b-scaler
spec:
  scaleTargetRef:
    name: llama-70b-inference
  minReplicaCount: 0
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: model_request_rate
      threshold: "5"
      query: |
        sum(rate(llm_inference_requests_total{
          model="llama-70b",
          status="success"
        }[2m]))
Troubleshooting Common Issues
Issue 1: Pods Scaling Too Slowly
Symptoms: Request queues building up, increased latency during traffic spikes
Solutions:
# Check HPA status
kubectl describe hpa llm-inference-hpa -n ml-workloads
# Verify metrics are being collected
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
# Check for resource constraints
kubectl describe nodes | grep -A 5 "Allocated resources"
- Reduce stabilizationWindowSeconds for scale-up
- Increase scaleUp policy percentages
- Pre-warm pods using minReplicas
- Implement pod priority classes for LLM workloads (see the sketch below)
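A PriorityClass keeps inference pods ahead of lower-priority workloads when GPU capacity is scarce; a minimal sketch (the name and value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: llm-inference-critical
value: 1000000
globalDefault: false
description: "Reserved for LLM inference pods"

Reference it from the Deployment via spec.template.spec.priorityClassName: llm-inference-critical.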
Issue 2: GPU Node Unavailability
Symptoms: Pods stuck in Pending state despite autoscaling triggers
# Check pending pods
kubectl get pods -n ml-workloads | grep Pending
# Describe pending pod for details
kubectl describe pod <pod-name> -n ml-workloads
# Check cluster autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler
Solutions:
- Configure Cluster Autoscaler with GPU node pools
- Set appropriate maxReplicas based on available GPU nodes
- Implement pod topology spread constraints
- Use node affinity for GPU node selection (see the sketch after this list)
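A node-affinity sketch for the pod template (placed under spec.template.spec), reusing the GKE accelerator label from the Deployment above; the accepted values are assumptions, so match them to your node pools:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-accelerator
          operator: In
          values:
          - nvidia-tesla-t4
          - nvidia-l4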
Issue 3: Thrashing (Rapid Scale Up/Down)
Symptoms: Frequent pod creation and termination, unstable replica counts
Solutions:
- Increase stabilizationWindowSeconds for both scale-up and scale-down
- Adjust metric thresholds to provide more buffer
- Implement a longer cooldownPeriod in KEDA
- Use multiple metrics with different sensitivities
Best Practices for Production
1. Implement Graceful Shutdown
Ensure in-flight requests complete before pod termination:
lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - |
        # Stop accepting new requests
        touch /tmp/shutdown
        # Wait for existing requests to complete (max 60s)
        for i in $(seq 1 60); do
          if [ "$(curl -s localhost:8000/active_requests)" -eq 0 ]; then
            exit 0
          fi
          sleep 1
        done
Because a preStop hook counts against the pod's termination grace period, pair this with terminationGracePeriodSeconds of at least 90 in the pod spec; the default 30 seconds would cut the wait short.
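The hook assumes the inference server exposes an /active_requests endpoint that returns the number of in-flight requests as a bare integer. A minimal sketch, assuming FastAPI (the increment/decrement around each inference call is omitted):

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()
in_flight = 0  # incremented before and decremented after each inference call

@app.get("/active_requests", response_class=PlainTextResponse)
def active_requests_count() -> str:
    # The preStop script compares this value with `-eq 0`,
    # so return a plain integer rather than JSON.
    return str(in_flight)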
2. Set Appropriate Resource Limits
# Monitor actual resource usage
kubectl top pods -n ml-workloads --containers
# Use VPA recommendations
kubectl get vpa llm-inference-vpa -n ml-workloads -o jsonpath='{.status.recommendation}'
3. Configure Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-inference-pdb
  namespace: ml-workloads
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-inference
Note that with the HPA floor at two replicas, minAvailable: 2 blocks all voluntary disruptions (including node drains) whenever the Deployment is at its minimum; use minAvailable: 1 or a percentage if that is too restrictive.
4. Optimize Model Loading
- Use persistent volumes for model caching
- Implement model warm-up in readiness probes
- Consider init containers for model pre-loading (sketched after this list)
- Use faster storage classes (SSD-backed PVs)
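A sketch of the init-container pattern for the pod template, assuming a pre-provisioned, SSD-backed PVC named model-cache and a placeholder downloader image; the init container populates the cache once, and the inference server loads weights from the shared volume:

spec:
  template:
    spec:
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache   # assumed SSD-backed PVC
      initContainers:
      - name: model-downloader
        image: your-registry/model-downloader:v1.0   # placeholder image with huggingface-cli installed
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Skip the download if the weights are already cached
          if [ ! -d /models/llama-2-13b ]; then
            huggingface-cli download meta-llama/Llama-2-13b-chat-hf \
              --local-dir /models/llama-2-13b
          fi
        volumeMounts:
        - name: model-cache
          mountPath: /models
      containers:
      - name: inference-server
        volumeMounts:
        - name: model-cache
          mountPath: /models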
Performance Benchmarking
Test your autoscaling configuration under load:
# Install load testing tool
kubectl apply -f https://raw.githubusercontent.com/grafana/k6-operator/main/bundle.yaml

# Create load test
cat <<EOF | kubectl apply -f -
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: llm-load-test
spec:
  parallelism: 4
  script:
    configMap:
      name: llm-load-test-script
      file: test.js
EOF

# Monitor scaling behavior
watch kubectl get hpa,pods -n ml-workloads
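The K6 resource above references a ConfigMap that is not shown in this guide. A minimal sketch for creating it; the test.js targets a hypothetical /generate endpoint on the llm-inference Service, so adjust the URL, payload, and namespace to your API:

cat <<'EOF' > test.js
import http from 'k6/http';
import { sleep } from 'k6';

export const options = { vus: 20, duration: '10m' };

export default function () {
  const payload = JSON.stringify({ prompt: 'Explain Kubernetes autoscaling.', max_tokens: 128 });
  http.post('http://llm-inference.ml-workloads:8000/generate', payload, {
    headers: { 'Content-Type': 'application/json' },
    timeout: '120s',
  });
  sleep(1);
}
EOF

# Create the ConfigMap in the same namespace where the K6 resource is applied
kubectl create configmap llm-load-test-script --from-file=test.js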
Conclusion
Autoscaling LLM inference workloads on Kubernetes requires careful consideration of GPU resources, model characteristics, and traffic patterns. By implementing the strategies outlined in this guide—combining HPA, KEDA, and VPA with custom metrics and proper monitoring—you can build a cost-effective, responsive inference infrastructure that handles variable demand efficiently.
Key takeaways:
- Use custom metrics (GPU utilization, queue depth) instead of CPU/memory alone
- Configure aggressive scale-up with conservative scale-down policies
- Implement comprehensive monitoring with LLM-specific metrics
- Test thoroughly under realistic load conditions
- Plan for GPU node availability and cluster autoscaling
Start with conservative settings and iterate based on observed behavior. Monitor costs alongside performance metrics, and continuously optimize your configuration as traffic patterns evolve.