Autoscaling AI and machine learning workloads presents unique challenges that traditional application scaling strategies often fail to address. Unlike stateless web applications, ML workloads involve GPU-intensive computations, long-running inference tasks, and complex dependencies on external data sources. In this comprehensive guide, we’ll explore how to leverage the Kubernetes Horizontal Pod Autoscaler (HPA) and KEDA (Kubernetes Event-Driven Autoscaling) to build production-ready autoscaling solutions for your ML applications.
Understanding the Autoscaling Challenge for AI Workloads
Machine learning applications differ fundamentally from traditional microservices. A single inference request might consume significant GPU resources, model loading can take minutes, and request patterns are often unpredictable. Standard CPU-based autoscaling fails to capture these nuances, leading to either resource waste or degraded performance.
The key challenges include:
- Cold start latency: ML models can take 30-120 seconds to load into memory
- GPU utilization metrics: Standard HPA doesn’t natively support GPU metrics
- Event-driven patterns: ML workloads often respond to message queues, not HTTP traffic
- Custom metrics: Inference queue depth and model-specific metrics matter more than CPU
Horizontal Pod Autoscaler (HPA) for ML Inference
HPA is Kubernetes’ native autoscaling solution that adjusts the number of pod replicas based on observed metrics. While HPA traditionally focuses on CPU and memory, it can be extended with custom metrics for ML workloads.
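Under the hood, HPA computes its target with a simple ratio: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the configured bounds. A quick sketch of that formula (an approximation that ignores stabilization windows and tolerance) helps predict how a given target will behave:

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Approximate the HPA scaling formula:
    desired = ceil(currentReplicas * currentMetricValue / targetMetricValue),
    clamped to [minReplicas, maxReplicas]."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(desired, max_replicas))
```

For example, 4 replicas averaging 90% CPU against a 70% utilization target scale to ceil(4 × 90 / 70) = 6 replicas.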
Setting Up HPA with Custom Metrics
First, deploy the Metrics Server if you haven’t already:
```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
Here’s a production-ready deployment for an ML inference service with resource requests properly configured:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-service
  namespace: ml-workloads
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: inference
        image: your-registry/ml-inference:v1.0
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "8Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
        env:
        - name: MODEL_PATH
          value: "/models/resnet50"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 90
          periodSeconds: 10
```
Now configure HPA with multiple metrics including custom ones:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
  namespace: ml-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max
```
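With selectPolicy: Max, each scale-up period permits whichever of the two policies allows the larger change. A small sketch of the allowed step per period (an approximation that ignores the stabilization window):

```python
import math

def scale_up_step(current_replicas: int, percent: int, pods: int) -> int:
    """With selectPolicy: Max, the allowed scale-up per period is the larger
    of the Percent policy (a fraction of current replicas) and the Pods
    policy (a fixed number of pods)."""
    by_percent = math.ceil(current_replicas * percent / 100)
    return max(by_percent, pods)
```

With the configuration above (100% or 2 pods per 30 seconds), a deployment at 2 replicas can add 2 per period, while one at 5 replicas can add 5, so growth accelerates as the fleet gets larger.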
Exposing Custom Metrics for ML Workloads
To expose custom metrics like inference queue depth, implement a Prometheus exporter in your application (the Prometheus Adapter then surfaces these metrics to the HPA through the custom metrics API):
```python
import queue
import threading
import time

from prometheus_client import Gauge, start_http_server

# Initialize metrics
inference_queue_depth = Gauge('inference_queue_depth', 'Number of pending inference requests')
model_load_time = Gauge('model_load_time_seconds', 'Time taken to load the model')
active_inferences = Gauge('active_inferences', 'Number of currently processing inferences')

class MLInferenceService:
    def __init__(self):
        self.request_queue = queue.Queue()
        start_http_server(8000)  # Prometheus metrics endpoint
        # Refresh queue-depth metrics in a background thread
        threading.Thread(target=self.update_metrics, daemon=True).start()

    def update_metrics(self):
        """Update Prometheus metrics continuously"""
        while True:
            inference_queue_depth.set(self.request_queue.qsize())
            time.sleep(5)

    def process_inference(self, input_data):
        active_inferences.inc()
        try:
            # Your inference logic here; self.model is loaded at startup
            result = self.model.predict(input_data)
            return result
        finally:
            active_inferences.dec()
```
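The Deployment above probes /health and /ready on port 8080. A minimal sketch of matching endpoints, using only the standard library (the module-level model_loaded flag is illustrative), shows how readiness can be withheld until the model is in memory, so traffic only arrives after the cold start completes:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

model_loaded = threading.Event()  # set once the model finishes loading

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)  # liveness: the process is up
        elif self.path == "/ready":
            # readiness: succeed only once the model is in memory, so the
            # Service withholds traffic during the cold start
            self.send_response(200 if model_loaded.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port=8080):
    """Run the probe endpoints in a daemon thread; returns the server."""
    server = HTTPServer(("", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In practice the same process would call model_loaded.set() right after the model finishes loading, which also lines up with the readinessProbe's initialDelaySeconds of 90.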
KEDA: Event-Driven Autoscaling for ML Pipelines
KEDA extends Kubernetes autoscaling capabilities by enabling event-driven scaling based on external metrics sources. This is particularly powerful for ML workloads that consume from message queues, process batch jobs, or respond to cloud storage events.
Installing KEDA
```bash
# Add KEDA Helm repository
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA
helm install keda kedacore/keda --namespace keda --create-namespace

# Verify installation
kubectl get pods -n keda
```
Scaling Based on Message Queue Depth
For ML batch processing pipelines that consume from RabbitMQ or Apache Kafka, KEDA provides native scalers:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-batch-processor-scaler
  namespace: ml-workloads
spec:
  scaleTargetRef:
    name: ml-batch-processor
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
  - type: rabbitmq
    metadata:
      protocol: auto
      queueName: ml-inference-queue
      mode: QueueLength
      value: "5"
      activationValue: "1"
    authenticationRef:
      name: rabbitmq-auth
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
      metricName: gpu_utilization
      threshold: "70"
      query: avg(gpu_utilization{job="ml-inference"})
```
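With mode: QueueLength, the trigger above aims for roughly one replica per value messages, staying at zero until activationValue is exceeded. A rough sketch of the resulting replica count (an approximation of KEDA's behavior that ignores HPA stabilization and the cooldown period):

```python
import math

def keda_queue_replicas(queue_length: int, value: int,
                        activation_value: int, max_replicas: int) -> int:
    """Approximate replica count for a KEDA QueueLength trigger with
    minReplicaCount: 0."""
    if queue_length <= activation_value:
        return 0  # at or below activationValue: remain scaled to zero
    # Once active, the underlying HPA targets ceil(queueLength / value)
    return min(math.ceil(queue_length / value), max_replicas)
```

With value "5" and activationValue "1" as configured above, a queue of 7 messages yields 2 replicas, while a backlog of 200 saturates at the maxReplicaCount of 20.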
Create the authentication secret for RabbitMQ:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-auth
  namespace: ml-workloads
type: Opaque
stringData:
  host: amqp://rabbitmq.messaging.svc.cluster.local:5672
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
  namespace: ml-workloads
spec:
  secretTargetRef:
  - parameter: host
    name: rabbitmq-auth
    key: host
```
Scaling to Zero for Cost Optimization
One of KEDA’s most powerful features for ML workloads is scaling to zero when there’s no demand:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-training-job-scaler
  namespace: ml-workloads
spec:
  scaleTargetRef:
    name: ml-training-worker
  minReplicaCount: 0   # Scale to zero when idle
  maxReplicaCount: 5
  pollingInterval: 30
  cooldownPeriod: 600  # Wait 10 minutes before scaling down
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: aws-credentials
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/ml-training-jobs
      queueLength: "2"
      awsRegion: "us-east-1"
      activationQueueLength: "1"
```
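The ml-training-worker this ScaledObject targets needs to drain the same queue and then go idle so the cooldownPeriod can return it to zero. A sketch of such a consumer, assuming boto3 (the process_job callback, idle_polls parameter, and injectable client are illustrative, not part of any KEDA API):

```python
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/ml-training-jobs"

def run_worker(process_job, sqs=None, queue_url=QUEUE_URL, idle_polls=3):
    """Drain training jobs from SQS; return after `idle_polls` consecutive
    empty receives so KEDA's cooldownPeriod can scale the Deployment to zero."""
    if sqs is None:
        import boto3  # assumption: workers use boto3 against the queue KEDA watches
        sqs = boto3.client("sqs", region_name="us-east-1")
    empty = 0
    while empty < idle_polls:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)  # long polling
        messages = resp.get("Messages", [])
        if not messages:
            empty += 1
            continue
        empty = 0
        for msg in messages:
            process_job(msg["Body"])  # run the training step
            # Delete only after successful processing, so a crashed worker
            # leaves the message visible for a replacement pod
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```

Deleting the message only after processing means the queue depth KEDA polls stays honest, and an interrupted job simply reappears for the next worker.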
Advanced Patterns: Combining HPA and KEDA
For sophisticated ML platforms, you can run HPA and KEDA side by side to handle both real-time inference and batch processing: HPA scales the real-time service while KEDA drives the batch processors, with each Deployment owned by exactly one autoscaler so their replica targets never conflict:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hybrid-ml-service
  namespace: ml-workloads
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hybrid-ml
  template:
    metadata:
      labels:
        app: hybrid-ml
    spec:
      containers:
      - name: ml-service
        image: your-registry/hybrid-ml:v2.0
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8000
          name: metrics
        resources:
          requests:
            memory: "8Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "16Gi"
            cpu: "8000m"
            nvidia.com/gpu: "1"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hybrid-ml-hpa
  namespace: ml-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hybrid-ml-service
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: External
    external:
      metric:
        name: requests_per_second
        selector:
          matchLabels:
            service: hybrid-ml
      target:
        type: AverageValue
        averageValue: "100"
```
Monitoring and Troubleshooting
Essential Monitoring Commands
```bash
# Check HPA status
kubectl get hpa -n ml-workloads
kubectl describe hpa ml-inference-hpa -n ml-workloads

# View HPA events
kubectl get events -n ml-workloads --field-selector involvedObject.name=ml-inference-hpa

# Check KEDA scaled objects
kubectl get scaledobjects -n ml-workloads
kubectl describe scaledobject ml-batch-processor-scaler -n ml-workloads

# View KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100

# Check current metrics
kubectl top pods -n ml-workloads
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/ml-workloads/pods" | jq .
```
Common Issues and Solutions
Issue: HPA shows “unknown” for custom metrics
Solution: Verify that your Prometheus Adapter is correctly configured and the metrics are being scraped:
```bash
# Check if custom metrics API is available
kubectl get apiservices | grep custom.metrics

# Test custom metrics endpoint
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/ml-workloads/pods/*/inference_queue_depth" | jq .
```
Issue: KEDA not scaling pods
Solution: Check trigger authentication and polling intervals:
```bash
# Verify KEDA can reach external metrics source
kubectl logs -n keda deployment/keda-operator | grep -i error

# Check ScaledObject status
kubectl get scaledobject ml-batch-processor-scaler -n ml-workloads -o yaml | grep -A 10 status
```
Issue: Pods scaling too aggressively
Solution: Adjust stabilization windows and cooldown periods:
```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Increase to 10 minutes
    policies:
    - type: Percent
      value: 25                      # Scale down more gradually
      periodSeconds: 120
```
Best Practices for Production ML Autoscaling
- Set appropriate resource requests: Ensure CPU and memory requests accurately reflect your model’s requirements to avoid scheduling issues
- Implement proper health checks: Use readiness probes with adequate initialDelaySeconds to account for model loading time
- Use PodDisruptionBudgets: Prevent excessive pod terminations during scale-down events
- Monitor cold start latency: Track the time from pod creation to first successful inference
- Implement request queuing: Buffer incoming requests to smooth out traffic spikes
- Use node affinity for GPU workloads: Ensure pods are scheduled on appropriate GPU-enabled nodes
- Set conservative scale-down policies: ML workloads benefit from longer cooldown periods due to cold start costs
- Test scaling behavior: Use load testing tools to validate autoscaling configuration before production
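Cold start latency (point 4 above) can be captured with a small tracker that records the interval from process start to the first successful inference; export the value through a Prometheus gauge as in the exporter shown earlier (names and the injectable clock are illustrative):

```python
import time

class ColdStartTracker:
    """Tracks seconds from process start to the first successful inference."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started_at = clock()
        self.cold_start_seconds = None  # None until the first inference lands

    def record_inference(self):
        """Call after each successful inference; only the first one counts."""
        if self.cold_start_seconds is None:
            self.cold_start_seconds = self._clock() - self._started_at
        return self.cold_start_seconds
```

Watching this value across pod restarts tells you whether your initialDelaySeconds settings and scale-down cooldowns are calibrated to the real model-loading cost.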
Performance Optimization Tips
Create a PodDisruptionBudget to maintain service availability during scaling:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-inference-pdb
  namespace: ml-workloads
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ml-inference
```
Keep GPU capacity warm by pinning workloads to pre-provisioned GPU node types with node affinity:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - g4dn.xlarge
          - p3.2xlarge
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload-type
          operator: In
          values:
          - ml-inference
```
Conclusion
Autoscaling AI and ML workloads on Kubernetes requires a nuanced approach that goes beyond traditional CPU-based metrics. By combining HPA’s native Kubernetes integration with KEDA’s event-driven capabilities, you can build robust, cost-effective scaling solutions that handle both real-time inference and batch processing workloads.
The key to success lies in understanding your workload characteristics, implementing appropriate metrics, and fine-tuning scaling behaviors through stabilization windows and cooldown periods. Start with conservative settings, monitor closely, and iterate based on observed behavior in your production environment.
Remember that autoscaling is not a set-and-forget solution—continuous monitoring and adjustment based on actual usage patterns will ensure optimal performance and cost efficiency for your ML applications.