Collabnix Team
The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience across industries and technical domains.

Autoscaling AI Workloads: HPA and KEDA for ML Applications

5 min read

Autoscaling AI and machine learning workloads presents unique challenges that traditional application scaling strategies often fail to address. Unlike stateless web applications, ML workloads involve GPU-intensive computations, long-running inference tasks, and complex dependencies on external data sources. In this comprehensive guide, we’ll explore how to leverage the Kubernetes Horizontal Pod Autoscaler (HPA) and KEDA (Kubernetes Event-driven Autoscaling) to build production-ready autoscaling solutions for your ML applications.

Understanding the Autoscaling Challenge for AI Workloads

Machine learning applications differ fundamentally from traditional microservices. A single inference request might consume significant GPU resources, model loading can take minutes, and request patterns are often unpredictable. Standard CPU-based autoscaling fails to capture these nuances, leading to either resource waste or degraded performance.

The key challenges include:

  • Cold start latency: ML models can take 30-120 seconds to load into memory
  • GPU utilization metrics: Standard HPA doesn’t natively support GPU metrics
  • Event-driven patterns: ML workloads often respond to message queues, not HTTP traffic
  • Custom metrics: Inference queue depth and model-specific metrics matter more than CPU

Horizontal Pod Autoscaler (HPA) for ML Inference

HPA is Kubernetes’ native autoscaling solution that adjusts the number of pod replicas based on observed metrics. While HPA traditionally focuses on CPU and memory, it can be extended with custom metrics for ML workloads.
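Under the hood, HPA applies a simple proportional rule each sync period: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the min/max bounds. A minimal sketch of that calculation:

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10):
    """HPA's core formula: scale proportionally to how far the observed
    metric is from its target, then clamp to the replica bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods averaging 90% CPU against a 70% target -> scale out to 6
print(hpa_desired_replicas(4, 90, 70))  # 6
```

This is why accurate resource requests matter: utilization targets are percentages of the *requested* CPU and memory, so a wrong request skews every scaling decision.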

Setting Up HPA with Custom Metrics

First, deploy the Metrics Server if you haven’t already:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Here’s a production-ready deployment for an ML inference service with resource requests properly configured:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-service
  namespace: ml-workloads
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: inference
        image: your-registry/ml-inference:v1.0
        ports:
        - containerPort: 8080
          name: http
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "8Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
        env:
        - name: MODEL_PATH
          value: "/models/resnet50"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 90
          periodSeconds: 10
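The readiness probe above only helps if /ready reports healthy strictly after the model is in memory; otherwise traffic arrives during the cold start. A minimal sketch of that gating logic, using only the standard library (the loader here is a stand-in for your framework's actual model-loading call):

```python
import threading
import time

class InferenceServer:
    """Readiness gated on model load: /ready reports unhealthy (503)
    until a background thread finishes loading the model."""

    def __init__(self):
        self._model_loaded = threading.Event()
        self.model = None
        threading.Thread(target=self._load_model, daemon=True).start()

    def _load_model(self):
        # Stand-in for a real loader; the sleep simulates load time.
        time.sleep(0.1)
        self.model = object()
        self._model_loaded.set()

    def ready(self):
        """Status code backing the /ready probe."""
        return 200 if self._model_loaded.is_set() else 503

server = InferenceServer()
print(server.ready())           # 503 while the model is still loading
server._model_loaded.wait(5)
print(server.ready())           # 200 once loaded
```

Pair this with an initialDelaySeconds that covers your slowest observed load, so the probe never marks a still-loading pod ready.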

Now configure HPA with multiple metrics including custom ones:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
  namespace: ml-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max
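With the Pods-type metric above, HPA averages inference_queue_depth across pods against the 10-per-pod target; when several metrics are configured, it adopts the largest replica count any one of them proposes. A sketch of that arithmetic:

```python
import math

def desired_from_pods_metric(queue_depths, target_avg=10):
    """Pods-type metric: HPA targets the per-pod average,
    so desired = ceil(total metric value / target average)."""
    return math.ceil(sum(queue_depths) / target_avg)

# 3 pods with queues of 18, 22 and 20 -> 60 total / 10 per pod = 6 replicas
queue_proposal = desired_from_pods_metric([18, 22, 20])

# With multiple metrics HPA takes the largest proposal; suppose the
# CPU metric only asked for 4 replicas:
print(max(4, queue_proposal))  # 6
```

This "max of all metrics" behavior is what lets the queue-depth signal override CPU when requests pile up faster than CPU utilization rises.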

Exposing Custom Metrics for ML Workloads

To expose custom metrics like inference queue depth, implement a Prometheus exporter in your application:

from prometheus_client import Gauge, start_http_server
import queue
import threading
import time

# Initialize metrics
inference_queue_depth = Gauge('inference_queue_depth', 'Number of pending inference requests')
model_load_time = Gauge('model_load_time_seconds', 'Time taken to load the model')
active_inferences = Gauge('active_inferences', 'Number of currently processing inferences')

class MLInferenceService:
    def __init__(self, model):
        self.model = model  # your loaded model object
        self.request_queue = queue.Queue()
        start_http_server(8000)  # Prometheus metrics endpoint
        # Publish queue depth continuously in the background
        threading.Thread(target=self.update_metrics, daemon=True).start()
        
    def update_metrics(self):
        """Update Prometheus metrics continuously"""
        while True:
            inference_queue_depth.set(self.request_queue.qsize())
            time.sleep(5)
    
    def process_inference(self, input_data):
        active_inferences.inc()
        try:
            # Your inference logic here
            result = self.model.predict(input_data)
            return result
        finally:
            active_inferences.dec()
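Exporting the gauge is only half the job: HPA can see inference_queue_depth only after an adapter projects it onto the custom metrics API. A hedged example of a Prometheus Adapter rule for the exporter above (the label matchers and metric naming are assumptions for a typical prometheus-adapter setup; adjust to your scrape labels):

```yaml
rules:
- seriesQuery: 'inference_queue_depth{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "inference_queue_depth"
    as: "inference_queue_depth"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Once this rule is loaded, the raw custom-metrics query shown in the troubleshooting section below should return values instead of errors.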

KEDA: Event-Driven Autoscaling for ML Pipelines

KEDA extends Kubernetes autoscaling capabilities by enabling event-driven scaling based on external metrics sources. This is particularly powerful for ML workloads that consume from message queues, process batch jobs, or respond to cloud storage events.

Installing KEDA

# Add KEDA Helm repository
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA
helm install keda kedacore/keda --namespace keda --create-namespace

# Verify installation
kubectl get pods -n keda

Scaling Based on Message Queue Depth

For ML batch processing pipelines that consume from RabbitMQ or Apache Kafka, KEDA provides native scalers:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-batch-processor-scaler
  namespace: ml-workloads
spec:
  scaleTargetRef:
    name: ml-batch-processor
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
  - type: rabbitmq
    metadata:
      protocol: auto
      queueName: ml-inference-queue
      mode: QueueLength
      value: "5"
      activationValue: "1"
    authenticationRef:
      name: rabbitmq-auth
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
      metricName: gpu_utilization
      threshold: "70"
      query: avg(gpu_utilization{job="ml-inference"})

Create the authentication secret for RabbitMQ:

apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-auth
  namespace: ml-workloads
type: Opaque
stringData:
  host: amqp://rabbitmq.messaging.svc.cluster.local:5672
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
  namespace: ml-workloads
spec:
  secretTargetRef:
  - parameter: host
    name: rabbitmq-auth
    key: host

Scaling to Zero for Cost Optimization

One of KEDA’s most powerful features for ML workloads is scaling to zero when there’s no demand:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ml-training-job-scaler
  namespace: ml-workloads
spec:
  scaleTargetRef:
    name: ml-training-worker
  minReplicaCount: 0  # Scale to zero when idle
  maxReplicaCount: 5
  pollingInterval: 30
  cooldownPeriod: 600  # Wait 10 minutes before scaling down
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: aws-credentials
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/ml-training-jobs
      queueLength: "2"
      awsRegion: "us-east-1"
      activationQueueLength: "1"
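The economics of minReplicaCount: 0 are easy to quantify for GPU nodes: you pay only while work is queued. A back-of-envelope sketch (the hourly rate and utilization figures are illustrative assumptions, not quoted prices):

```python
def monthly_gpu_cost(replicas, hourly_rate, active_hours_per_day, days=30):
    """Cost of GPU workers that run only while jobs are queued."""
    return replicas * hourly_rate * active_hours_per_day * days

always_on = monthly_gpu_cost(5, 1.00, 24)      # never scales to zero
scale_to_zero = monthly_gpu_cost(5, 1.00, 6)   # queue busy ~6h/day
print(f"${always_on:.0f} vs ${scale_to_zero:.0f}")  # $3600 vs $900
```

The trade-off is cold-start latency on the first job after an idle period, which is why cooldownPeriod above is set generously at 10 minutes.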

Advanced Patterns: Combining HPA and KEDA

For sophisticated ML platforms, you can run HPA and KEDA side by side: HPA scales the latency-sensitive inference service while KEDA ScaledObjects drive the batch workers. One caveat: KEDA creates and manages its own HPA for every ScaledObject, so never point a ScaledObject and a hand-written HPA at the same Deployment. Here, the real-time service is scaled by HPA on an external requests-per-second metric:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hybrid-ml-service
  namespace: ml-workloads
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hybrid-ml
  template:
    metadata:
      labels:
        app: hybrid-ml
    spec:
      containers:
      - name: ml-service
        image: your-registry/hybrid-ml:v2.0
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8000
          name: metrics
        resources:
          requests:
            memory: "8Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "16Gi"
            cpu: "8000m"
            nvidia.com/gpu: "1"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hybrid-ml-hpa
  namespace: ml-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hybrid-ml-service
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: External
    external:
      metric:
        name: requests_per_second
        selector:
          matchLabels:
            service: hybrid-ml
      target:
        type: AverageValue
        averageValue: "100"

Monitoring and Troubleshooting

Essential Monitoring Commands

# Check HPA status
kubectl get hpa -n ml-workloads
kubectl describe hpa ml-inference-hpa -n ml-workloads

# View HPA events
kubectl get events -n ml-workloads --field-selector involvedObject.name=ml-inference-hpa

# Check KEDA scaled objects
kubectl get scaledobjects -n ml-workloads
kubectl describe scaledobject ml-batch-processor-scaler -n ml-workloads

# View KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100

# Check current metrics
kubectl top pods -n ml-workloads
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/ml-workloads/pods" | jq .

Common Issues and Solutions

Issue: HPA shows “unknown” for custom metrics

Solution: Verify that your Prometheus Adapter is correctly configured and the metrics are being scraped:

# Check if custom metrics API is available
kubectl get apiservices | grep custom.metrics

# Test custom metrics endpoint
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/ml-workloads/pods/*/inference_queue_depth" | jq .

Issue: KEDA not scaling pods

Solution: Check trigger authentication and polling intervals:

# Verify KEDA can reach external metrics source
kubectl logs -n keda deployment/keda-operator | grep -i error

# Check ScaledObject status
kubectl get scaledobject ml-batch-processor-scaler -n ml-workloads -o yaml | grep -A 10 status

Issue: Pods scaling too aggressively

Solution: Adjust stabilization windows and cooldown periods:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Increase to 10 minutes
    policies:
    - type: Percent
      value: 25  # Scale down more gradually
      periodSeconds: 120
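The stabilization window works by remembering recent recommendations: for scale-down, HPA applies the *maximum* recommendation seen over the window, so a brief dip in load never removes pods. A sketch of that mechanism (window length in sync periods is an approximation of stabilizationWindowSeconds divided by the controller's sync interval):

```python
from collections import deque

class ScaleDownStabilizer:
    """Keeps the last N desired-replica recommendations (one per sync
    period) and scales down only to the highest of them, mimicking
    HPA's scale-down stabilization window."""

    def __init__(self, window_periods):
        self.history = deque(maxlen=window_periods)

    def recommend(self, desired):
        self.history.append(desired)
        return max(self.history)

# 600s window / 15s sync period = 40 remembered recommendations
stab = ScaleDownStabilizer(window_periods=40)
for desired in [8, 8, 3, 3, 8]:   # brief dip to 3 replicas
    final = stab.recommend(desired)
print(final)  # 8 - the dip never triggered a scale-down
```

Lengthening the window therefore trades slower cost recovery for protection against thrashing, which is usually the right trade for workloads with minute-long cold starts.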

Best Practices for Production ML Autoscaling

  • Set appropriate resource requests: Ensure CPU and memory requests accurately reflect your model’s requirements to avoid scheduling issues
  • Implement proper health checks: Use readiness probes with adequate initialDelaySeconds to account for model loading time
  • Use PodDisruptionBudgets: Prevent excessive pod terminations during scale-down events
  • Monitor cold start latency: Track the time from pod creation to first successful inference
  • Implement request queuing: Buffer incoming requests to smooth out traffic spikes
  • Use node affinity for GPU workloads: Ensure pods are scheduled on appropriate GPU-enabled nodes
  • Set conservative scale-down policies: ML workloads benefit from longer cooldown periods due to cold start costs
  • Test scaling behavior: Use load testing tools to validate autoscaling configuration before production
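The request-queuing practice above can be as simple as a bounded buffer in front of the model: requests queue during spikes while autoscaling catches up, and excess load is shed once the buffer is full. A minimal sketch (the buffer size and shed behavior are illustrative choices):

```python
import queue

class RequestBuffer:
    """Bounded buffer in front of an inference worker: absorbs short
    spikes, and sheds requests once full so the service degrades
    predictably instead of running out of memory."""

    def __init__(self, max_pending=100):
        self.pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        try:
            self.pending.put_nowait(request)
            return True   # accepted: served when a worker frees up
        except queue.Full:
            return False  # shed: caller should return HTTP 503 / retry later

buf = RequestBuffer(max_pending=2)
print([buf.submit(r) for r in ("a", "b", "c")])  # [True, True, False]
```

Exposing this queue's depth as the inference_queue_depth gauge from earlier closes the loop: the same buffer that smooths spikes also drives the scale-out signal.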

Performance Optimization Tips

Create a PodDisruptionBudget to maintain service availability during scaling:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-inference-pdb
  namespace: ml-workloads
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ml-inference

Implement a warm pool strategy using node affinity:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - g4dn.xlarge
          - p3.2xlarge
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload-type
          operator: In
          values:
          - ml-inference

Conclusion

Autoscaling AI and ML workloads on Kubernetes requires a nuanced approach that goes beyond traditional CPU-based metrics. By combining HPA’s native Kubernetes integration with KEDA’s event-driven capabilities, you can build robust, cost-effective scaling solutions that handle both real-time inference and batch processing workloads.

The key to success lies in understanding your workload characteristics, implementing appropriate metrics, and fine-tuning scaling behaviors through stabilization windows and cooldown periods. Start with conservative settings, monitor closely, and iterate based on observed behavior in your production environment.

Remember that autoscaling is not a set-and-forget solution—continuous monitoring and adjustment based on actual usage patterns will ensure optimal performance and cost efficiency for your ML applications.

Have Queries? Join https://launchpass.com/collabnix
