Kubernetes Autoscaling for LLM Inference: Complete Guide (2024)

Large Language Model (LLM) inference workloads present unique challenges for Kubernetes autoscaling. Unlike traditional microservices, LLM deployments require GPU resources, have unpredictable latency patterns, and consume significant memory. This comprehensive guide explores production-ready autoscaling strategies specifically designed for LLM inference on Kubernetes.

Understanding LLM Inference Characteristics

Before implementing autoscaling, it’s crucial to understand what makes LLM workloads different:

  • GPU Dependency: Most LLMs require GPU acceleration, making pod scheduling more complex
  • Long Startup Times: Loading multi-gigabyte models can take 30-120 seconds
  • Variable Request Duration: Inference time varies dramatically based on prompt length and generation parameters
  • Memory Intensive: Models like Llama-2-70B require 140GB+ of GPU memory
  • Batch Processing Benefits: Throughput improves significantly with request batching

Autoscaling Strategies for LLM Workloads

1. Horizontal Pod Autoscaler (HPA) with Custom Metrics

The standard HPA can work for LLM inference when configured with appropriate metrics. CPU-based scaling is insufficient; instead, focus on GPU utilization and queue depth.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-workloads
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "75"
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max

Key configuration considerations:

  • stabilizationWindowSeconds: Set higher for scale-down (300s) to prevent thrashing during traffic fluctuations
  • scaleUp policies: Aggressive scaling up (100% or 2 pods) to handle sudden traffic spikes
  • Custom metrics: GPU utilization and queue depth provide better signals than CPU/memory alone; they must be exposed through the custom metrics API (see the adapter sketch below)
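
The Pods metrics referenced above (gpu_utilization, inference_queue_depth) are only visible to the HPA if something serves them through the custom metrics API, most commonly prometheus-adapter. The rule below is a minimal sketch for the queue-depth metric exported later in this guide, placed in the adapter's config.yaml (typically mounted from a ConfigMap); the label names and the rename to inference_queue_depth are assumptions about your setup, not a fixed convention.

rules:
- seriesQuery: 'llm_inference_queue_depth{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "llm_inference_queue_depth"
    as: "inference_queue_depth"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

A similar rule can map your GPU exporter's utilization series (for example DCGM's DCGM_FI_DEV_GPU_UTIL) to the gpu_utilization metric used in the HPA above.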

2. KEDA (Kubernetes Event-Driven Autoscaling)

KEDA excels at scaling LLM workloads based on external metrics like message queues, which is ideal for asynchronous inference patterns.

# Install KEDA
kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.12.0/keda-2.12.0.yaml

# Verify installation
kubectl get pods -n keda

Here’s a KEDA ScaledObject configuration for Redis-backed inference queues:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
  namespace: ml-workloads
spec:
  scaleTargetRef:
    name: llm-inference
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 1
  maxReplicaCount: 20
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 25
            periodSeconds: 60
  triggers:
  - type: redis
    metadata:
      addressFromEnv: REDIS_HOST
      listName: inference_queue
      listLength: "5"
      activationListLength: "1"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: gpu_memory_utilization
      threshold: "80"
      query: avg(nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100)

3. Vertical Pod Autoscaler (VPA) for Resource Optimization

VPA helps right-size resource requests for LLM pods, which is particularly important when running different model sizes. Note that VPA only adjusts CPU and memory requests; it cannot resize GPU allocations, and running it in Auto mode alongside an HPA that scales on memory utilization can produce conflicting decisions, so validate the combination carefully before enabling it in production.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-inference-vpa
  namespace: ml-workloads
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: inference-server
      minAllowed:
        memory: "16Gi"
      maxAllowed:
        memory: "80Gi"
      controlledResources:
      - memory
      mode: Auto

Production Deployment Configuration

LLM Inference Deployment with Autoscaling

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-workloads
  labels:
    app: llm-inference
    model: llama-2-13b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8001"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: inference-server
        image: your-registry/llm-inference:v1.0
        resources:
          requests:
            memory: "24Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: "1"
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-2-13b-chat-hf"
        - name: MAX_BATCH_SIZE
          value: "8"
        - name: MAX_CONCURRENT_REQUESTS
          value: "32"
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: metrics
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 30"]

Monitoring and Custom Metrics

Exposing LLM-Specific Metrics

Implement custom metrics exporter in your inference server:

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import functools
import time

# Define metrics
inference_requests_total = Counter(
    'llm_inference_requests_total',
    'Total number of inference requests',
    ['model', 'status']
)

inference_duration_seconds = Histogram(
    'llm_inference_duration_seconds',
    'Inference request duration in seconds',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]
)

queue_depth = Gauge(
    'llm_inference_queue_depth',
    'Number of requests waiting in queue'
)

gpu_memory_used = Gauge(
    'llm_gpu_memory_used_bytes',
    'GPU memory used by model'
)

active_requests = Gauge(
    'llm_active_requests',
    'Number of currently processing requests'
)

def track_inference(model_name):
    """Decorator to track inference metrics"""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            active_requests.inc()
            try:
                result = func(*args, **kwargs)
                inference_requests_total.labels(
                    model=model_name,
                    status='success'
                ).inc()
                return result
            except Exception:
                inference_requests_total.labels(
                    model=model_name,
                    status='error'
                ).inc()
                raise
            finally:
                duration = time.time() - start_time
                inference_duration_seconds.labels(
                    model=model_name
                ).observe(duration)
                active_requests.dec()
        return wrapper
    return decorator
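
# Example usage (illustrative): wrap your generation handler with the decorator
#   @track_inference("llama-2-13b")
#   def generate(prompt, **params):
#       ...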

# Start metrics server
start_http_server(8001)

Prometheus ServiceMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-inference-metrics
  namespace: ml-workloads
spec:
  selector:
    matchLabels:
      app: llm-inference
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
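
Note that a ServiceMonitor selects Services, not pods, so a Service exposing the metrics port must exist. A minimal sketch, assuming the port names from the Deployment above (the Service name is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: llm-inference
  namespace: ml-workloads
  labels:
    app: llm-inference
spec:
  selector:
    app: llm-inference
  ports:
  - name: http
    port: 8000
    targetPort: http
  - name: metrics
    port: 8001
    targetPort: metrics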

Advanced Scaling Patterns

Multi-Model Serving with Selective Scaling

For environments serving multiple models, implement model-specific autoscaling. In the example below, the 13B model always keeps one replica warm, while the 70B model scales to zero when idle (minReplicaCount: 0), trading cold-start latency for GPU cost:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-13b-scaler
spec:
  scaleTargetRef:
    name: llama-13b-inference
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: model_request_rate
      threshold: "10"
      query: |
        sum(rate(llm_inference_requests_total{
          model="llama-13b",
          status="success"
        }[2m]))
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-70b-scaler
spec:
  scaleTargetRef:
    name: llama-70b-inference
  minReplicaCount: 0
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: model_request_rate
      threshold: "5"
      query: |
        sum(rate(llm_inference_requests_total{
          model="llama-70b",
          status="success"
        }[2m]))

Troubleshooting Common Issues

Issue 1: Pods Scaling Too Slowly

Symptoms: Request queues building up, increased latency during traffic spikes

# Check HPA status
kubectl describe hpa llm-inference-hpa -n ml-workloads

# Verify metrics are being collected
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .

# Check for resource constraints
kubectl describe nodes | grep -A 5 "Allocated resources"

Solutions:

  • Reduce stabilizationWindowSeconds for scale-up
  • Increase scaleUp policy percentages
  • Pre-warm pods using minReplicas
  • Implement pod priority classes for LLM workloads (see the sketch below)
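
For the priority-class suggestion, a minimal sketch (the class name and value are illustrative, not from the original guide):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: llm-inference-critical
value: 1000000
globalDefault: false
description: "Schedule LLM inference pods ahead of lower-priority batch workloads"

Reference it from the pod template with priorityClassName: llm-inference-critical so new inference pods can preempt lower-priority workloads when GPU capacity is tight.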

Issue 2: GPU Node Unavailability

Symptoms: Pods stuck in Pending state despite autoscaling triggers

# Check pending pods
kubectl get pods -n ml-workloads | grep Pending

# Describe pending pod for details
kubectl describe pod <pod-name> -n ml-workloads

# Check cluster autoscaler logs
kubectl logs -n kube-system -l app=cluster-autoscaler

Solutions:

  • Configure Cluster Autoscaler with GPU node pools
  • Set appropriate maxReplicas based on available GPU nodes
  • Implement pod topology spread constraints
  • Use node affinity for GPU node selection (see the sketch below)
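
As a sketch of the node-affinity approach, the pod template could replace the nodeSelector from the earlier Deployment with something like the following; the GKE accelerator label is taken from that example, and the second GPU type is an illustrative assumption:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: In
                values:
                - nvidia-tesla-t4
                - nvidia-l4   # assumed additional GPU type; adjust to your node pools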

Issue 3: Thrashing (Rapid Scale Up/Down)

Symptoms: Frequent pod creation and termination, unstable replica counts

Solutions:

  • Increase stabilizationWindowSeconds for both scale-up and scale-down
  • Adjust metric thresholds to provide more buffer
  • Implement longer cooldownPeriod in KEDA
  • Use multiple metrics with different sensitivities

Best Practices for Production

1. Implement Graceful Shutdown

Ensure in-flight requests complete before pod termination, and set the pod's terminationGracePeriodSeconds above the drain window below (for example 90 seconds) so the kubelet does not kill the container mid-drain:

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - |
        # Stop accepting new requests
        touch /tmp/shutdown
        # Wait for existing requests to complete (max 60s)
        for i in $(seq 1 60); do
          if [ $(curl -s localhost:8000/active_requests) -eq 0 ]; then
            exit 0
          fi
          sleep 1
        done

2. Set Appropriate Resource Limits

# Monitor actual resource usage
kubectl top pods -n ml-workloads --containers

# Use VPA recommendations
kubectl get vpa llm-inference-vpa -n ml-workloads -o jsonpath='{.status.recommendation}'

3. Configure Pod Disruption Budgets

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-inference-pdb
  namespace: ml-workloads
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-inference

4. Optimize Model Loading

  • Use persistent volumes for model caching
  • Implement model warm-up in readiness probes
  • Consider init containers for model pre-loading (see the sketch after this list)
  • Use faster storage classes (SSD-backed PVs)
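
A sketch of the caching and pre-loading approach: mount an SSD-backed PersistentVolumeClaim into both an init container and the inference server. The PVC name, helper image, and download command below are placeholders, not part of the original guide:

    spec:
      initContainers:
      - name: model-fetch
        image: your-registry/model-fetcher:v1.0   # placeholder helper image
        command: ["/bin/sh", "-c", "download-model meta-llama/Llama-2-13b-chat-hf /models"]   # placeholder
        volumeMounts:
        - name: model-cache
          mountPath: /models
      containers:
      - name: inference-server
        # ...existing container spec...
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache   # assumed PVC on an SSD-backed storage class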

Performance Benchmarking

Test your autoscaling configuration under load:

# Install load testing tool
kubectl apply -f https://raw.githubusercontent.com/grafana/k6-operator/main/bundle.yaml

# Package your k6 script (test.js in the current directory) into the ConfigMap referenced below
kubectl create configmap llm-load-test-script --from-file=test.js

# Create load test
cat <<EOF | kubectl apply -f -
apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: llm-load-test
spec:
  parallelism: 4
  script:
    configMap:
      name: llm-load-test-script
      file: test.js
EOF

# Monitor scaling behavior
watch kubectl get hpa,pods -n ml-workloads

Conclusion

Autoscaling LLM inference workloads on Kubernetes requires careful consideration of GPU resources, model characteristics, and traffic patterns. By implementing the strategies outlined in this guide—combining HPA, KEDA, and VPA with custom metrics and proper monitoring—you can build a cost-effective, responsive inference infrastructure that handles variable demand efficiently.

Key takeaways:

  • Use custom metrics (GPU utilization, queue depth) instead of CPU/memory alone
  • Configure aggressive scale-up with conservative scale-down policies
  • Implement comprehensive monitoring with LLM-specific metrics
  • Test thoroughly under realistic load conditions
  • Plan for GPU node availability and cluster autoscaling

Start with conservative settings and iterate based on observed behavior. Monitor costs alongside performance metrics, and continuously optimize your configuration as traffic patterns evolve.

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.