
From Prototype to Production: Scaling LLM Applications in Kubernetes

5 min read

Large Language Models (LLMs) have revolutionized AI applications, but moving from a prototype running on your laptop to a production-grade system serving thousands of users presents unique challenges. This comprehensive guide walks you through the architectural decisions, infrastructure setup, and best practices for scaling LLM applications in production environments.

Understanding the Scaling Challenges of LLM Applications

Unlike traditional microservices, LLM applications demand significant computational resources, have unpredictable latency patterns, and require careful memory management. Serving a single model replica can consume 10-80GB of GPU memory depending on model size and context length, making efficient resource utilization critical for cost-effective scaling.
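
For intuition on where those numbers come from, here is a back-of-envelope sketch using the published Llama-2-13B dimensions (40 layers, hidden size 5120); substitute your own model’s figures:

# Rough GPU memory estimate for serving Llama-2-13B in FP16
params = 13e9
bytes_per_param = 2                                   # FP16
weights_gb = params * bytes_per_param / 1e9           # ~26 GB of weights

layers, hidden = 40, 5120                             # Llama-2-13B dimensions
kv_bytes_per_token = 2 * layers * hidden * bytes_per_param  # K and V caches
seq_len, concurrent_seqs = 4096, 8
kv_cache_gb = kv_bytes_per_token * seq_len * concurrent_seqs / 1e9  # ~27 GB

print(f"weights ~= {weights_gb:.0f} GB, KV cache ~= {kv_cache_gb:.0f} GB")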

The key challenges include:

  • GPU Resource Management: Efficiently allocating expensive GPU resources across multiple requests
  • Model Loading Time: Large models can take 30-120 seconds to load into memory
  • Request Batching: Optimizing throughput while maintaining acceptable latency
  • Cost Optimization: GPU instances cost 10-50x more than standard compute
  • Version Management: Handling multiple model versions and A/B testing

Architecture Patterns for Production LLM Systems

1. The Inference Gateway Pattern

Implement a dedicated gateway service that handles request routing, queue management, and response streaming. This pattern decouples your application logic from the inference infrastructure, enabling independent scaling.

from fastapi import FastAPI
from pydantic import BaseModel
import asyncio
import json
import uuid
from typing import Optional

import aioredis  # aioredis 2.x; newer projects can use redis.asyncio instead

app = FastAPI()
redis = aioredis.from_url("redis://redis-service:6379")

class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    model_version: Optional[str] = "v1"

class RequestQueue:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.queue_key = "llm:inference:queue"
    
    async def enqueue(self, request_id: str, payload: dict):
        await self.redis.lpush(self.queue_key, 
                              f"{request_id}:{json.dumps(payload)}")
        return request_id
    
    async def get_result(self, request_id: str, timeout: int = 300):
        result_key = f"llm:result:{request_id}"
        for _ in range(timeout):
            result = await self.redis.get(result_key)
            if result:
                await self.redis.delete(result_key)
                return json.loads(result)
            await asyncio.sleep(1)
        raise TimeoutError("Inference timeout")

@app.post("/v1/inference")
async def create_inference(request: InferenceRequest):
    request_id = str(uuid.uuid4())
    queue = RequestQueue(redis)
    
    await queue.enqueue(request_id, request.dict())
    result = await queue.get_result(request_id)
    
    return {"request_id": request_id, "result": result}
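
The gateway only enqueues work; a separate worker pool drains the queue and writes results back. Below is a minimal sketch of the consumer side, assuming the same Redis keys as above (run_model is a placeholder for your actual inference call):

# worker.py -- consumer side of the gateway queue (sketch)
import asyncio
import json

import aioredis

redis = aioredis.from_url("redis://redis-service:6379")

async def run_model(payload: dict) -> dict:
    raise NotImplementedError  # call your inference backend here

async def worker_loop():
    while True:
        item = await redis.brpop("llm:inference:queue", timeout=5)
        if item is None:
            continue
        _, raw = item
        # request_id is a UUID, so the first ':' safely splits id from payload
        request_id, _, payload = raw.decode().partition(":")
        result = await run_model(json.loads(payload))
        # Write where the gateway's get_result() polls; expire stale results
        await redis.set(f"llm:result:{request_id}", json.dumps(result), ex=600)

if __name__ == "__main__":
    asyncio.run(worker_loop())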

2. Model Serving with vLLM

vLLM provides optimized inference with PagedAttention, continuous batching, and efficient memory management, making it a strong default choice for production LLM deployments.

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.10 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN pip3 install vllm==0.2.7 fastapi uvicorn prometheus-client

WORKDIR /app
COPY serve.py .

ENV MODEL_NAME="meta-llama/Llama-2-7b-chat-hf"
ENV TENSOR_PARALLEL_SIZE=1
ENV MAX_MODEL_LEN=4096

CMD ["python3", "serve.py"]

# serve.py
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from fastapi import FastAPI
import os
import uuid

app = FastAPI()

engine_args = AsyncEngineArgs(
    model=os.getenv("MODEL_NAME"),
    tensor_parallel_size=int(os.getenv("TENSOR_PARALLEL_SIZE", 1)),
    max_model_len=int(os.getenv("MAX_MODEL_LEN", 4096)),
    gpu_memory_utilization=0.95,
    enforce_eager=False
)

engine = AsyncLLMEngine.from_engine_args(engine_args)
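
# Health endpoints backing the Kubernetes liveness/readiness probes defined
# in the deployment below (a sketch; vLLM's bundled OpenAI-compatible server
# exposes a similar /health route)
@app.get("/health")
async def health():
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    # The engine is constructed at import time, which loads model weights,
    # so serving this route at all is a reasonable readiness signal here
    return {"status": "ready"}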

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 512):
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=max_tokens
    )
    
    request_id = f"req-{uuid.uuid4()}"
    results_generator = engine.generate(prompt, sampling_params, request_id)
    
    final_output = None
    async for request_output in results_generator:
        final_output = request_output
    
    return {"text": final_output.outputs[0].text}

if __name__ == "__main__":
    import uvicorn
    # Serve on the port the Kubernetes probes and Service target
    uvicorn.run(app, host="0.0.0.0", port=8000)
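
A quick way to exercise the endpoint once the pod is port-forwarded to localhost (the prompt and token budget are arbitrary):

import requests

# /generate takes its arguments as query parameters in the sketch above
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Explain PagedAttention in one sentence.", "max_tokens": 64},
)
print(resp.json()["text"])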

Kubernetes Deployment Configuration

GPU Node Pool Setup

Create dedicated node pools for GPU workloads with appropriate taints and labels to prevent non-GPU workloads from consuming these expensive resources.

apiVersion: v1
kind: Namespace
metadata:
  name: llm-inference
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: llm-inference
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    limits.nvidia.com/gpu: "8"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-server
  namespace: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm-server
        image: your-registry/vllm-server:latest
        resources:
          requests:
            memory: "40Gi"
            cpu: "8"
            nvidia.com/gpu: "1"
          limits:
            memory: "80Gi"
            cpu: "16"
            nvidia.com/gpu: "1"
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-2-13b-chat-hf"
        - name: TENSOR_PARALLEL_SIZE
          value: "1"
        - name: MAX_MODEL_LEN
          value: "4096"
        ports:
        - containerPort: 8000
          name: http
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 300
          periodSeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 300
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-service
  namespace: llm-inference
spec:
  selector:
    app: llm-inference
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: ClusterIP

Horizontal Pod Autoscaling with Custom Metrics

Standard CPU/memory-based autoscaling doesn’t work well for LLM workloads, where a pod can be saturated while its CPU sits nearly idle. Scale instead on custom metrics such as queue depth and GPU utilization.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: "10"
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
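
Note that Pods metrics like queue_depth don’t exist out of the box: each pod has to export them (typically to Prometheus, surfaced to the HPA through an adapter such as prometheus-adapter). A minimal exporter sketch, assuming an in-process queue like the batcher in the next section:

import asyncio
from prometheus_client import Gauge, start_http_server

queue_depth = Gauge("queue_depth", "Requests waiting for inference")

async def report_queue_depth(queue, interval: float = 5.0) -> None:
    start_http_server(9090)  # metrics endpoint on :9090 (port is arbitrary)
    while True:
        queue_depth.set(len(queue))  # works for any sized container, e.g. a deque
        await asyncio.sleep(interval)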

Implementing Request Batching and Queue Management

Dynamic batching significantly improves throughput. Implement a smart batching system that balances latency and throughput based on current load.

import asyncio
from collections import deque
from dataclasses import dataclass
from typing import List
import time

@dataclass
class BatchConfig:
    max_batch_size: int = 32
    max_wait_time: float = 0.1  # seconds
    min_batch_size: int = 1

class DynamicBatcher:
    def __init__(self, config: BatchConfig):
        self.config = config
        self.queue = deque()
        self.lock = asyncio.Lock()
        
    async def add_request(self, request):
        # Enqueue under the lock, but await the result after releasing it;
        # holding the lock while waiting would deadlock the batch loop
        future = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.queue.append((request, future, time.time()))
        return await future
    
    async def process_batches(self, inference_fn):
        while True:
            batch = await self._collect_batch()
            if batch:
                requests, futures, _ = zip(*batch)
                try:
                    results = await inference_fn(list(requests))
                    for future, result in zip(futures, results):
                        future.set_result(result)
                except Exception as e:
                    for future in futures:
                        future.set_exception(e)
            await asyncio.sleep(0.01)
    
    async def _collect_batch(self):
        start_time = time.time()
        batch = []
        
        # Drain the queue until the batch fills or the wait window closes
        while len(batch) < self.config.max_batch_size:
            async with self.lock:
                while self.queue and len(batch) < self.config.max_batch_size:
                    batch.append(self.queue.popleft())
            if (time.time() - start_time) > self.config.max_wait_time:
                break
            await asyncio.sleep(0.001)  # yield without holding the lock
        
        return batch if len(batch) >= self.config.min_batch_size else None
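
Wiring this into the gateway is straightforward: start the batch loop once at startup and route handlers through add_request. A sketch reusing the FastAPI app from the gateway section, with a stand-in echo function in place of a real batched model call:

batcher = DynamicBatcher(BatchConfig())

async def echo_inference(prompts: List[str]) -> List[str]:
    return [f"echo: {p}" for p in prompts]  # stand-in for batched inference

@app.on_event("startup")
async def start_batcher():
    # Run the batch loop for the lifetime of the process
    asyncio.create_task(batcher.process_batches(echo_inference))

@app.post("/batched-generate")
async def batched_generate(prompt: str):
    return {"text": await batcher.add_request(prompt)}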

Monitoring and Observability

Comprehensive monitoring is essential for production LLM systems. Track these critical metrics:

import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Expose a /metrics endpoint for Prometheus to scrape
start_http_server(9090)

# Define metrics
inference_requests = Counter('llm_inference_requests_total', 
                            'Total inference requests', 
                            ['model_version', 'status'])

inference_latency = Histogram('llm_inference_latency_seconds',
                             'Inference latency in seconds',
                             ['model_version'],
                             buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0])

queue_depth = Gauge('llm_queue_depth', 
                   'Current queue depth')

gpu_memory_used = Gauge('llm_gpu_memory_bytes',
                       'GPU memory used',
                       ['gpu_id'])

tokens_generated = Counter('llm_tokens_generated_total',
                          'Total tokens generated',
                          ['model_version'])

# Usage in your inference function (perform_inference is your model call)
async def monitored_inference(request):
    start_time = time.time()
    try:
        result = await perform_inference(request)
        inference_requests.labels(
            model_version=request.model_version,
            status='success'
        ).inc()
        tokens_generated.labels(
            model_version=request.model_version
        ).inc(len(result.tokens))
        return result
    except Exception as e:
        inference_requests.labels(
            model_version=request.model_version,
            status='error'
        ).inc()
        raise
    finally:
        duration = time.time() - start_time
        inference_latency.labels(
            model_version=request.model_version
        ).observe(duration)

Cost Optimization Strategies

1. Model Quantization

Reduce memory footprint and increase throughput with quantization. 4-bit GPTQ quantization can shrink weight memory roughly 4x versus FP16 (INT8 gives roughly 2x) with modest accuracy loss. The sketch below uses the AutoGPTQ Python API; exact class names vary between releases, so treat it as a starting point:

# quantize_gptq.py -- a minimal GPTQ quantization sketch
# Install first: pip install auto-gptq optimum transformers
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name = "meta-llama/Llama-2-13b-chat-hf"

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,
    desc_act=True,   # activation-order quantization, better accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Calibration data -- use a few hundred representative prompts in practice
examples = [tokenizer("Kubernetes schedules containers across a cluster.")]
model.quantize(examples)
model.save_quantized("./llama-2-13b-gptq")

2. Spot Instance Management

Use spot instances for non-critical workloads with proper graceful degradation.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-spot
  namespace: llm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
      tier: spot
  template:
    metadata:
      labels:
        app: llm-inference
        tier: spot
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      tolerations:
      - key: cloud.google.com/gke-spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      terminationGracePeriodSeconds: 60
      containers:
      - name: vllm-server
        image: your-registry/vllm-server:latest
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 30"]

Troubleshooting Common Production Issues

Issue 1: OOM (Out of Memory) Errors

Symptoms: Pods crash with exit code 137 (OOMKilled) or CUDA out-of-memory errors

Solutions:

# Check GPU memory usage
kubectl exec -it <pod-name> -n llm-inference -- nvidia-smi

# Reduce max_model_len to shrink the KV cache
kubectl set env deployment/llm-inference-server -n llm-inference MAX_MODEL_LEN=2048

# Surface CUDA errors synchronously for easier debugging
kubectl set env deployment/llm-inference-server -n llm-inference CUDA_LAUNCH_BLOCKING=1

Issue 2: High Latency Spikes

Symptoms: P99 latency 10x higher than P50

Solutions:

  • Implement request timeouts and circuit breakers (see the sketch after this list)
  • Use separate deployments for different SLAs
  • Enable continuous batching in vLLM
  • Pre-warm model instances during scale-up
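
A minimal sketch of the first item, wrapping an inference coroutine in an asyncio timeout (the 30-second budget is illustrative):

import asyncio
from fastapi import HTTPException

async def generate_with_timeout(inference_coro, timeout_s: float = 30.0):
    # Fail fast instead of letting one slow request back up the whole queue
    try:
        return await asyncio.wait_for(inference_coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise HTTPException(status_code=504, detail="inference timed out")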

Issue 3: Cold Start Problems

Symptoms: First request takes 60+ seconds

Solutions:

# Add init container for model pre-loading
initContainers:
- name: model-downloader
  image: your-registry/model-downloader:latest
  volumeMounts:
  - name: model-cache
    mountPath: /models
  env:
  - name: HF_HOME
    value: /models
volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: model-cache-pvc

Best Practices Checklist

  • Resource Management: Set appropriate resource requests/limits based on model size
  • Health Checks: Implement proper liveness and readiness probes with adequate timeouts
  • Graceful Shutdown: Use terminationGracePeriodSeconds to finish in-flight requests
  • Model Versioning: Use semantic versioning and blue-green deployments
  • Security: Implement authentication, rate limiting, and input validation
  • Caching: Cache frequent prompts and responses to reduce compute costs (see the sketch after this list)
  • Load Testing: Test with realistic workloads before production deployment
  • Cost Monitoring: Track GPU utilization and cost per inference
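
A minimal sketch of the caching item, reusing the Redis client and RequestQueue from the gateway pattern (the key scheme and one-hour TTL are illustrative):

import hashlib
import json
import uuid

async def cached_inference(request: InferenceRequest, queue: RequestQueue):
    # Key on the full request so different sampling settings don't collide
    key_material = json.dumps(request.dict(), sort_keys=True)
    cache_key = "llm:cache:" + hashlib.sha256(key_material.encode()).hexdigest()

    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    request_id = str(uuid.uuid4())
    await queue.enqueue(request_id, request.dict())
    result = await queue.get_result(request_id)
    await redis.set(cache_key, json.dumps(result), ex=3600)  # 1-hour TTL
    return result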

Conclusion

Scaling LLM applications from prototype to production requires careful consideration of infrastructure, cost optimization, and operational excellence. By implementing the patterns and practices outlined in this guide—including proper Kubernetes configuration, dynamic batching, comprehensive monitoring, and cost optimization strategies—you can build robust, scalable LLM systems that deliver consistent performance at reasonable costs.

Start with a single GPU deployment, implement proper monitoring, and scale incrementally based on actual usage patterns. Remember that LLM infrastructure is rapidly evolving, so stay updated with the latest optimization techniques and tooling from the community.

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.