
Serverless AI Deployment for Scalable LLM Inference

5 min read

Unlocking Serverless AI Deployment for LLM Inference

Large Language Models (LLMs) are transforming how we build intelligent applications, but deploying them efficiently remains a significant challenge. Traditional deployment approaches often lead to resource waste, complex scaling configurations, and high operational costs. Enter Knative—a Kubernetes-based platform that brings true serverless capabilities to LLM inference workloads.

In this comprehensive guide, we’ll explore how to deploy LLM inference services using Knative, achieving automatic scaling, efficient resource utilization, and production-grade reliability.

Why Knative for LLM Inference?

Before diving into implementation, let’s understand why Knative is particularly well-suited for LLM workloads:

  • Scale-to-Zero: LLM models consume significant GPU/CPU resources. Knative can scale your inference service to zero when idle, dramatically reducing costs.
  • Automatic Scaling: Handle traffic spikes seamlessly with built-in autoscaling based on concurrency, RPS, or custom metrics.
  • Traffic Splitting: Deploy multiple model versions simultaneously and gradually shift traffic for A/B testing or canary deployments (see the example after this list).
  • Built-in Observability: Native integration with Prometheus, Grafana, and distributed tracing for monitoring model performance.
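
As an example of the traffic-splitting point above, the traffic block of a Knative Service pins percentages to named revisions; the revision names here are hypothetical placeholders for two versions of an inference service:

# In the Knative Service spec: send 90% of traffic to the stable revision, 10% to a canary
traffic:
- revisionName: llm-inference-00001  # hypothetical stable revision
  percent: 90
- revisionName: llm-inference-00002  # hypothetical canary revision
  percent: 10
  tag: canary                        # also exposed at a dedicated tagged URL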

Prerequisites and Environment Setup

To follow this tutorial, you’ll need:

  • A Kubernetes cluster (v1.24+) with at least 16 GB of RAM; GPU nodes are optional but recommended
  • kubectl configured to access your cluster
  • Knative Serving (v1.8+; installing v1.12.0 is covered below)
  • Docker or Podman for building container images

Installing Knative Serving

First, install Knative Serving components:

# Install Knative Serving CRDs
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.12.0/serving-crds.yaml

# Install Knative Serving core components
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.12.0/serving-core.yaml

# Install Kourier as the networking layer
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.12.0/kourier.yaml

# Configure Knative to use Kourier
kubectl patch configmap/config-network \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'

Verify the installation:

kubectl get pods -n knative-serving
kubectl get pods -n kourier-system

Building an LLM Inference Container

We’ll create a containerized inference service using FastAPI and the Hugging Face Transformers library. This example uses a smaller model (DistilGPT-2) for demonstration, but the approach scales to larger models like Llama, Mistral, or GPT-J.

Creating the Inference Application

Create a file named inference_server.py:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="LLM Inference Service")

# Model configuration
MODEL_NAME = os.getenv("MODEL_NAME", "distilgpt2")
MAX_LENGTH = int(os.getenv("MAX_LENGTH", "100"))
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Global model and tokenizer
model = None
tokenizer = None

class InferenceRequest(BaseModel):
    prompt: str
    max_length: int = 50
    temperature: float = 0.7

class InferenceResponse(BaseModel):
    generated_text: str
    model: str

@app.on_event("startup")
async def load_model():
    global model, tokenizer
    logger.info(f"Loading model: {MODEL_NAME} on {DEVICE}")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.to(DEVICE)
    model.eval()
    logger.info("Model loaded successfully")

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": MODEL_NAME, "device": DEVICE}

@app.post("/generate", response_model=InferenceResponse)
async def generate_text(request: InferenceRequest):
    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(DEVICE)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=min(request.max_length, MAX_LENGTH),
                temperature=request.temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        return InferenceResponse(
            generated_text=generated_text,
            model=MODEL_NAME
        )
    except Exception as e:
        logger.error(f"Inference error: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
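
Before containerizing, you can sanity-check the server locally; this sketch assumes Python 3.10+ and the same packages pinned in the Dockerfile in the next section:

pip install fastapi "uvicorn[standard]" transformers torch pydantic
python inference_server.py

# In a second terminal:
curl http://localhost:8080/health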

Creating the Dockerfile

Create a Dockerfile optimized for fast startup times:

FROM python:3.10-slim

WORKDIR /app

# Install dependencies
RUN pip install --no-cache-dir \
    fastapi==0.104.1 \
    uvicorn[standard]==0.24.0 \
    transformers==4.35.0 \
    torch==2.1.0 \
    pydantic==2.5.0

# Pre-download model during build to reduce startup time
ENV MODEL_NAME=distilgpt2
RUN python -c "from transformers import AutoTokenizer, AutoModelForCausalLM; \
    AutoTokenizer.from_pretrained('${MODEL_NAME}'); \
    AutoModelForCausalLM.from_pretrained('${MODEL_NAME}')"

COPY inference_server.py .

EXPOSE 8080

CMD ["python", "inference_server.py"]

Build and push the container:

# Build the image
docker build -t your-registry/llm-inference:v1 .

# Push to your container registry
docker push your-registry/llm-inference:v1
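
Before pushing, you can smoke-test the image locally (the registry name is a placeholder for your own):

# Run the container locally and exercise the generate endpoint
docker run --rm -p 8080:8080 your-registry/llm-inference:v1

# In a second terminal:
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Knative makes serverless", "max_length": 30}'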

Deploying LLM Inference with Knative

Now let’s create a Knative Service that deploys our LLM inference workload with intelligent autoscaling.

Basic Knative Service Configuration

Create llm-service.yaml:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-inference
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Wait 5 minutes after traffic drops before scaling down (to zero, per min-scale below)
        autoscaling.knative.dev/scale-down-delay: "5m"
        # Minimum number of replicas
        autoscaling.knative.dev/min-scale: "0"
        # Maximum number of replicas
        autoscaling.knative.dev/max-scale: "10"
        # Target requests per second per replica (paired with the rps metric below)
        autoscaling.knative.dev/target: "1"
        # Use RPS-based autoscaling
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/target-utilization-percentage: "70"
    spec:
      containerConcurrency: 1
      timeoutSeconds: 300
      containers:
      - name: llm-container
        image: your-registry/llm-inference:v1
        ports:
        - containerPort: 8080
          protocol: TCP
        env:
        - name: MODEL_NAME
          value: "distilgpt2"
        - name: MAX_LENGTH
          value: "100"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5

Deploy the service:

kubectl apply -f llm-service.yaml

# Check the service status
kubectl get ksvc llm-inference

# Get the service URL
kubectl get ksvc llm-inference -o jsonpath='{.status.url}'

Advanced Configuration: GPU Support

For production LLM workloads, GPU acceleration is essential. Here’s how to configure Knative for GPU-enabled inference:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-inference-gpu
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "5"
        autoscaling.knative.dev/target: "1"
        # Longer scale-down delay for GPU instances
        autoscaling.knative.dev/scale-down-delay: "10m"
    spec:
      containerConcurrency: 1
      timeoutSeconds: 600
      containers:
      - name: llm-container
        image: your-registry/llm-inference-gpu:v1
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_NAME
          value: "meta-llama/Llama-2-7b-chat-hf"
        resources:
          requests:
            memory: "16Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            cpu: "8000m"
            nvidia.com/gpu: "1"
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
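
A note on the nodeSelector field: stock Knative Serving rejects it unless the corresponding feature flag is enabled in the config-features ConfigMap. The flag name below matches recent releases, but treat it as an assumption to verify against your version's documentation:

kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-nodeselector":"enabled"}}'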

Testing the Inference Service

Once deployed, test your service with curl or any HTTP client:

# Get the service URL
SERVICE_URL=$(kubectl get ksvc llm-inference -o jsonpath='{.status.url}')

# Test the health endpoint
curl $SERVICE_URL/health

# Generate text
curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The future of artificial intelligence is",
    "max_length": 50,
    "temperature": 0.7
  }'
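
To see the autoscaler react, generate sustained load and watch pods come and go. This assumes a load generator such as hey is installed; any HTTP load tool works:

# 60 seconds of load at 20 concurrent requests
hey -z 60s -c 20 -m POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_length": 30}' \
  "$SERVICE_URL/generate"

# In a second terminal, watch replicas scale up and (later) back down
kubectl get pods -l serving.knative.dev/service=llm-inference -w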

Monitoring and Observability

Knative exposes built-in metrics for monitoring your LLM service. Point the metrics backend at Prometheus via the config-observability ConfigMap (you still need a Prometheus instance scraping your cluster):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  metrics.backend-destination: prometheus
  metrics.request-metrics-backend-destination: prometheus
  metrics.allow-stackdriver-custom-metrics: "false"

Key metrics to monitor:

  • knative_serving_revision_request_count: Total requests per revision
  • knative_serving_revision_request_latencies: Request latency distribution
  • knative_serving_actual_pods: Current number of active pods
  • knative_serving_desired_pods: Desired pod count based on autoscaling

Best Practices and Optimization Tips

1. Model Caching Strategy

Pre-download models during container build to minimize cold start times. For very large models, consider using init containers with persistent volumes:

spec:
  template:
    spec:
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
      initContainers:
      - name: model-downloader
        image: your-registry/model-downloader:v1
        volumeMounts:
        - name: model-cache
          mountPath: /models
      containers:
      - name: llm-container
        volumeMounts:
        - name: model-cache
          mountPath: /models
          readOnly: true
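
As with nodeSelector earlier, init containers and PersistentVolumeClaims in revision specs are typically gated behind Knative feature flags. The flag names below match recent releases but should be verified against your version's docs:

kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-init-containers":"enabled","kubernetes.podspec-persistent-volume-claim":"enabled"}}'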

2. Concurrency Configuration

LLM inference is typically memory and compute-intensive. Set containerConcurrency: 1 to ensure one request per pod, preventing resource contention.

3. Autoscaling Tuning

For LLMs with high initialization costs, consider these settings (summarized in the sketch after this list):

  • Set min-scale: 1 to maintain at least one warm instance
  • Increase scale-down-delay to 10-15 minutes for GPU workloads
  • Use target: 1 for RPS-based scaling to ensure responsive scaling
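
Put together, these recommendations translate into annotations along these lines (the values are illustrative starting points, not tuned defaults):

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"          # keep one warm instance
        autoscaling.knative.dev/max-scale: "10"
        autoscaling.knative.dev/scale-down-delay: "15m" # slower scale-down for GPU nodes
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/target: "1"             # one request per second per replica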

4. Request Timeout Configuration

Large models may require longer processing times. Adjust timeoutSeconds based on your model’s inference latency:

spec:
  template:
    spec:
      timeoutSeconds: 600  # 10 minutes for large models
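
One caveat, stated as an assumption about default settings: Knative caps per-revision timeouts with max-revision-timeout-seconds in the config-defaults ConfigMap (commonly 600 seconds), so values beyond that cap need it raised as well:

kubectl patch configmap/config-defaults \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"max-revision-timeout-seconds":"1200"}}'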

Troubleshooting Common Issues

Issue: Pods Not Scaling to Zero

Check if there are active connections or if the scale-down delay hasn’t elapsed:

# Check revision status
kubectl get revisions -l serving.knative.dev/service=llm-inference

# Check pod status
kubectl get pods -l serving.knative.dev/service=llm-inference

# View autoscaler logs
kubectl logs -n knative-serving -l app=autoscaler

Issue: High Cold Start Latency

Solutions:

  • Pre-download models during container build
  • Use smaller base images (for example, slim or distroless Python variants)
  • Set min-scale: 1 to maintain warm instances
  • Implement model quantization for faster loading

Issue: Out of Memory Errors

Increase memory limits or use model quantization:

# Add to inference_server.py
# Note: 8-bit loading requires the bitsandbytes and accelerate packages
# and a CUDA-capable GPU.
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto"
)
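
If you adopt 8-bit loading, the extra packages also need to be baked into the image; a minimal sketch of the Dockerfile addition (unpinned here, pin versions to match your environment):

# Add alongside the other pip installs in the Dockerfile
RUN pip install --no-cache-dir bitsandbytes accelerate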

Production Deployment Checklist

Before deploying to production, ensure:

  • ✓ Resource limits are appropriately set based on model size
  • ✓ Health checks are configured with appropriate timeouts
  • ✓ Autoscaling parameters are tuned for your traffic patterns
  • ✓ Monitoring and alerting are configured
  • ✓ Model artifacts are versioned and stored securely
  • ✓ Network policies restrict access to authorized clients
  • ✓ Rate limiting is implemented to prevent abuse
  • ✓ Cost monitoring is in place for GPU resources

Conclusion

Knative provides a powerful platform for deploying LLM inference workloads with true serverless capabilities. By leveraging automatic scaling, scale-to-zero, and built-in traffic management, you can build cost-effective, scalable AI services that handle variable workloads efficiently.

The combination of Knative’s serverless features and modern LLM frameworks creates a robust foundation for production AI applications. Whether you’re deploying small models for edge cases or large language models requiring GPU acceleration, Knative’s flexible architecture adapts to your needs.

Start experimenting with the configurations provided in this guide, and adapt them to your specific use cases. The serverless AI revolution is here—and Knative is leading the way.

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team
The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience across industries and technical domains.