
Model Serving at Scale: TorchServe on Kubernetes Guide 2024


As machine learning models grow in complexity and organizations demand real-time inference at scale, the need for robust model serving infrastructure becomes critical. TorchServe, PyTorch’s official model serving framework, combined with Kubernetes orchestration, provides a production-ready solution for deploying ML models at enterprise scale.

In this comprehensive guide, we’ll explore how to deploy TorchServe on Kubernetes, implement autoscaling, optimize performance, and apply production-grade best practices for serving PyTorch models.

Why TorchServe on Kubernetes?

TorchServe offers several advantages for serving PyTorch models:

  • Multi-model serving: Deploy and manage multiple models simultaneously
  • Model versioning: Support for A/B testing and canary deployments
  • Built-in metrics: Prometheus-compatible metrics out of the box
  • RESTful and gRPC APIs: Flexible inference endpoints
  • Dynamic batching: Automatic request batching for throughput optimization

When combined with Kubernetes, you gain horizontal scaling, self-healing, rolling updates, and declarative infrastructure management—essential capabilities for production ML systems.
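
To make these APIs concrete, here is a minimal Python sketch (the hostname is a placeholder, and it assumes a TorchServe instance is already running on its default ports) that checks the inference API's health endpoint and lists models through the management API:

# api_check.py -- minimal sketch; replace the hostname with your TorchServe address
import requests

HOST = "http://localhost"  # placeholder; in-cluster this becomes the Service address

# Inference API (port 8080): health check
print(requests.get(f"{HOST}:8080/ping", timeout=5).json())

# Management API (port 8081): list registered models
print(requests.get(f"{HOST}:8081/models", timeout=5).json())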

Prerequisites and Environment Setup

Before deploying TorchServe on Kubernetes, ensure you have:

  • A running Kubernetes cluster (1.19+)
  • kubectl configured and connected to your cluster
  • Docker installed for building custom images
  • Basic understanding of PyTorch model formats

Creating a TorchServe Model Archive

First, let’s create a model archive (MAR file) that TorchServe can serve. Here’s a simple example using a pre-trained ResNet model:

import torch
import torchvision.models as models

# Load a pre-trained ResNet-50 (pretrained=True is deprecated in newer torchvision)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

# Save the model
torch.save(model.state_dict(), 'resnet50.pth')

Create a custom handler for inference logic:

# handler.py
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.image_classifier import ImageClassifier

class ResNetHandler(ImageClassifier):
    def __init__(self):
        super().__init__()
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])

    def preprocess(self, data):
        images = []
        for row in data:
            # Requests arrive as raw bytes under "data" or "body"
            image = row.get("data") or row.get("body")
            if isinstance(image, (bytes, bytearray)):
                image = Image.open(io.BytesIO(image)).convert("RGB")
            images.append(self.transform(image))
        return torch.stack(images)

Generate the model archive using torch-model-archiver:

# Install torch-model-archiver
pip install torch-model-archiver

# Create model archive
torch-model-archiver --model-name resnet50 \
  --version 1.0 \
  --model-file model.py \
  --serialized-file resnet50.pth \
  --handler handler.py \
  --extra-files index_to_name.json \
  --export-path model-store/
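
The --model-file argument expects a Python file that defines the model class for eager-mode loading, and --extra-files points at an index_to_name.json label map; neither is shown above and both are assumed to exist. A minimal model.py sketch for ResNet-50 that mirrors the torchvision definition so the saved state_dict keys line up (following the pattern used in TorchServe's own image-classifier examples):

# model.py -- eager-mode model definition for the archive
# Subclassing torchvision's ResNet with the ResNet-50 layer layout keeps the
# class's state_dict keys identical to those saved in resnet50.pth.
from torchvision.models.resnet import ResNet, Bottleneck


class ImageClassifier(ResNet):
    def __init__(self):
        # Bottleneck blocks with a [3, 4, 6, 3] layout define the ResNet-50 architecture
        super().__init__(Bottleneck, [3, 4, 6, 3])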

Building a Production-Ready TorchServe Docker Image

While TorchServe provides official Docker images, creating a custom image allows for better control and optimization:

# Pin a specific TorchServe version in production instead of latest-gpu
FROM pytorch/torchserve:latest-gpu

# Install additional dependencies
RUN pip install --no-cache-dir \
    transformers==4.30.0 \
    pillow==10.0.0 \
    numpy==1.24.3

# Create model store directory
RUN mkdir -p /home/model-server/model-store

# Copy model archives
COPY model-store/*.mar /home/model-server/model-store/

# Copy configuration
COPY config.properties /home/model-server/config.properties

# Expose ports
EXPOSE 8080 8081 8082 7070 7071

# Set working directory
WORKDIR /home/model-server

# Start TorchServe in the foreground so it remains the container's main process
CMD ["torchserve", \
     "--start", \
     "--foreground", \
     "--model-store", "/home/model-server/model-store", \
     "--ts-config", "/home/model-server/config.properties"]

Create a TorchServe configuration file:

# config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082

# Performance tuning
default_workers_per_model=4
job_queue_size=100
max_request_size=104857600
max_response_size=104857600

# Enable metrics
enable_metrics_api=true
metrics_format=prometheus

# Request timeout (seconds)
default_response_timeout=120

Build and push the Docker image:

docker build -t your-registry/torchserve:v1.0 .
docker push your-registry/torchserve:v1.0

Deploying TorchServe on Kubernetes

Creating the Deployment Configuration

Here’s a production-ready Kubernetes deployment manifest. It assumes the ml-serving namespace exists and that a PersistentVolumeClaim named torchserve-models-pvc is populated with your .mar files (the volume mount shadows the archives baked into the image). Also note that the Service exposes the management port 8081 through the LoadBalancer; in production, keep that port cluster-internal or restrict it with network policies:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve
  namespace: ml-serving
  labels:
    app: torchserve
    version: v1.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
        version: v1.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8082"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: torchserve
        image: your-registry/torchserve:v1.0
        ports:
        - containerPort: 8080
          name: inference
          protocol: TCP
        - containerPort: 8081
          name: management
          protocol: TCP
        - containerPort: 8082
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "8Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
        livenessProbe:
          httpGet:
            path: /ping
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ping
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
          timeoutSeconds: 3
        env:
        - name: TS_NUMBER_OF_GPU
          value: "1"
        - name: TS_ENABLE_METRICS_API
          value: "true"
        volumeMounts:
        - name: model-store
          mountPath: /home/model-server/model-store
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: torchserve-models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: torchserve
  namespace: ml-serving
  labels:
    app: torchserve
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: inference
  - port: 8081
    targetPort: 8081
    name: management
  - port: 8082
    targetPort: 8082
    name: metrics
  selector:
    app: torchserve

Implementing Horizontal Pod Autoscaling

Configure the HPA to scale on CPU, memory, or custom metrics. The Pods metric below assumes a metrics adapter (for example, prometheus-adapter) exposes TorchServe's request counter through the Kubernetes custom metrics API:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: torchserve-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: torchserve
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: torchserve_inference_requests_total
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max

Deploy the resources:

# Create namespace
kubectl create namespace ml-serving

# Apply configurations
kubectl apply -f torchserve-deployment.yaml
kubectl apply -f torchserve-hpa.yaml

# Verify deployment
kubectl get pods -n ml-serving
kubectl get svc -n ml-serving

Model Management and Dynamic Loading

TorchServe supports dynamic model registration without restarting pods:

# Register a new model
curl -X POST "http://<TORCHSERVE_SERVICE>:8081/models?url=resnet50.mar&initial_workers=4&synchronous=true"

# List registered models
curl http://<TORCHSERVE_SERVICE>:8081/models

# Scale workers for a specific model
curl -X PUT "http://<TORCHSERVE_SERVICE>:8081/models/resnet50?min_worker=2&max_worker=8"

# Unregister a model
curl -X DELETE http://<TORCHSERVE_SERVICE>:8081/models/resnet50
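
With a model registered, you can exercise the inference API on port 8080. A minimal Python client sketch (the hostname and test.jpg are placeholders for your setup):

# predict.py -- minimal sketch; replace the hostname and image path for your setup
import requests

INFERENCE_URL = "http://<TORCHSERVE_SERVICE>:8080/predictions/resnet50"

# TorchServe's prediction endpoint accepts the raw image bytes in the request body
with open("test.jpg", "rb") as f:
    response = requests.post(INFERENCE_URL, data=f.read(), timeout=30)

response.raise_for_status()
# The image-classifier handler returns a JSON map of class label to probability
print(response.json())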

Monitoring and Observability

Integrating with Prometheus

Create a ServiceMonitor for Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: torchserve-metrics
  namespace: ml-serving
  labels:
    app: torchserve
spec:
  selector:
    matchLabels:
      app: torchserve
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Key metrics to monitor:

  • ts_inference_requests_total: Total inference requests
  • ts_inference_latency_microseconds: Request latency
  • ts_queue_latency_microseconds: Queue wait time
  • Worker-level metrics such as WorkerThreadTime: Worker utilization and load
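
A quick way to confirm these counters are being exposed before wiring up dashboards is to scrape the metrics endpoint directly; a minimal sketch (the hostname is a placeholder):

# check_metrics.py -- minimal sketch; replace the hostname with your Service address
import requests

METRICS_URL = "http://<TORCHSERVE_SERVICE>:8082/metrics"

resp = requests.get(METRICS_URL, timeout=5)
resp.raise_for_status()

# Print only TorchServe's own metric samples (comment lines start with '#')
for line in resp.text.splitlines():
    if line.startswith("ts_"):
        print(line)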

Performance Optimization Best Practices

1. Batch Processing Configuration

Dynamic batching is configured per model rather than globally. The simplest approach is to pass batch_size and max_batch_delay (the maximum wait in milliseconds before an incomplete batch is run) when registering the model through the management API:

# Register the model with server-side batching enabled
curl -X POST "http://<TORCHSERVE_SERVICE>:8081/models?url=resnet50.mar&initial_workers=4&batch_size=8&max_batch_delay=100"

2. GPU Resource Management

To pin pods to GPU nodes, add node selectors plus tolerations for the GPU node taint under the Deployment's pod spec:

nodeSelector:
  accelerator: nvidia-tesla-v100
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

3. Model Optimization

Use TorchScript for faster, Python-free inference. A traced model can be packaged with torch-model-archiver via --serialized-file, and TorchScript archives do not require a --model-file:

import torch

# Trace the eval-mode model from earlier with a representative input shape
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')
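
Tracing records a single execution path, so models with data-dependent control flow are better compiled with scripting; a minimal sketch:

import torch

# torch.jit.script compiles the model's Python code, preserving control flow
# that tracing would otherwise bake into one fixed path
scripted_model = torch.jit.script(model)
scripted_model.save('model_scripted.pt')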

Troubleshooting Common Issues

Pod Crashes Due to OOM

Confirm the pod was OOMKilled, then raise the container's memory limit or reduce the model's batch size:

# Check pod memory usage
kubectl top pods -n ml-serving

# Describe pod for OOMKilled events
kubectl describe pod <pod-name> -n ml-serving

High Inference Latency

Debug with TorchServe logs:

# View logs
kubectl logs -f <pod-name> -n ml-serving

# Check worker status
curl http://<TORCHSERVE_SERVICE>:8081/models/resnet50

Model Loading Failures

Verify model archive integrity:

# MAR files are standard zip archives; extract and inspect
mkdir -p /tmp/model-debug
unzip -o /path/to/model.mar -d /tmp/model-debug

# Check the manifest
cat /tmp/model-debug/MAR-INF/MANIFEST.json

Production Deployment Checklist

  • ✓ Implement proper resource requests and limits
  • ✓ Configure health checks (liveness and readiness probes)
  • ✓ Enable horizontal pod autoscaling
  • ✓ Set up monitoring and alerting
  • ✓ Implement request/response logging
  • ✓ Use persistent volumes for model storage
  • ✓ Configure network policies for security
  • ✓ Implement rate limiting at ingress level
  • ✓ Set up CI/CD pipelines for model updates
  • ✓ Document model versions and dependencies

Conclusion

Deploying TorchServe on Kubernetes provides a robust, scalable infrastructure for serving PyTorch models in production. By following the practices outlined in this guide—from containerization and deployment to monitoring and optimization—you can build a reliable ML serving platform that handles production workloads efficiently.

The combination of TorchServe’s model management capabilities and Kubernetes’ orchestration power enables teams to focus on model development while maintaining operational excellence. As your ML infrastructure grows, this foundation will scale with your needs, supporting everything from experimental deployments to mission-critical production services.

Start with a simple deployment, monitor your metrics closely, and iterate based on your specific performance requirements. The modular nature of this architecture allows you to optimize individual components without disrupting your entire serving infrastructure.

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team: The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience across industries and technical domains.