As machine learning models grow in complexity and organizations demand real-time inference at scale, the need for robust model serving infrastructure becomes critical. TorchServe, PyTorch’s official model serving framework, combined with Kubernetes orchestration, provides a production-ready solution for deploying ML models at enterprise scale.
In this comprehensive guide, we’ll explore how to deploy TorchServe on Kubernetes, implement autoscaling, optimize performance, and apply production-grade best practices for serving PyTorch models.
Why TorchServe on Kubernetes?
TorchServe offers several advantages for serving PyTorch models:
- Multi-model serving: Deploy and manage multiple models simultaneously
- Model versioning: Support for A/B testing and canary deployments
- Built-in metrics: Prometheus-compatible metrics out of the box
- RESTful and gRPC APIs: Flexible inference endpoints
- Dynamic batching: Automatic request batching for throughput optimization
When combined with Kubernetes, you gain horizontal scaling, self-healing, rolling updates, and declarative infrastructure management—essential capabilities for production ML systems.
Prerequisites and Environment Setup
Before deploying TorchServe on Kubernetes, ensure you have:
- A running Kubernetes cluster (1.19+)
- kubectl configured and connected to your cluster
- Docker installed for building custom images
- Basic understanding of PyTorch model formats
Creating a TorchServe Model Archive
First, let’s create a model archive (MAR file) that TorchServe can serve. Here’s a simple example using a pre-trained ResNet model:
import torch
import torchvision.models as models

# Load a pre-trained ResNet-50 (the weights argument replaces the deprecated pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

# Save only the weights (state_dict); the architecture is supplied separately via model.py
torch.save(model.state_dict(), 'resnet50.pth')
Create a custom handler for inference logic:
# handler.py
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.image_classifier import ImageClassifier


class ResNetHandler(ImageClassifier):
    def __init__(self):
        super(ResNetHandler, self).__init__()
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])

    def preprocess(self, data):
        images = []
        for row in data:
            # Request payloads arrive as raw bytes; decode them into RGB PIL images
            image = row.get("data") or row.get("body")
            if isinstance(image, (bytes, bytearray)):
                image = Image.open(io.BytesIO(image)).convert("RGB")
            images.append(self.transform(image))
        # Stack into a single batch tensor on the worker's device
        return torch.stack(images).to(self.device)
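Because the archive ships only a state_dict, torch-model-archiver also needs a model file that defines the network architecture. Here is a minimal sketch of the model.py referenced below; the class name is illustrative, and TorchServe's eager mode expects the file to contain a single model class:
# model.py
from torchvision.models.resnet import Bottleneck, ResNet


class ResNet50Classifier(ResNet):
    """ResNet-50 architecture; the handler loads resnet50.pth into this class."""

    def __init__(self):
        # Bottleneck blocks in a [3, 4, 6, 3] layout define ResNet-50
        super(ResNet50Classifier, self).__init__(Bottleneck, [3, 4, 6, 3])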
Generate the model archive using torch-model-archiver:
# Install torch-model-archiver
pip install torch-model-archiver
# Create model archive
torch-model-archiver --model-name resnet50 \
--version 1.0 \
--model-file model.py \
--serialized-file resnet50.pth \
--handler handler.py \
--extra-files index_to_name.json \
--export-path model-store/
Building a Production-Ready TorchServe Docker Image
While TorchServe provides official Docker images, creating a custom image allows for better control and optimization:
FROM pytorch/torchserve:latest-gpu
# Install additional dependencies
RUN pip install --no-cache-dir \
transformers==4.30.0 \
pillow==10.0.0 \
numpy==1.24.3
# Create model store directory
RUN mkdir -p /home/model-server/model-store
# Copy model archives
COPY model-store/*.mar /home/model-server/model-store/
# Copy configuration
COPY config.properties /home/model-server/config.properties
# Expose ports
EXPOSE 8080 8081 8082 7070 7071
# Set working directory
WORKDIR /home/model-server
# Start TorchServe
CMD ["torchserve", \
"--start", \
"--model-store", "/home/model-server/model-store", \
"--ts-config", "/home/model-server/config.properties"]
Create a TorchServe configuration file:
# config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# Performance tuning
default_workers_per_model=4
job_queue_size=100
max_request_size=104857600
max_response_size=104857600
# Enable metrics
enable_metrics_api=true
metrics_format=prometheus
# Request handling
default_response_timeout=120
Build and push the Docker image:
docker build -t your-registry/torchserve:v1.0 .
docker push your-registry/torchserve:v1.0
Deploying TorchServe on Kubernetes
Creating the Deployment Configuration
Here’s a production-ready Kubernetes deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: torchserve
  namespace: ml-serving
  labels:
    app: torchserve
    version: v1.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: torchserve
  template:
    metadata:
      labels:
        app: torchserve
        version: v1.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8082"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: torchserve
        image: your-registry/torchserve:v1.0
        ports:
        - containerPort: 8080
          name: inference
          protocol: TCP
        - containerPort: 8081
          name: management
          protocol: TCP
        - containerPort: 8082
          name: metrics
          protocol: TCP
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
            nvidia.com/gpu: "1"
          limits:
            memory: "8Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
        livenessProbe:
          httpGet:
            path: /ping
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ping
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
          timeoutSeconds: 3
        env:
        - name: TS_NUMBER_OF_GPU
          value: "1"
        - name: TS_ENABLE_METRICS_API
          value: "true"
        volumeMounts:
        - name: model-store
          mountPath: /home/model-server/model-store
      volumes:
      - name: model-store
        persistentVolumeClaim:
          claimName: torchserve-models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: torchserve
  namespace: ml-serving
  labels:
    app: torchserve
spec:
  type: LoadBalancer
  ports:
  - port: 8080
    targetPort: 8080
    name: inference
  - port: 8081
    targetPort: 8081
    name: management
  - port: 8082
    targetPort: 8082
    name: metrics
  selector:
    app: torchserve
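Once the LoadBalancer has an external address, you can confirm that the pods are serving traffic by calling TorchServe's health endpoint. A minimal sketch, where <TORCHSERVE_SERVICE> stands for your service address:
import requests

# TorchServe exposes a health check at /ping on the inference port (8080)
TORCHSERVE_HOST = "http://<TORCHSERVE_SERVICE>:8080"  # replace with your LoadBalancer address

resp = requests.get(f"{TORCHSERVE_HOST}/ping", timeout=5)
resp.raise_for_status()
print(resp.json())  # expected: {"status": "Healthy"}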
Implementing Horizontal Pod Autoscaling
Configure HPA based on CPU, memory, or custom metrics. Note that the Pods-type metric below only works if a custom metrics adapter (for example, the Prometheus Adapter) exposes TorchServe's Prometheus metrics through the Kubernetes custom metrics API:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: torchserve-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: torchserve
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: torchserve_inference_requests_total
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max
Deploy the resources:
# Create namespace
kubectl create namespace ml-serving
# Apply configurations
kubectl apply -f torchserve-deployment.yaml
kubectl apply -f torchserve-hpa.yaml
# Verify deployment
kubectl get pods -n ml-serving
kubectl get svc -n ml-serving
Model Management and Dynamic Loading
TorchServe supports dynamic model registration without restarting pods:
# Register a new model
curl -X POST "http://<TORCHSERVE_SERVICE>:8081/models?url=resnet50.mar&initial_workers=4&synchronous=true"
# List registered models
curl http://<TORCHSERVE_SERVICE>:8081/models
# Scale workers for a specific model
curl -X PUT "http://<TORCHSERVE_SERVICE>:8081/models/resnet50?min_worker=2&max_worker=8"
# Unregister a model
curl -X DELETE http://<TORCHSERVE_SERVICE>:8081/models/resnet50
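With a model registered, inference requests go to the predictions endpoint on the inference port (8080). A minimal Python client sketch, assuming a local test image named kitten.jpg:
import requests

# Send the raw image bytes to the predictions endpoint for the resnet50 model
with open("kitten.jpg", "rb") as f:
    resp = requests.post(
        "http://<TORCHSERVE_SERVICE>:8080/predictions/resnet50",
        data=f.read(),
        timeout=30,
    )

resp.raise_for_status()
print(resp.json())  # top-k class probabilities, mapped to labels via index_to_name.json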
Monitoring and Observability
Integrating with Prometheus
Create a ServiceMonitor for Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: torchserve-metrics
  namespace: ml-serving
  labels:
    app: torchserve
spec:
  selector:
    matchLabels:
      app: torchserve
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
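Before building dashboards, it is worth confirming that the metrics endpoint is serving Prometheus-format data. A minimal sketch, assuming the metrics port has been forwarded locally (for example with kubectl port-forward svc/torchserve 8082:8082 -n ml-serving):
import requests

# Fetch the Prometheus-format metrics TorchServe exposes on port 8082
resp = requests.get("http://localhost:8082/metrics", timeout=5)
resp.raise_for_status()

# Print only the TorchServe metric lines, skipping comments and blank lines
for line in resp.text.splitlines():
    if line.startswith("ts_"):
        print(line)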
Key metrics to monitor:
- ts_inference_requests_total: Total inference requests
- ts_inference_latency_microseconds: Request latency
- ts_queue_latency_microseconds: Queue wait time
- WorkerThreadTime and HandlerTime: Backend worker and handler execution time
Performance Optimization Best Practices
1. Batch Processing Configuration
Enable dynamic batching for improved throughput. In TorchServe, batch_size and max_batch_delay are configured per model (at registration time, or via the models property in config.properties) rather than as global keys, so the simplest approach is to pass them when registering the model through the management API:
# Register (or re-register) the model with dynamic batching enabled
curl -X POST "http://<TORCHSERVE_SERVICE>:8081/models?url=resnet50.mar&batch_size=8&max_batch_delay=100&initial_workers=4"
2. GPU Resource Management
For multi-GPU scenarios, use node selectors and tolerations that match the taints on your GPU nodes:
nodeSelector:
  accelerator: nvidia-tesla-v100
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
3. Model Optimization
Use TorchScript for faster inference:
import torch
import torchvision.models as models

# Load the model, trace it with an example input, and save the TorchScript archive
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save('model_traced.pt')
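Before archiving the traced model, it is worth checking that it matches the eager model; a minimal sketch with an arbitrary tolerance. Note that when you package a TorchScript file, torch-model-archiver no longer needs the --model-file argument, because the architecture is embedded in the serialized model:
import torch
import torchvision.models as models

# Reload both versions and compare outputs on a random batch
eager = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
scripted = torch.jit.load('model_traced.pt').eval()

with torch.no_grad():
    x = torch.rand(4, 3, 224, 224)
    assert torch.allclose(eager(x), scripted(x), atol=1e-5)
print("TorchScript trace matches eager outputs")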
Troubleshooting Common Issues
Pod Crashes Due to OOM
Increase memory limits or reduce batch size:
# Check pod memory usage
kubectl top pods -n ml-serving
# Describe pod for OOMKilled events
kubectl describe pod <pod-name> -n ml-serving
High Inference Latency
Debug with TorchServe logs:
# View logs
kubectl logs -f <pod-name> -n ml-serving
# Check worker status
curl http://<TORCHSERVE_SERVICE>:8081/models/resnet50
Model Loading Failures
Verify model archive integrity:
# Extract and inspect MAR file
mkdir -p /tmp/model-debug
cd /tmp/model-debug
jar xvf /path/to/model.mar
# Check manifest
cat MAR-INF/MANIFEST.json
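If the jar tool is not available, the same inspection can be done with Python's zipfile module, since a MAR file is a standard zip archive; a minimal sketch:
import json
import zipfile

# A .mar file is a zip archive: list its contents and print the manifest
with zipfile.ZipFile("/path/to/model.mar") as mar:
    for name in mar.namelist():
        print(name)
    manifest = json.loads(mar.read("MAR-INF/MANIFEST.json"))
    print(json.dumps(manifest, indent=2))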
Production Deployment Checklist
- ✓ Implement proper resource requests and limits
- ✓ Configure health checks (liveness and readiness probes)
- ✓ Enable horizontal pod autoscaling
- ✓ Set up monitoring and alerting
- ✓ Implement request/response logging
- ✓ Use persistent volumes for model storage
- ✓ Configure network policies for security
- ✓ Implement rate limiting at ingress level
- ✓ Set up CI/CD pipelines for model updates
- ✓ Document model versions and dependencies
Conclusion
Deploying TorchServe on Kubernetes provides a robust, scalable infrastructure for serving PyTorch models in production. By following the practices outlined in this guide—from containerization and deployment to monitoring and optimization—you can build a reliable ML serving platform that handles production workloads efficiently.
The combination of TorchServe’s model management capabilities and Kubernetes’ orchestration power enables teams to focus on model development while maintaining operational excellence. As your ML infrastructure grows, this foundation will scale with your needs, supporting everything from experimental deployments to mission-critical production services.
Start with a simple deployment, monitor your metrics closely, and iterate based on your specific performance requirements. The modular nature of this architecture allows you to optimize individual components without disrupting your entire serving infrastructure.