Managing persistent storage for AI/ML workloads in Kubernetes presents unique challenges. Unlike stateless applications, machine learning workflows require stable, persistent storage for datasets, model checkpoints, and training artifacts. This is where Kubernetes StatefulSets become essential.
In this comprehensive guide, we’ll explore how to leverage StatefulSets to build robust, scalable storage infrastructure for AI/ML applications running on Kubernetes.
Why StatefulSets Matter for AI/ML Workloads
Traditional Kubernetes Deployments work well for stateless applications, but AI/ML workloads have different requirements:
- Persistent Identity: Training nodes need consistent network identities for distributed training frameworks like TensorFlow, PyTorch, and Horovod
- Stable Storage: Model checkpoints and datasets must persist across pod restarts
- Ordered Operations: Sequential deployment and scaling ensure data consistency
- Predictable Naming: Deterministic pod names enable easier debugging and monitoring
StatefulSets provide these guarantees, making them ideal for AI/ML storage infrastructure.
Understanding StatefulSet Architecture for ML Storage
StatefulSets maintain a sticky identity for each pod. When you create a StatefulSet with three replicas, Kubernetes creates pods named podname-0, podname-1, and podname-2. Each pod gets its own PersistentVolumeClaim (PVC), ensuring dedicated storage.
Key Components
- Headless Service: Provides stable network identity without load balancing
- VolumeClaimTemplates: Automatically provisions PVCs for each pod
- Pod Management Policy: Controls ordering and uniqueness guarantees
- Update Strategy: Manages rolling updates and rollbacks
Creating a StatefulSet for Distributed ML Training
Let’s build a StatefulSet for a distributed TensorFlow training cluster with persistent storage for model checkpoints and datasets.
Step 1: Create a Headless Service
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-training
  labels:
    app: ml-training
spec:
  ports:
    - port: 2222
      name: training
  clusterIP: None
  selector:
    app: ml-training
The clusterIP: None setting creates a headless service: instead of a single load-balanced virtual IP, each pod gets a stable DNS record of the form <pod-name>.<service-name> (for example, ml-training-0.tensorflow-training), enabling the direct pod-to-pod communication that distributed training frameworks require.
Step 2: Define the StatefulSet with Storage
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ml-training
spec:
  serviceName: tensorflow-training
  replicas: 3
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      containers:
        - name: tensorflow
          image: tensorflow/tensorflow:latest-gpu
          ports:
            - containerPort: 2222
              name: training
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: TF_CONFIG
              value: |
                {
                  "cluster": {
                    "worker": [
                      "ml-training-0.tensorflow-training:2222",
                      "ml-training-1.tensorflow-training:2222",
                      "ml-training-2.tensorflow-training:2222"
                    ]
                  },
                  "task": {"type": "worker", "index": 0}
                }
          volumeMounts:
            - name: model-storage
              mountPath: /mnt/models
            - name: dataset-storage
              mountPath: /mnt/datasets
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: 1
            limits:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: 1
  volumeClaimTemplates:
    - metadata:
        name: model-storage
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
    - metadata:
        name: dataset-storage
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard
        resources:
          requests:
            storage: 500Gi
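One caveat: the manifest above sets the same TF_CONFIG, with task index 0, on every replica, but each distributed worker needs its own index. A common pattern is to derive the index from the StatefulSet ordinal (the numeric suffix of the pod name) at startup. A minimal sketch of the container section, assuming a shell is available in the image and a hypothetical /app/train.py entrypoint:

```yaml
containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu
    command: ["/bin/bash", "-c"]
    args:
      - |
        # POD_NAME is ml-training-0, ml-training-1, ...; take the ordinal suffix.
        INDEX="${POD_NAME##*-}"
        export TF_CONFIG=$(cat <<EOF
        {
          "cluster": {"worker": [
            "ml-training-0.tensorflow-training:2222",
            "ml-training-1.tensorflow-training:2222",
            "ml-training-2.tensorflow-training:2222"]},
          "task": {"type": "worker", "index": ${INDEX}}
        }
        EOF
        )
        python /app/train.py  # hypothetical training entrypoint
```

Because the ordinal is stable across restarts, each worker reclaims the same role and the same volumes after rescheduling.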
Step 3: Deploy the StatefulSet
# Apply the configurations
kubectl apply -f tensorflow-service.yaml
kubectl apply -f tensorflow-statefulset.yaml
# Verify the deployment
kubectl get statefulset ml-training
kubectl get pods -l app=ml-training
kubectl get pvc
Advanced Storage Patterns for AI/ML
Multi-Tier Storage Strategy
AI/ML workloads benefit from a multi-tier storage approach:
- Hot Storage (SSD): For active training data and model checkpoints
- Warm Storage (Standard): For datasets and intermediate results
- Cold Storage (Object Storage): For archived models and historical data
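The cold tier can be implemented with a scheduled job that syncs finished checkpoints to object storage. A sketch using a CronJob; the AWS CLI image, the ml-archive bucket, and the PVC name (matching pod 0 of the earlier StatefulSet) are illustrative assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: checkpoint-archive
spec:
  schedule: "0 3 * * *"  # nightly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: archiver
              image: amazon/aws-cli:2.15.0
              command: ["aws", "s3", "sync", "/mnt/models", "s3://ml-archive/models"]
              volumeMounts:
                - name: checkpoints
                  mountPath: /mnt/models
                  readOnly: true
          volumes:
            - name: checkpoints
              persistentVolumeClaim:
                claimName: model-storage-ml-training-0  # PVC created by the StatefulSet
```

Note that a ReadWriteOnce volume can only be attached to one node at a time, so this pattern requires the archive pod to land on the same node as the training pod (or a ReadWriteMany storage class).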
Implementing Shared Dataset Storage
For read-only datasets shared across training pods, use ReadOnlyMany access mode:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Ti
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ml-training-shared
spec:
  serviceName: ml-service
  replicas: 5
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      containers:
        - name: trainer
          image: pytorch/pytorch:latest
          volumeMounts:
            - name: shared-data
              mountPath: /data
              readOnly: true
            - name: model-output
              mountPath: /output
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: shared-dataset
  volumeClaimTemplates:
    - metadata:
        name: model-output
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
Storage Class Configuration for ML Workloads
Choosing the right StorageClass is critical for performance. Here’s a high-performance configuration using CSI drivers:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-nvme
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "64"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Monitoring and Observability
Monitor your StatefulSet storage with these commands:
# Check StatefulSet status
kubectl describe statefulset ml-training
# Monitor PVC usage
kubectl get pvc -l app=ml-training
# Check storage capacity
kubectl exec ml-training-0 -- df -h /mnt/models
# View pod logs
kubectl logs ml-training-0 -f
# Check volume attachment status
kubectl get volumeattachment
Prometheus Metrics for Storage
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: ml-training
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod_name
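Beyond scraping pod metrics, the kubelet already exposes per-PVC usage gauges (kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes). If you run the Prometheus Operator, an alerting rule along these lines (resource names and the 85% threshold are illustrative) can warn you before a checkpoint volume fills up:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-storage-alerts
spec:
  groups:
    - name: ml-storage
      rules:
        - alert: MLVolumeAlmostFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"model-storage-ml-training-.*"}
              / kubelet_volume_stats_capacity_bytes > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} is over 85% full"
```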
Backup and Disaster Recovery
Implement automated backups for model checkpoints using Velero:
# Install Velero
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.7.0 \
--bucket ml-backups \
--backup-location-config region=us-west-2
# Create backup schedule
velero schedule create ml-training-backup \
--schedule="0 2 * * *" \
--include-namespaces ml-namespace \
--selector app=ml-training
# Restore from backup
velero restore create --from-backup ml-training-backup-20240115
Troubleshooting Common Issues
Pod Stuck in Pending State
# Check PVC status
kubectl get pvc
# Describe the pending pod
kubectl describe pod ml-training-0
# Check storage provisioner logs
kubectl logs -n kube-system -l app=ebs-csi-controller
Volume Mount Failures
Common causes and solutions:
- Insufficient capacity: Check node storage and increase volume size
- Zone mismatch: Ensure PV and pod are in the same availability zone
- Permission issues: Verify fsGroup and runAsUser in securityContext
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      containers:
        - name: trainer
          securityContext:
            allowPrivilegeEscalation: false
Slow I/O Performance
# Test disk performance inside pod
kubectl exec ml-training-0 -- fio \
--name=write-test \
--size=10G \
--filename=/mnt/models/test \
--ioengine=libaio \
--direct=1 \
--bs=4k \
--rw=write
# Check IOPS limits
kubectl describe pv pvc-xxxxxxxx
Best Practices for Production
- Resource Limits: Always set memory and CPU limits to prevent resource exhaustion
- Pod Disruption Budgets: Protect training jobs from voluntary disruptions
- Topology Spread: Distribute pods across nodes and zones for high availability
- Volume Expansion: Enable allowVolumeExpansion in StorageClass for growth
- Backup Strategy: Implement automated backups for critical model data
- Monitoring: Track storage metrics, IOPS, and throughput continuously
- Cost Optimization: Use lifecycle policies to move old data to cheaper storage tiers
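The topology-spread recommendation above can be expressed directly in the pod template. A sketch for the ml-training StatefulSet; the maxSkew values and ScheduleAnyway policy are reasonable defaults, not requirements:

```yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway  # zonal volumes pin pods after first scheduling
          labelSelector:
            matchLabels:
              app: ml-training
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: ml-training
```

With zonal block storage such as EBS, combine this with volumeBindingMode: WaitForFirstConsumer so volumes are provisioned in whatever zone the scheduler picks.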
Production-Ready PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-training-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ml-training
Scaling Considerations
When scaling StatefulSets for ML workloads:
# Scale up gradually
kubectl scale statefulset ml-training --replicas=5
# Monitor scaling progress
kubectl rollout status statefulset ml-training
# Scale down (removes highest ordinal first)
kubectl scale statefulset ml-training --replicas=2
Important: Scaling down doesn’t delete PVCs automatically. Clean them up manually if needed:
# List orphaned PVCs
kubectl get pvc | grep ml-training
# Delete specific PVC
kubectl delete pvc model-storage-ml-training-3
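On Kubernetes 1.27 and later (where the feature is stable), the StatefulSet controller can handle this cleanup itself via persistentVolumeClaimRetentionPolicy:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ml-training
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain  # keep checkpoint PVCs if the StatefulSet is deleted
    whenScaled: Delete   # drop PVCs for replicas removed by scale-down
  # ... rest of the spec unchanged
```

Use whenScaled: Delete with care: the data on those volumes is gone once the scale-down completes, so it only suits replicas whose state is reproducible or backed up.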
Conclusion
Kubernetes StatefulSets provide the foundation for robust, scalable storage infrastructure for AI/ML workloads. By leveraging persistent volumes, stable network identities, and ordered deployment guarantees, you can build production-grade machine learning platforms that handle petabytes of data reliably.
The key to success lies in understanding your workload characteristics, choosing appropriate storage classes, implementing comprehensive monitoring, and following best practices for backup and disaster recovery. With the configurations and patterns outlined in this guide, you’re well-equipped to deploy and manage AI/ML storage on Kubernetes at scale.
Start with the basic StatefulSet configuration, monitor performance metrics closely, and iterate based on your specific requirements. Remember that storage is often the bottleneck in ML pipelines—invest time in optimizing it for maximum training efficiency.