Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Kubernetes StatefulSets for AI/ML Storage: Complete Guide 2024


Managing persistent storage for AI/ML workloads in Kubernetes presents unique challenges. Unlike stateless applications, machine learning workflows require stable, persistent storage for datasets, model checkpoints, and training artifacts. This is where Kubernetes StatefulSets become essential.

In this comprehensive guide, we’ll explore how to leverage StatefulSets to build robust, scalable storage infrastructure for AI/ML applications running on Kubernetes.

Why StatefulSets Matter for AI/ML Workloads

Traditional Kubernetes Deployments work well for stateless applications, but AI/ML workloads have different requirements:

  • Persistent Identity: Training nodes need consistent network identities for distributed training frameworks like TensorFlow, PyTorch, and Horovod
  • Stable Storage: Model checkpoints and datasets must persist across pod restarts
  • Ordered Operations: Sequential deployment and scaling ensure data consistency
  • Predictable Naming: Deterministic pod names enable easier debugging and monitoring

StatefulSets provide these guarantees, making them ideal for AI/ML storage infrastructure.

Understanding StatefulSet Architecture for ML Storage

StatefulSets maintain a sticky identity for each pod. When you create a StatefulSet with three replicas, Kubernetes creates pods named podname-0, podname-1, and podname-2. Each pod gets its own PersistentVolumeClaim (PVC), ensuring dedicated storage.

Key Components

  • Headless Service: Provides stable network identity without load balancing
  • VolumeClaimTemplates: Automatically provisions PVCs for each pod
  • Pod Management Policy: Controls ordering and uniqueness guarantees
  • Update Strategy: Manages rolling updates and rollbacks

Creating a StatefulSet for Distributed ML Training

Let’s build a StatefulSet for a distributed TensorFlow training cluster with persistent storage for model checkpoints and datasets.

Step 1: Create a Headless Service

apiVersion: v1
kind: Service
metadata:
  name: tensorflow-training
  labels:
    app: ml-training
spec:
  ports:
  - port: 2222
    name: training
  clusterIP: None
  selector:
    app: ml-training

The clusterIP: None directive creates a headless service, allowing direct pod-to-pod communication essential for distributed training.
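With the headless service in place, each StatefulSet pod gets a stable DNS record of the form pod-name.service-name.namespace.svc.cluster.local. As a quick sketch, here is how the worker addresses used later in this guide can be constructed (assuming the default namespace and the names from these manifests):

```shell
# Build the stable per-pod addresses the headless service exposes.
# Assumes the "ml-training" StatefulSet and "tensorflow-training"
# service from this guide, deployed in the "default" namespace.
SERVICE=tensorflow-training
STATEFULSET=ml-training
REPLICAS=3
for i in $(seq 0 $((REPLICAS - 1))); do
  echo "${STATEFULSET}-${i}.${SERVICE}.default.svc.cluster.local:2222"
done
```

These are exactly the endpoints a distributed training framework needs in its cluster spec, which is why the headless service must exist before the StatefulSet references it via serviceName.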

Step 2: Define the StatefulSet with Storage

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ml-training
spec:
  serviceName: tensorflow-training
  replicas: 3
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        ports:
        - containerPort: 2222
          name: training
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: TF_CONFIG
          value: |
            {
              "cluster": {
                "worker": [
                  "ml-training-0.tensorflow-training:2222",
                  "ml-training-1.tensorflow-training:2222",
                  "ml-training-2.tensorflow-training:2222"
                ]
              },
              "task": {"type": "worker", "index": 0}
            }
        volumeMounts:
        - name: model-storage
          mountPath: /mnt/models
        - name: dataset-storage
          mountPath: /mnt/datasets
        resources:
          requests:
            memory: "16Gi"
            cpu: "4"
            nvidia.com/gpu: 1
          limits:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: 1
  volumeClaimTemplates:
  - metadata:
      name: model-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: dataset-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard
      resources:
        requests:
          storage: 500Gi

Step 3: Deploy the StatefulSet

# Apply the configurations
kubectl apply -f tensorflow-service.yaml
kubectl apply -f tensorflow-statefulset.yaml

# Verify the deployment
kubectl get statefulset ml-training
kubectl get pods -l app=ml-training
kubectl get pvc

Advanced Storage Patterns for AI/ML

Multi-Tier Storage Strategy

AI/ML workloads benefit from a multi-tier storage approach:

  • Hot Storage (SSD): For active training data and model checkpoints
  • Warm Storage (Standard): For datasets and intermediate results
  • Cold Storage (Object Storage): For archived models and historical data

Implementing Shared Dataset Storage

For read-only datasets shared across training pods, use ReadOnlyMany access mode:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 1Ti
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ml-training-shared
spec:
  serviceName: ml-service
  replicas: 5
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      containers:
      - name: trainer
        image: pytorch/pytorch:latest
        volumeMounts:
        - name: shared-data
          mountPath: /data
          readOnly: true
        - name: model-output
          mountPath: /output
      volumes:
      - name: shared-data
        persistentVolumeClaim:
          claimName: shared-dataset
  volumeClaimTemplates:
  - metadata:
      name: model-output
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 50Gi

Storage Class Configuration for ML Workloads

Choosing the right StorageClass is critical for performance. Here’s a high-performance configuration using CSI drivers:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-nvme
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "64"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Monitoring and Observability

Monitor your StatefulSet storage with these commands:

# Check StatefulSet status
kubectl describe statefulset ml-training

# Monitor PVC usage
kubectl get pvc -l app=ml-training

# Check storage capacity
kubectl exec ml-training-0 -- df -h /mnt/models

# View pod logs
kubectl logs ml-training-0 -f

# Check volume attachment status
kubectl get volumeattachment

Prometheus Metrics for Storage

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: ml-training
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod_name
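The kubelet exports per-PVC capacity metrics (kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes). If you run the Prometheus Operator, an alerting rule along these lines can warn before a checkpoint volume fills up; the rule name, PVC name pattern, and 85% threshold here are illustrative choices, not part of the setup above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-storage-alerts
spec:
  groups:
  - name: ml-storage
    rules:
    - alert: MLVolumeAlmostFull
      expr: |
        kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"model-storage-ml-training-.*"}
          / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"model-storage-ml-training-.*"}
          > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} is over 85% full"
```

Alerting on usage ratio rather than absolute bytes keeps the rule valid even after you expand the volumes.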

Backup and Disaster Recovery

Implement automated backups for model checkpoints using Velero:

# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.7.0 \
  --bucket ml-backups \
  --backup-location-config region=us-west-2

# Create backup schedule
velero schedule create ml-training-backup \
  --schedule="0 2 * * *" \
  --include-namespaces ml-namespace \
  --selector app=ml-training

# Restore from backup
velero restore create --from-backup ml-training-backup-20240115

Troubleshooting Common Issues

Pod Stuck in Pending State

# Check PVC status
kubectl get pvc

# Describe the pending pod
kubectl describe pod ml-training-0

# Check storage provisioner logs
kubectl logs -n kube-system -l app=ebs-csi-controller

Volume Mount Failures

Common causes and solutions:

  • Insufficient capacity: Check node storage and increase volume size
  • Zone mismatch: Ensure PV and pod are in the same availability zone
  • Permission issues: Verify fsGroup and runAsUser in securityContext

For example, set them in the pod template (a minimal fragment):
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      containers:
      - name: trainer
        securityContext:
          allowPrivilegeEscalation: false

Slow I/O Performance

# Test disk performance inside pod
kubectl exec ml-training-0 -- fio \
  --name=write-test \
  --size=10G \
  --filename=/mnt/models/test \
  --ioengine=libaio \
  --direct=1 \
  --bs=4k \
  --rw=write

# Check IOPS limits
kubectl describe pv pvc-xxxxxxxx

Best Practices for Production

  • Resource Limits: Always set memory and CPU limits to prevent resource exhaustion
  • Pod Disruption Budgets: Protect training jobs from voluntary disruptions
  • Topology Spread: Distribute pods across nodes and zones for high availability
  • Volume Expansion: Enable allowVolumeExpansion in StorageClass for growth
  • Backup Strategy: Implement automated backups for critical model data
  • Monitoring: Track storage metrics, IOPS, and throughput continuously
  • Cost Optimization: Use lifecycle policies to move old data to cheaper storage tiers

Production-Ready PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-training-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ml-training
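This PDB lets the eviction API perform at most replicas − minAvailable voluntary disruptions at a time, so with the three-replica StatefulSet from earlier, a node drain can take down only one trainer at once:

```shell
# Voluntary disruptions permitted for the 3-replica StatefulSet
# under a PDB with minAvailable: 2.
REPLICAS=3
MIN_AVAILABLE=2
echo $((REPLICAS - MIN_AVAILABLE))   # -> 1
```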

Scaling Considerations

When scaling StatefulSets for ML workloads:

# Scale up gradually
kubectl scale statefulset ml-training --replicas=5

# Monitor scaling progress
kubectl rollout status statefulset ml-training

# Scale down (removes highest ordinal first)
kubectl scale statefulset ml-training --replicas=2

Important: Scaling down doesn’t delete PVCs automatically. Clean them up manually if needed:

# List orphaned PVCs
kubectl get pvc | grep ml-training

# Delete specific PVC
kubectl delete pvc model-storage-ml-training-3
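When you scale down by several replicas, a small loop saves typing. The sketch below only prints the delete commands (drop the echo to actually run them); the PVC names follow the claim-template-statefulset-ordinal pattern used throughout this guide:

```shell
# Print delete commands for PVCs orphaned by scaling ml-training
# from 5 replicas down to 2 (ordinals 2, 3, 4 are left behind).
STATEFULSET=ml-training
NEW_REPLICAS=2
OLD_REPLICAS=5
for i in $(seq "$NEW_REPLICAS" $((OLD_REPLICAS - 1))); do
  echo kubectl delete pvc "model-storage-${STATEFULSET}-${i}" "dataset-storage-${STATEFULSET}-${i}"
done
```

Double-check the ordinal range before executing: deleting a PVC that a surviving pod still claims will strand that pod at its next restart.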

Conclusion

Kubernetes StatefulSets provide the foundation for robust, scalable storage infrastructure for AI/ML workloads. By leveraging persistent volumes, stable network identities, and ordered deployment guarantees, you can build production-grade machine learning platforms that handle petabytes of data reliably.

The key to success lies in understanding your workload characteristics, choosing appropriate storage classes, implementing comprehensive monitoring, and following best practices for backup and disaster recovery. With the configurations and patterns outlined in this guide, you’re well-equipped to deploy and manage AI/ML storage on Kubernetes at scale.

Start with the basic StatefulSet configuration, monitor performance metrics closely, and iterate based on your specific requirements. Remember that storage is often the bottleneck in ML pipelines—invest time in optimizing it for maximum training efficiency.

Have Queries? Join https://launchpass.com/collabnix
