As AI agents become increasingly sophisticated, orchestrating them at scale requires robust infrastructure. Kubernetes Jobs and CronJobs provide a solid foundation for running AI workloads, from one-time model training tasks to scheduled inference pipelines. This guide walks you through building production-ready AI agents using Kubernetes-native resources.
Why Kubernetes for AI Agents?
Kubernetes Jobs and CronJobs offer several advantages for AI agent orchestration:
- Reliability: Automatic retry mechanisms and failure handling
- Scalability: Parallel execution for distributed AI workloads
- Resource Management: Fine-grained control over CPU, GPU, and memory allocation
- Scheduling: Native cron-based scheduling for periodic tasks
- Observability: Built-in logging and monitoring integration
Understanding Kubernetes Jobs vs CronJobs
Before diving into implementation, let’s clarify the distinction:
Jobs are designed for one-time execution tasks that run to completion. Perfect for:
- Model training workflows
- Batch inference processing
- Data preprocessing pipelines
- One-time agent deployments
CronJobs create Jobs on a schedule, ideal for:
- Periodic model retraining
- Scheduled data collection agents
- Regular model evaluation tasks
- Automated report generation
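Structurally, a CronJob is simply a Job spec wrapped in a schedule: everything that would sit under a Job's spec moves under the CronJob's jobTemplate. The minimal sketches below illustrate the relationship (the names and image are placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-agent
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: agent
        image: your-registry/hello-agent:v1.0
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-agent-hourly
spec:
  schedule: "0 * * * *"    # same workload, created once per hour
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: agent
            image: your-registry/hello-agent:v1.0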
Building Your First AI Agent with Kubernetes Jobs
Let’s build a sentiment analysis AI agent that processes a batch of customer reviews.
Step 1: Create the AI Agent Container
First, create a Python-based AI agent using the transformers library (the requirements.txt referenced in the Dockerfile below should list at least transformers, torch, and boto3):
import os
import json
from transformers import pipeline
import boto3


class SentimentAnalysisAgent:
    def __init__(self):
        self.classifier = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english"
        )
        self.s3_client = boto3.client('s3')

    def process_reviews(self, bucket, key):
        # Download data from S3
        obj = self.s3_client.get_object(Bucket=bucket, Key=key)
        reviews = json.loads(obj['Body'].read())

        # Process sentiment analysis
        results = []
        for review in reviews:
            sentiment = self.classifier(review['text'])[0]
            results.append({
                'review_id': review['id'],
                'sentiment': sentiment['label'],
                'confidence': sentiment['score']
            })

        # Upload results
        output_key = f"results/{key}"
        self.s3_client.put_object(
            Bucket=bucket,
            Key=output_key,
            Body=json.dumps(results)
        )
        print(f"Processed {len(results)} reviews")


if __name__ == "__main__":
    agent = SentimentAnalysisAgent()
    bucket = os.getenv('S3_BUCKET')
    key = os.getenv('INPUT_KEY')
    agent.process_reviews(bucket, key)
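For reference, the agent expects INPUT_KEY to point at a JSON array of review objects with id and text fields, along these lines (the sample values are purely illustrative):
[
  {"id": "rev-001", "text": "The product exceeded my expectations!"},
  {"id": "rev-002", "text": "Shipping took far too long."}
]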
Step 2: Containerize the Agent
FROM python:3.9-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy agent code
COPY sentiment_agent.py .
# Run as non-root user
RUN useradd -m -u 1000 agent && chown -R agent:agent /app
USER agent
CMD ["python", "sentiment_agent.py"]
Build and push the container:
docker build -t your-registry/sentiment-agent:v1.0 .
docker push your-registry/sentiment-agent:v1.0
Step 3: Create the Kubernetes Job Manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: sentiment-analysis-job
  namespace: ai-agents
  labels:
    app: sentiment-agent
    version: v1.0
spec:
  # Number of parallel pods (note: with a fixed INPUT_KEY, all pods process
  # the same file; see the Indexed Jobs pattern later for splitting work)
  parallelism: 3
  # Total number of completions needed
  completions: 3
  # Retry policy
  backoffLimit: 4
  # Cleanup completed jobs after 1 hour
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: sentiment-agent
    spec:
      restartPolicy: Never
      containers:
      - name: sentiment-agent
        image: your-registry/sentiment-agent:v1.0
        env:
        - name: S3_BUCKET
          value: "customer-reviews"
        - name: INPUT_KEY
          value: "batch-2024-01/reviews.json"
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: access-key-id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: secret-access-key
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        volumeMounts:
        - name: model-cache
          mountPath: /home/agent/.cache
      volumes:
      - name: model-cache
        emptyDir:
          sizeLimit: 5Gi
Step 4: Deploy and Monitor the Job
# Create namespace
kubectl create namespace ai-agents
# Create AWS credentials secret
kubectl create secret generic aws-credentials \
  --from-literal=access-key-id=YOUR_ACCESS_KEY \
  --from-literal=secret-access-key=YOUR_SECRET_KEY \
  -n ai-agents
# Deploy the job
kubectl apply -f sentiment-job.yaml
# Monitor job status
kubectl get jobs -n ai-agents -w
# Check pod logs
kubectl logs -n ai-agents -l app=sentiment-agent --tail=100
# Get detailed job information
kubectl describe job sentiment-analysis-job -n ai-agents
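# Optionally block until the Job completes (or the timeout expires)
kubectl wait --for=condition=complete job/sentiment-analysis-job -n ai-agents --timeout=600s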
Implementing Scheduled AI Agents with CronJobs
For recurring AI tasks, CronJobs provide automated scheduling. Let’s create a model retraining agent that runs weekly.
Creating a Model Retraining CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-retraining-cronjob
  namespace: ai-agents
spec:
  # Run every Sunday at 2 AM
  schedule: "0 2 * * 0"
  # Timezone support (Kubernetes 1.25+)
  timeZone: "America/New_York"
  # Keep last 3 successful and 1 failed job
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  # Prevent concurrent runs
  concurrencyPolicy: Forbid
  # Deadline in seconds for starting a missed run
  startingDeadlineSeconds: 300
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 86400
      template:
        metadata:
          labels:
            app: model-retraining
            component: ai-agent
        spec:
          restartPolicy: OnFailure
          serviceAccountName: model-trainer
          containers:
          - name: retraining-agent
            image: your-registry/model-trainer:v2.0
            env:
            - name: TRAINING_DATA_PATH
              value: "s3://training-data/weekly/"
            - name: MODEL_REGISTRY
              value: "mlflow-server.ml-platform.svc.cluster.local"
            - name: EXPERIMENT_NAME
              value: "sentiment-model-weekly"
            resources:
              requests:
                memory: "8Gi"
                cpu: "4000m"
                nvidia.com/gpu: "1"
              limits:
                memory: "16Gi"
                cpu: "8000m"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: training-cache
              mountPath: /cache
            - name: model-output
              mountPath: /models
          volumes:
          - name: training-cache
            persistentVolumeClaim:
              claimName: training-cache-pvc
          - name: model-output
            persistentVolumeClaim:
              claimName: model-output-pvc
          nodeSelector:
            workload-type: gpu
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
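The manifest above assumes that a model-trainer ServiceAccount and the two PersistentVolumeClaims (training-cache-pvc and model-output-pvc) already exist in the ai-agents namespace. A minimal sketch of those prerequisites (the storage class and sizes are placeholders; adjust for your cluster):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: model-trainer
  namespace: ai-agents
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-cache-pvc
  namespace: ai-agents
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-output-pvc
  namespace: ai-agents
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 50Gi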
Managing CronJob Execution
# Deploy the CronJob
kubectl apply -f model-retraining-cronjob.yaml
# View CronJob status
kubectl get cronjobs -n ai-agents
# Manually trigger a CronJob
kubectl create job --from=cronjob/model-retraining-cronjob manual-run-001 -n ai-agents
# Suspend a CronJob temporarily
kubectl patch cronjob model-retraining-cronjob -n ai-agents -p '{"spec":{"suspend":true}}'
# Resume the CronJob
kubectl patch cronjob model-retraining-cronjob -n ai-agents -p '{"spec":{"suspend":false}}'
# View job history
kubectl get jobs -n ai-agents -l component=ai-agent --sort-by=.metadata.creationTimestamp
Advanced Patterns for AI Agent Orchestration
Parallel Processing with Indexed Jobs
For distributed AI workloads, use indexed Jobs (Kubernetes 1.24+):
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-inference-job
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: inference-worker
        image: your-registry/inference-agent:v1.0
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        - name: TOTAL_WORKERS
          value: "10"
        # Each worker is expected to process only its own slice of the input,
        # e.g. items where item_index % TOTAL_WORKERS == worker-id
        command:
        - python
        - inference.py
        - --worker-id=$(JOB_COMPLETION_INDEX)
        - --total-workers=$(TOTAL_WORKERS)
Using Init Containers for Model Downloads
spec:
  template:
    spec:
      initContainers:
      - name: model-downloader
        image: amazon/aws-cli:latest
        command:
        - sh
        - -c
        - |
          aws s3 cp s3://model-registry/sentiment-v2.0/ /models/ --recursive
        volumeMounts:
        - name: model-cache
          mountPath: /models
      containers:
      - name: ai-agent
        image: your-registry/sentiment-agent:v1.0
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        emptyDir: {}
Troubleshooting Common Issues
Job Failures and Debugging
# Check failed pod reasons
kubectl get pods -n ai-agents -l app=sentiment-agent --field-selector=status.phase=Failed
# Get detailed failure information
kubectl describe pod <pod-name> -n ai-agents
# View previous container logs if pod restarted
kubectl logs <pod-name> -n ai-agents --previous
# Debug with an ephemeral container (Kubernetes 1.23+)
kubectl debug -it <pod-name> -n ai-agents --image=busybox --target=sentiment-agent
Resource Exhaustion
If Jobs fail with OOMKilled pods or slow down due to CPU throttling:
# Check resource usage
kubectl top pods -n ai-agents -l app=sentiment-agent
# View resource events
kubectl get events -n ai-agents --sort-by='.lastTimestamp' | grep -i "oom\|cpu"
Adjust the resource requests and limits in your Job spec accordingly. Because a Job's pod template is immutable once created, delete the Job and re-apply it with the new values.
CronJob Not Triggering
# Check CronJob status and last schedule time
kubectl get cronjob model-retraining-cronjob -n ai-agents -o yaml | grep -A 5 status
# Verify schedule syntax
kubectl describe cronjob model-retraining-cronjob -n ai-agents | grep Schedule
# Check for suspended state
kubectl get cronjob model-retraining-cronjob -n ai-agents -o jsonpath='{.spec.suspend}'
Best Practices for Production AI Agents
1. Implement Proper Resource Limits
Always set resource requests and limits to prevent node exhaustion:
- Set requests based on minimum requirements
- Set limits at roughly 1.5-2x the requests to allow burst capacity
- Use GPU node selectors for GPU-intensive workloads
- Implement pod priority classes for critical jobs (a minimal sketch follows this list)
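For the last point, a PriorityClass is a cluster-scoped resource that Jobs reference by name from their pod template. A minimal sketch (the class name and value are arbitrary placeholders):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-agent-critical
value: 100000
globalDefault: false
description: "Used by business-critical AI agent Jobs"
Set priorityClassName: ai-agent-critical in the Job's pod spec to use it.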
2. Use Secrets Management
Never hardcode credentials. Use Kubernetes Secrets or external secret managers:
# Using External Secrets Operator
kubectl apply -f - <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ai-agent-secrets
  namespace: ai-agents
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: agent-credentials
  data:
  - secretKey: api-key
    remoteRef:
      key: ai-agent/api-key
EOF
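The ExternalSecret above refers to a SecretStore named aws-secrets-manager, which has to be defined separately. A minimal sketch for AWS Secrets Manager, assuming IRSA-style authentication through a dedicated service account (the region and service account name are placeholders):
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: ai-agents
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa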
3. Implement Health Checks and Timeouts
spec:
  activeDeadlineSeconds: 3600  # Kill job after 1 hour
  template:
    spec:
      containers:
      - name: ai-agent
        livenessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 30
          periodSeconds: 10
4. Enable Observability
Integrate with monitoring and logging systems:
spec:
  template:
    metadata:
      # Scrape annotations go on the pod template so that annotation-based
      # Prometheus discovery picks up the Job's pods
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: ai-agent
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://opentelemetry-collector:4317"
5. Implement Job Cleanup Policies
spec:
  ttlSecondsAfterFinished: 3600  # Clean up after 1 hour
  backoffLimit: 3                # Retry up to 3 times
Conclusion
Kubernetes Jobs and CronJobs provide a robust foundation for orchestrating AI agents at scale. By leveraging native Kubernetes features like resource management, retry policies, and scheduled execution, you can build reliable, production-grade AI systems without complex external orchestration tools.
Key takeaways:
- Use Jobs for one-time AI tasks and CronJobs for scheduled workloads
- Implement proper resource limits and GPU scheduling for ML workloads
- Leverage parallel execution for distributed AI processing
- Follow security best practices with secrets management
- Monitor and observe your AI agents with proper instrumentation
Start small with a simple Job, then scale to complex CronJob-based pipelines as your AI infrastructure matures. The patterns and examples provided here will help you build maintainable, scalable AI agent systems on Kubernetes.