As Large Language Models (LLMs) become integral to production systems, evaluating their performance, accuracy, and reliability at scale has become a critical challenge. Kubernetes provides the perfect orchestration platform for building robust, scalable LLM evaluation pipelines that can handle continuous testing, benchmarking, and quality assurance workflows.
In this comprehensive guide, we’ll walk through designing and implementing production-grade LLM evaluation pipelines on Kubernetes, complete with practical examples, YAML configurations, and battle-tested best practices.
Why Kubernetes for LLM Evaluation Pipelines?
Before diving into implementation details, let’s understand why Kubernetes is the ideal platform for LLM evaluation workloads:
- Resource Management: LLM evaluations require significant compute resources (GPUs/CPUs) that Kubernetes can efficiently schedule and manage
- Scalability: Run hundreds of evaluation jobs in parallel across your cluster
- Reproducibility: Container-based workflows ensure consistent evaluation environments
- Cost Optimization: Dynamic resource allocation and autoscaling reduce infrastructure costs
- Integration: Seamlessly integrate with MLOps tools like MLflow, Kubeflow, and Argo Workflows
Architecture Overview
A comprehensive LLM evaluation pipeline on Kubernetes typically consists of these components:
- Evaluation Jobs: Kubernetes Jobs or CronJobs that execute evaluation scripts
- Model Registry: Storage for model artifacts (S3, GCS, or in-cluster storage)
- Dataset Management: PersistentVolumes for test datasets and benchmarks
- Metrics Collection: Prometheus for monitoring and custom metrics
- Results Storage: Database or object storage for evaluation results
- Orchestration: Argo Workflows or Kubeflow Pipelines for complex workflows
Setting Up the Foundation
Creating a Dedicated Namespace
Start by creating a dedicated namespace for your LLM evaluation workloads:
kubectl create namespace llm-evaluation
kubectl config set-context --current --namespace=llm-evaluation
Configuring Resource Quotas
Define resource quotas to prevent evaluation jobs from consuming all cluster resources:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: llm-eval-quota
  namespace: llm-evaluation
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 500Gi
    requests.nvidia.com/gpu: "8"
    limits.cpu: "200"
    limits.memory: 1000Gi
    persistentvolumeclaims: "10"
Apply the quota:
kubectl apply -f resource-quota.yaml
Building the Evaluation Container
Create a Docker image with your evaluation framework. Here’s an example using Python with popular evaluation libraries:
FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy evaluation scripts
COPY evaluate.py .
COPY metrics/ ./metrics/
COPY datasets/ ./datasets/

ENTRYPOINT ["python", "evaluate.py"]
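The Dockerfile assumes a `requirements.txt` alongside it. A plausible minimal one for the evaluation script used in this guide; the version pins are illustrative, not prescriptive:

```text
torch==2.1.2
transformers==4.36.2
datasets==2.16.1
rouge-score==0.1.2
bert-score==0.3.13
prometheus-client==0.19.0
```

Pinning exact versions matters more than usual here: evaluation numbers can shift between library releases, which undermines the reproducibility that containers are supposed to provide.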
Sample Evaluation Script
Here’s a Python script that evaluates an LLM using common metrics:
import os
import json
import logging

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from rouge_score import rouge_scorer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class LLMEvaluator:
    def __init__(self, model_name, dataset_name):
        self.model_name = model_name
        self.dataset_name = dataset_name
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        logger.info(f"Loading model: {model_name}")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32
        ).to(self.device)

    def evaluate(self, num_samples=100):
        dataset = load_dataset(self.dataset_name, split=f"test[:{num_samples}]")
        rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])

        results = {"rouge_scores": []}

        for example in dataset:
            # Assumes the dataset exposes 'input'/'output' columns;
            # adapt the field names to your benchmark's schema
            prompt = example['input']
            reference = example['output']

            # Generate prediction
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
            outputs = self.model.generate(**inputs, max_new_tokens=512)
            prediction = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Calculate ROUGE scores
            rouge_result = rouge.score(reference, prediction)
            results['rouge_scores'].append(rouge_result)

        # Calculate average metrics
        return self._aggregate_results(results)

    def _aggregate_results(self, results):
        # Average the F1 component of each ROUGE variant across samples
        scores = results['rouge_scores']
        return {
            metric: sum(s[metric].fmeasure for s in scores) / len(scores)
            for metric in ('rouge1', 'rouge2', 'rougeL')
        }


if __name__ == "__main__":
    model_name = os.getenv("MODEL_NAME", "gpt2")
    dataset_name = os.getenv("DATASET_NAME", "squad")

    evaluator = LLMEvaluator(model_name, dataset_name)
    results = evaluator.evaluate()

    # Save results
    os.makedirs("/results", exist_ok=True)
    with open("/results/evaluation_results.json", "w") as f:
        json.dump(results, f, indent=2)

    logger.info("Evaluation complete")
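The aggregation step above reduces a list of per-sample `rouge_scorer` results to mean F1 scores. A standalone sketch of that reduction, using plain namedtuples in place of `rouge_scorer`'s `Score` objects so it runs without the library:

```python
# Standalone sketch of the aggregation step: average the F1 (fmeasure)
# component of each ROUGE variant across samples. The Score namedtuple
# mirrors rouge_scorer's (precision, recall, fmeasure) result type.
from collections import namedtuple

Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

def aggregate_rouge(rouge_scores, metrics=("rouge1", "rouge2", "rougeL")):
    """Reduce per-sample score dicts to mean F1 per metric."""
    return {
        m: sum(s[m].fmeasure for s in rouge_scores) / len(rouge_scores)
        for m in metrics
    }

samples = [
    {"rouge1": Score(0.9, 0.8, 0.85), "rouge2": Score(0.7, 0.6, 0.65),
     "rougeL": Score(0.8, 0.7, 0.75)},
    {"rouge1": Score(0.5, 0.4, 0.45), "rouge2": Score(0.3, 0.2, 0.25),
     "rougeL": Score(0.4, 0.3, 0.35)},
]
print(aggregate_rouge(samples))  # rouge1 mean ≈ 0.65
```

Averaging F1 alone is a simplification; for published benchmarks you would usually also report precision/recall and a confidence interval over samples.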
Deploying Evaluation Jobs
Single Evaluation Job
Create a Kubernetes Job for one-time evaluation:
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-eval-job
  namespace: llm-evaluation
spec:
  template:
    metadata:
      labels:
        app: llm-evaluator
    spec:
      restartPolicy: Never
      containers:
        - name: evaluator
          image: your-registry/llm-evaluator:latest
          env:
            - name: MODEL_NAME
              value: "meta-llama/Llama-2-7b-hf"
            - name: DATASET_NAME
              value: "truthful_qa"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
            limits:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: results
              mountPath: /results
            - name: cache
              mountPath: /root/.cache
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: eval-results-pvc
        - name: cache
          emptyDir: {}
  backoffLimit: 3
Deploy the job:
kubectl apply -f evaluation-job.yaml
kubectl logs -f job/llm-eval-job
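Beyond tailing logs, automation (a CI step, a results collector) usually needs to decide programmatically whether a Job finished. A small helper sketch that classifies a Job from its `status` block, as returned by `kubectl get job llm-eval-job -o json` or the Kubernetes API; the field names follow the Job status schema:

```python
# Classify a Kubernetes Job from its status dict. Per the Job API,
# a terminal Job carries a condition of type "Complete" or "Failed"
# with status "True".
def job_state(status):
    """Return 'succeeded', 'failed', or 'running' for a Job status dict."""
    for cond in status.get("conditions", []):
        if cond.get("type") == "Complete" and cond.get("status") == "True":
            return "succeeded"
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return "failed"
    return "running"

print(job_state({"succeeded": 1,
                 "conditions": [{"type": "Complete", "status": "True"}]}))
```

Polling this in a loop (with a timeout) is simpler and more robust than parsing log output for a completion message.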
Scheduled Evaluation with CronJobs
For continuous evaluation, use CronJobs to run evaluations on a schedule:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: llm-eval-cronjob
  namespace: llm-evaluation
spec:
  schedule: "0 2 * * *"  # Run daily at 2 AM
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: llm-evaluator
        spec:
          restartPolicy: OnFailure
          containers:
            - name: evaluator
              image: your-registry/llm-evaluator:latest
              # Note: $(...) in env values is not shell-expanded by
              # Kubernetes, so derive the run date inside the evaluation
              # script (e.g. datetime.date.today()) rather than here.
              env:
                - name: MODEL_NAME
                  value: "meta-llama/Llama-2-7b-hf"
              resources:
                requests:
                  memory: "16Gi"
                  cpu: "4"
                  nvidia.com/gpu: "1"
                limits:
                  memory: "32Gi"
                  cpu: "8"
                  nvidia.com/gpu: "1"
              volumeMounts:
                - name: results
                  mountPath: /results
          volumes:
            - name: results
              persistentVolumeClaim:
                claimName: eval-results-pvc
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
Implementing Parallel Evaluation with Argo Workflows
For complex evaluation pipelines with multiple models and datasets, Argo Workflows provides powerful orchestration capabilities:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: llm-eval-parallel-
  namespace: llm-evaluation
spec:
  entrypoint: evaluate-models
  volumes:
    - name: results
      persistentVolumeClaim:
        claimName: eval-results-pvc
  arguments:
    parameters:
      - name: models
        value: '["gpt2", "meta-llama/Llama-2-7b-hf", "mistralai/Mistral-7B-v0.1"]'
      - name: datasets
        value: '["truthful_qa", "mmlu", "hellaswag"]'
  templates:
    - name: evaluate-models
      steps:
        # Expand the model x dataset matrix, fan out, then aggregate.
        # withParam expects a JSON list, so a small script step builds
        # the cross product of the two parameter lists.
        - - name: generate-matrix
            template: build-matrix
        - - name: evaluate
            template: run-evaluation
            arguments:
              parameters:
                - name: model
                  value: "{{item.model}}"
                - name: dataset
                  value: "{{item.dataset}}"
            withParam: "{{steps.generate-matrix.outputs.result}}"
        - - name: aggregate-results
            template: aggregate

    - name: build-matrix
      script:
        image: python:3.11-slim
        command: [python]
        source: |
          import itertools, json
          models = json.loads('{{workflow.parameters.models}}')
          datasets = json.loads('{{workflow.parameters.datasets}}')
          print(json.dumps([{"model": m, "dataset": d}
                            for m, d in itertools.product(models, datasets)]))

    - name: run-evaluation
      inputs:
        parameters:
          - name: model
          - name: dataset
      container:
        image: your-registry/llm-evaluator:latest
        env:
          - name: MODEL_NAME
            value: "{{inputs.parameters.model}}"
          - name: DATASET_NAME
            value: "{{inputs.parameters.dataset}}"
        resources:
          requests:
            memory: "16Gi"
            cpu: "4"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: "1"

    - name: aggregate
      container:
        image: your-registry/results-aggregator:latest
        command: ["python", "aggregate.py"]
        volumeMounts:
          - name: results
            mountPath: /results
Submit the workflow:
argo submit llm-eval-workflow.yaml -n llm-evaluation
argo watch llm-eval-parallel-xxxxx -n llm-evaluation
Monitoring and Observability
Custom Metrics with Prometheus
Export evaluation metrics to Prometheus for monitoring:
from prometheus_client import CollectorRegistry, Histogram, Gauge, push_to_gateway

# Use a dedicated registry so only these metrics are pushed
registry = CollectorRegistry()

evaluation_duration = Histogram(
    'llm_evaluation_duration_seconds',
    'Time spent on evaluation',
    ['model_name', 'dataset'],
    registry=registry
)

evaluation_score = Gauge(
    'llm_evaluation_score',
    'Evaluation score',
    ['model_name', 'dataset', 'metric'],
    registry=registry
)

def push_metrics(model_name, dataset, metrics):
    for metric_name, value in metrics.items():
        evaluation_score.labels(
            model_name=model_name,
            dataset=dataset,
            metric=metric_name
        ).set(value)

    push_to_gateway(
        'pushgateway.monitoring.svc.cluster.local:9091',
        job='llm-evaluation',
        registry=registry
    )
ServiceMonitor Configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-evaluation-metrics
  namespace: llm-evaluation
spec:
  selector:
    matchLabels:
      app: llm-evaluator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
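A ServiceMonitor discovers Services, not pods directly, so the evaluator pods also need a labeled Service exposing a named `metrics` port. A sketch of such a Service; port 8000 is an assumption (the default of `prometheus_client.start_http_server`), adjust to whatever port your exporter binds:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-evaluator-metrics
  namespace: llm-evaluation
  labels:
    app: llm-evaluator   # matched by the ServiceMonitor's selector
spec:
  selector:
    app: llm-evaluator   # matches the evaluation pods' label
  ports:
    - name: metrics      # must match the ServiceMonitor endpoint port name
      port: 8000
      targetPort: 8000
```

For short-lived batch Jobs, the Pushgateway approach shown earlier is often more reliable than scraping, since pods may terminate between scrape intervals.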
Best Practices and Troubleshooting
Resource Optimization
- Use Node Affinity: Schedule GPU-intensive evaluations on GPU nodes
- Implement Resource Requests: Always specify resource requests and limits
- Enable Horizontal Pod Autoscaling: Scale evaluation workers based on queue depth
- Use Spot Instances: Leverage preemptible nodes for cost savings on non-critical evaluations
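The node-affinity point above can be expressed as a pod-spec fragment. A sketch assuming GPU nodes carry a `node-type: gpu` label; the label key and value are conventions you would adapt to your cluster:

```yaml
# Pod spec fragment; node-type=gpu is an assumed label convention
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-type
              operator: In
              values: ["gpu"]
```

Pairing this with taints on the GPU nodes (and matching tolerations on evaluation pods) keeps CPU-only workloads from landing on expensive GPU capacity.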
Common Issues and Solutions
Problem: OOMKilled errors during evaluation
Solution: Increase memory limits or implement batch processing with smaller chunks:
def evaluate_in_batches(dataset, batch_size=10):
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i+batch_size]
        results = evaluate_batch(batch)
        save_intermediate_results(results)
        torch.cuda.empty_cache()  # Clear GPU memory
Problem: Model download timeouts
Solution: Use init containers to pre-download models:
initContainers:
  - name: model-downloader
    image: your-registry/model-downloader:latest
    env:
      - name: MODEL_NAME
        value: "meta-llama/Llama-2-7b-hf"
    volumeMounts:
      - name: model-cache
        mountPath: /models
Problem: Inconsistent evaluation results
Solution: Set random seeds and use deterministic algorithms:
import torch
import random
import numpy as np

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
Security Best Practices
- Use Secrets: Store API keys and tokens in Kubernetes Secrets
- Network Policies: Restrict network access for evaluation pods
- RBAC: Implement role-based access control for evaluation namespaces
- Image Scanning: Scan container images for vulnerabilities before deployment
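The network-policy point can be sketched as an egress policy that locks evaluation pods down to DNS and HTTPS (the latter needed for model and dataset downloads). The ports and selector are assumptions to adapt; note that restricting egress only takes effect once a policy selects the pods:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llm-eval-egress
  namespace: llm-evaluation
spec:
  podSelector:
    matchLabels:
      app: llm-evaluator
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: UDP
          port: 53    # DNS
        - protocol: TCP
          port: 443   # HTTPS (model registry, Hugging Face Hub)
```

If results are written to an in-cluster database or the Pushgateway, add an explicit egress rule for those Services as well.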
Conclusion
Building LLM evaluation pipelines on Kubernetes provides a robust, scalable foundation for continuous model assessment. By leveraging Kubernetes’ orchestration capabilities, resource management, and ecosystem tools like Argo Workflows, you can create production-grade evaluation systems that ensure your LLMs maintain high quality and performance standards.
The examples and configurations provided in this guide offer a starting point for implementing your own evaluation pipelines. As your requirements grow, you can extend these patterns with additional features like A/B testing, automated model promotion, and integration with CI/CD pipelines.
Start small with single evaluation jobs, then gradually scale to complex parallel workflows as your evaluation needs evolve. The flexibility of Kubernetes ensures your evaluation infrastructure can grow alongside your AI/ML operations.