Introduction: Building Enterprise-Grade LLM Infrastructure
The landscape of AI infrastructure is rapidly evolving, and organizations are increasingly looking to deploy Large Language Models (LLMs) in production environments. While cloud-based APIs offer convenience, they come with limitations: cost unpredictability, latency concerns, data privacy restrictions, and vendor lock-in.
Enter the combination of Ollama, Kubernetes, and Anthropic’s Model Context Protocol (MCP): a stack that lets you build production-ready, self-hosted LLM infrastructure with enterprise-grade reliability, security, and scalability.
This comprehensive guide will walk you through implementing this stack, sharing battle-tested best practices learned from real-world deployments. Whether you’re a DevOps engineer planning your first LLM deployment or an architect designing multi-tenant AI infrastructure, this guide provides the practical knowledge you need.
Understanding the Technology Stack
Ollama: Simplified LLM Operations
Ollama is an open-source platform that simplifies running large language models locally. It provides:
- Model Management: Easy installation, switching, and version control of LLMs
- Optimized Performance: Hardware-specific optimizations for CPU and GPU inference
- RESTful API: Clean HTTP interface for application integration
- Memory Efficiency: Smart model loading and unloading based on usage
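To make the RESTful API concrete, a text-generation request is a single JSON POST to the `/api/generate` endpoint. The sketch below only builds and prints the payload (the model name and address are examples); the commented line shows how you would actually send it:

```shell
# Sketch of an Ollama /api/generate request payload (model name is an example)
PAYLOAD='{
  "model": "llama3.1:8b",
  "prompt": "Summarize Kubernetes in one sentence.",
  "stream": false
}'
echo "$PAYLOAD"
# Send it to a running Ollama instance with, e.g.:
# curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
```

Setting `"stream": false` returns one JSON object instead of a stream of partial responses, which is simpler for scripted clients.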
Kubernetes: Enterprise Orchestration
Kubernetes brings production-ready capabilities to LLM workloads:
- Auto-scaling: Dynamic resource allocation based on demand
- High Availability: Multi-node deployment with automatic failover
- Resource Management: GPU scheduling and memory isolation
- Rolling Updates: Zero-downtime deployments and model updates
Anthropic MCP: Intelligent Context Management
The Model Context Protocol (MCP) enables:
- Standardized Communication: Universal protocol for AI model interactions
- Context Preservation: Maintaining conversation state across requests
- Tool Integration: Seamless connection to external systems and APIs
- Security: Built-in authentication and authorization mechanisms
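Under the hood, MCP is JSON-RPC 2.0 over a transport such as stdio. A `tools/call` request for an Ollama-backed tool has roughly the shape below (the tool name and arguments are illustrative):

```shell
# Illustrative MCP tools/call request as a JSON-RPC 2.0 message
MCP_REQUEST='{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "llama3.1:8b",
    "arguments": { "prompt": "Hello", "temperature": 0.7 }
  }
}'
echo "$MCP_REQUEST"
```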
Prerequisites and Infrastructure Planning
Hardware Requirements
Sizing depends on the models you plan to serve. As a rough guide, a quantized 7–8B model needs about 8 GB of RAM (or VRAM for GPU inference), a 13B model about 16 GB, and 70B-class models 40 GB or more. Budget fast NVMe storage for model weights (roughly 5 GB per quantized 7–8B model) plus headroom for caching.
Software Prerequisites
# Verify your environment
kubectl version --client
helm version
docker --version
nvidia-ctk --version   # NVIDIA Container Toolkit, for GPU support
# Check Kubernetes cluster resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
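The checks above can be wrapped in a small preflight script that fails fast when a tool is missing. The tool list is an example; adjust it for your environment:

```shell
#!/bin/sh
# Fail fast if a required CLI tool is not on PATH
require() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Adjust this list for your environment (e.g. add nvidia-ctk for GPU nodes)
for tool in kubectl helm docker; do
  require "$tool" || FAILED=1
done
[ -z "$FAILED" ] || echo "preflight failed" >&2
```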
Step 1: Setting Up the Kubernetes Namespace and RBAC
Let’s start with proper namespace isolation and security:
# ollama-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ai-infrastructure
  labels:
    name: ai-infrastructure
    security-level: standard
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ollama-service-account
  namespace: ai-infrastructure
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ai-infrastructure
  name: ollama-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ollama-rolebinding
  namespace: ai-infrastructure
subjects:
- kind: ServiceAccount
  name: ollama-service-account
  namespace: ai-infrastructure
roleRef:
  kind: Role
  name: ollama-role
  apiGroup: rbac.authorization.k8s.io
Step 2: Configuring Persistent Storage with Performance Optimization
# ollama-storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-nvme
provisioner: kubernetes.io/no-provisioner
parameters:
  type: nvme-ssd
  replication-type: none
  fsType: ext4
mountOptions:
- noatime
- nodiratime
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-infrastructure
  labels:
    app: ollama
    component: storage
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-nvme
  resources:
    requests:
      storage: 500Gi  # Adjust based on the models you plan to run
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-cache-pvc
  namespace: ai-infrastructure
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-nvme
  resources:
    requests:
      storage: 100Gi  # For temporary files and caching
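Because fast-nvme uses the kubernetes.io/no-provisioner provisioner with WaitForFirstConsumer, nothing is provisioned automatically: you must pre-create a local PersistentVolume on each storage node before the PVCs above can bind. A sketch, with the path and node name as placeholders:

```yaml
# local-pv.yaml -- example only; adjust capacity, path, and node name
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ollama-models-pv-node1
spec:
  capacity:
    storage: 500Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fast-nvme
  local:
    path: /mnt/nvme/ollama  # Pre-formatted NVMe mount on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - gpu-node-1  # Replace with your node's hostname
```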
Step 3: Ollama Deployment with GPU Support and Resource Management
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-infrastructure
  labels:
    app: ollama
    version: v1.0.0
spec:
  replicas: 2  # Start with 2 for HA. Note: the ReadWriteOnce PVCs below can only
               # be shared by replicas on the same node; for true multi-node HA,
               # use a StatefulSet with volumeClaimTemplates or RWX storage.
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Ensure zero downtime
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
        version: v1.0.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: ollama-service-account
      nodeSelector:
        kubernetes.io/arch: amd64
        node-type: gpu-enabled  # Use GPU-enabled nodes
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: ollama
        image: ollama/ollama:latest  # Pin a specific tag in production
        ports:
        - containerPort: 11434
          name: api
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: OLLAMA_ORIGINS
          value: "*"
        - name: OLLAMA_NUM_PARALLEL
          value: "4"  # Adjust based on your hardware
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "3"  # Memory management
        - name: OLLAMA_FLASH_ATTENTION
          value: "1"  # Enable optimizations
        - name: CUDA_VISIBLE_DEVICES
          value: "0"  # GPU assignment
        resources:
          requests:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: 1
          limits:
            memory: "32Gi"
            cpu: "16"
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
        - name: ollama-cache
          mountPath: /tmp/ollama
        - name: config
          mountPath: /etc/ollama
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 2
        startupProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30  # Allow up to 5 minutes for startup
      - name: model-downloader
        image: curlimages/curl:latest
        command: ["/bin/sh"]
        args:
        - -c
        - |
          # Wait for the Ollama container in this pod to come up
          while ! curl -sf http://localhost:11434/api/tags; do
            echo "Waiting for Ollama to be ready..."
            sleep 5
          done
          # Download essential models in the background
          curl -X POST http://localhost:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "llama3.1:8b"}' &
          curl -X POST http://localhost:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "codellama:7b"}' &
          # Keep container running
          sleep infinity
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc
      - name: ollama-cache
        persistentVolumeClaim:
          claimName: ollama-cache-pvc
      - name: config
        configMap:
          name: ollama-config
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ollama
              topologyKey: kubernetes.io/hostname
Step 4: Service Configuration with Load Balancing
# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-infrastructure
  labels:
    app: ollama
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
spec:
  selector:
    app: ollama
  ports:
  - name: api
    port: 11434
    targetPort: 11434
    protocol: TCP
  - name: metrics
    port: 8080
    targetPort: 8080
    protocol: TCP
  type: LoadBalancer
  sessionAffinity: ClientIP  # For MCP context preservation
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-headless
  namespace: ai-infrastructure
  labels:
    app: ollama
spec:
  clusterIP: None
  selector:
    app: ollama
  ports:
  - name: api
    port: 11434
    targetPort: 11434
Step 5: Implementing Horizontal Pod Autoscaler
# ollama-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai-infrastructure
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
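To keep the HPA's minimum of two replicas meaningful during node drains and cluster upgrades, it is worth pairing it with a PodDisruptionBudget. A minimal sketch:

```yaml
# ollama-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb
  namespace: ai-infrastructure
spec:
  minAvailable: 1  # Never voluntarily evict the last healthy replica
  selector:
    matchLabels:
      app: ollama
```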
Step 6: Anthropic MCP Integration and Configuration
# mcp-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-config
  namespace: ai-infrastructure
data:
  mcp-server.json: |
    {
      "mcpServers": {
        "ollama-cluster": {
          "command": "node",
          "args": ["/app/mcp-server.js"],
          "env": {
            "OLLAMA_HOST": "ollama-service.ai-infrastructure.svc.cluster.local:11434",
            "MCP_LOG_LEVEL": "info",
            "CONTEXT_WINDOW_SIZE": "8192",
            "MAX_CONCURRENT_REQUESTS": "10"
          }
        }
      }
    }
  mcp-server.js: |
    // Bridges MCP tool calls to the Ollama cluster API.
    // Requires @modelcontextprotocol/sdk and axios to be installed in the image.
    const { Server } = require('@modelcontextprotocol/sdk/server/index.js');
    const { StdioServerTransport } = require('@modelcontextprotocol/sdk/server/stdio.js');
    const {
      ListToolsRequestSchema,
      CallToolRequestSchema
    } = require('@modelcontextprotocol/sdk/types.js');
    const axios = require('axios');

    const ollamaUrl = process.env.OLLAMA_HOST || 'localhost:11434';
    const contextCache = new Map();

    const server = new Server(
      { name: 'ollama-cluster', version: '1.0.0' },
      { capabilities: { tools: {} } }
    );

    // Expose every installed model as an MCP tool
    server.setRequestHandler(ListToolsRequestSchema, async () => {
      const response = await axios.get(`http://${ollamaUrl}/api/tags`);
      return {
        tools: response.data.models.map((model) => ({
          name: model.name,
          description: `LLM: ${model.name}`,
          inputSchema: {
            type: 'object',
            properties: {
              prompt: { type: 'string' },
              temperature: { type: 'number', default: 0.7 }
            },
            required: ['prompt']
          }
        }))
      };
    });

    // Run inference, preserving Ollama's context tokens for continuity
    server.setRequestHandler(CallToolRequestSchema, async (request) => {
      const { name, arguments: args } = request.params;
      const contextKey = request.params._meta?.conversationId || 'default';
      const payload = {
        model: name,
        prompt: args.prompt,
        stream: false,
        context: contextCache.get(contextKey),
        options: {
          temperature: args.temperature || 0.7,
          num_ctx: parseInt(process.env.CONTEXT_WINDOW_SIZE, 10) || 8192
        }
      };
      try {
        const response = await axios.post(`http://${ollamaUrl}/api/generate`, payload);
        // Update the per-conversation context cache
        if (response.data.context) {
          contextCache.set(contextKey, response.data.context);
        }
        return { content: [{ type: 'text', text: response.data.response }] };
      } catch (error) {
        throw new Error(`Ollama request failed: ${error.message}`);
      }
    });

    server.connect(new StdioServerTransport());
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  namespace: ai-infrastructure
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp-server
        # Illustrative: a production image should bundle node_modules
        # (npm install @modelcontextprotocol/sdk axios) instead of bare node
        image: node:18-alpine
        command: ["node", "/app/mcp-server.js"]
        ports:
        - containerPort: 3000
        env:
        - name: OLLAMA_HOST
          value: "ollama-service.ai-infrastructure.svc.cluster.local:11434"
        - name: MCP_LOG_LEVEL
          value: "info"
        volumeMounts:
        - name: mcp-config
          mountPath: /app
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: mcp-config
        configMap:
          name: mcp-config
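A quick way to sanity-check a stdio-based MCP server is to pipe it an initialize request, which is the first message every MCP client sends. The block below only prints the message shape (the protocol version and client name are examples); the commented line shows how you would pipe it to the server:

```shell
# JSON-RPC initialize message that an MCP client sends first
INIT='{
  "jsonrpc": "2.0",
  "id": 0,
  "method": "initialize",
  "params": {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": { "name": "smoke-test", "version": "0.0.1" }
  }
}'
echo "$INIT"
# Pipe it into the server to verify it responds, e.g.:
# echo "$INIT" | node /app/mcp-server.js
```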
Step 7: Monitoring and Observability Setup
# monitoring.yaml
apiVersion: monitoring.coreos.com/v1  # Requires the Prometheus Operator
kind: ServiceMonitor
metadata:
  name: ollama-metrics
  namespace: ai-infrastructure
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard
  namespace: ai-infrastructure
data:
  # Note: Ollama does not expose Prometheus metrics natively; the ollama_* and
  # mcp_* series below assume an exporter sidecar that emits them.
  ollama-dashboard.json: |
    {
      "dashboard": {
        "title": "Ollama LLM Infrastructure",
        "panels": [
          {
            "title": "Model Inference Rate",
            "type": "graph",
            "targets": [
              { "expr": "rate(ollama_requests_total[5m])", "legendFormat": "{{model}}" }
            ]
          },
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              { "expr": "nvidia_gpu_utilization_gpu", "legendFormat": "GPU {{gpu}}" }
            ]
          },
          {
            "title": "Model Load Time",
            "type": "graph",
            "targets": [
              { "expr": "ollama_model_load_duration_seconds", "legendFormat": "{{model}}" }
            ]
          },
          {
            "title": "Context Cache Hit Rate",
            "type": "stat",
            "targets": [
              { "expr": "rate(mcp_context_hits_total[5m]) / rate(mcp_context_requests_total[5m])", "legendFormat": "Cache Hit Rate" }
            ]
          }
        ]
      }
    }
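The same metrics that feed the dashboard can drive alerting. A sketch of a PrometheusRule follows; the metric name mirrors the dashboard above and, like it, assumes an exporter that actually emits that series, and the threshold is an example:

```yaml
# ollama-alerts.yaml -- requires the Prometheus Operator CRDs
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ollama-alerts
  namespace: ai-infrastructure
spec:
  groups:
  - name: ollama
    rules:
    - alert: OllamaHighErrorRate
      expr: rate(ollama_requests_total{status=~"5.."}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Ollama is returning errors at more than 0.1 req/s"
```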
Production Best Practices
Security Hardening
# security-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-network-policy
  namespace: ai-infrastructure
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: application-tier
    - podSelector:
        matchLabels:
          app: mcp-server
    ports:
    - protocol: TCP
      port: 11434
  egress:
  - to: []
    ports:
    - protocol: TCP
      port: 443  # HTTPS only
    - protocol: TCP
      port: 53  # DNS
    - protocol: UDP
      port: 53  # DNS
Resource Quotas and Limits
# resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-infrastructure-quota
  namespace: ai-infrastructure
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "500Gi"
    requests.nvidia.com/gpu: "20"
    limits.cpu: "200"
    limits.memory: "1000Gi"
    limits.nvidia.com/gpu: "20"
    persistentvolumeclaims: "10"
    requests.storage: "5Ti"
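A ResourceQuota caps the namespace total but does not set per-container defaults; a LimitRange fills that gap so pods deployed without explicit requests still get sane values. The numbers below are examples:

```yaml
# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ai-infrastructure-defaults
  namespace: ai-infrastructure
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "500m"
      memory: "1Gi"
    default:
      cpu: "2"
      memory: "4Gi"
```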
Backup and Disaster Recovery
#!/bin/bash
# backup-models.sh - Model backup automation
# Assumes the AWS CLI is available inside the ollama container; in practice you
# may prefer a CronJob with a dedicated backup image.
NAMESPACE="ai-infrastructure"
BACKUP_BUCKET="s3://your-backup-bucket/ollama-models"
DATE=$(date +%Y%m%d-%H%M%S)
# Create model snapshot
kubectl exec -n "$NAMESPACE" deployment/ollama -c ollama -- \
  tar czf "/tmp/models-backup-$DATE.tar.gz" /root/.ollama/models
# Upload to cloud storage
kubectl exec -n "$NAMESPACE" deployment/ollama -c ollama -- \
  aws s3 cp "/tmp/models-backup-$DATE.tar.gz" "$BACKUP_BUCKET/"
# Clean up the local archive
kubectl exec -n "$NAMESPACE" deployment/ollama -c ollama -- \
  rm "/tmp/models-backup-$DATE.tar.gz"
echo "Backup completed: models-backup-$DATE.tar.gz"
Performance Optimization Strategies
1. Model Preloading and Warm-up
# model-warmup.sh -- keep_alive pins the model in memory after the request
curl -X POST http://ollama-service:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Hello",
    "keep_alive": "24h"
  }'
2. Load Balancing Configuration
# Advanced load balancing with session affinity (fragment; merge with the
# Service definition from Step 4)
apiVersion: v1
kind: Service
metadata:
  name: ollama-sticky-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
3. GPU Memory Management
env:
- name: CUDA_MPS_PIPE_DIRECTORY
  value: "/tmp/nvidia-mps"
- name: CUDA_MPS_LOG_DIRECTORY
  value: "/tmp/nvidia-log"
- name: OLLAMA_GPU_MEMORY_FRACTION
  value: "0.8"  # Use 80% of GPU memory
Troubleshooting Common Issues
Model Loading Failures
# Debug model loading
kubectl logs -n ai-infrastructure deployment/ollama -c ollama --tail=100
# Check GPU availability
kubectl exec -n ai-infrastructure deployment/ollama -c ollama -- nvidia-smi
# Verify model integrity
kubectl exec -n ai-infrastructure deployment/ollama -c ollama -- ollama list
Performance Bottlenecks
# Monitor resource usage
kubectl top pods -n ai-infrastructure
# Check network latency
kubectl exec -n ai-infrastructure deployment/ollama -c ollama -- \
  curl -w "@curl-format.txt" -o /dev/null -s http://ollama-service:11434/api/tags
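The -w "@curl-format.txt" flag above reads a timing template from a file that has to exist first. A typical template, built from curl's standard write-out variables:

```shell
# Create the timing template referenced by the latency check above
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s
time_connect:       %{time_connect}s
time_starttransfer: %{time_starttransfer}s
time_total:         %{time_total}s
EOF
cat curl-format.txt
```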
Scaling and Cost Optimization
Cluster Autoscaling Configuration
# cluster-autoscaler.yaml (fragment; selector and labels added for validity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        # Use a tag matching your cluster's Kubernetes minor version
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.21.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ai-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
Cost Monitoring Dashboard
# Install cost monitoring (Kubecost) via Helm
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace
# Access cost dashboard
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090:9090
Conclusion: Building Sustainable LLM Infrastructure
Deploying Ollama on Kubernetes with Anthropic MCP integration provides a powerful foundation for enterprise LLM workloads. This architecture offers several key advantages:
- Cost Predictability: Fixed infrastructure costs vs. per-token pricing
- Data Privacy: Complete control over sensitive information
- Performance Optimization: Hardware-specific tuning and caching
- Scalability: Dynamic resource allocation based on demand
- Vendor Independence: Flexibility to switch models and providers
Key Takeaways for DevOps Teams
- Start Small, Scale Smart: Begin with a minimal deployment and scale based on actual usage patterns
- Monitor Everything: Implement comprehensive observability from day one
- Security First: Apply defense-in-depth principles throughout your infrastructure
- Automate Operations: Use GitOps for configuration management and automated deployments
- Plan for Growth: Design your architecture to handle 10x scale from the beginning
Next Steps
- Implement CI/CD pipelines for model updates
- Set up federated learning across multiple clusters
- Explore fine-tuning workflows for domain-specific models
- Integrate with MLOps platforms for model lifecycle management
Ready to transform your organization’s AI capabilities? Start with a proof-of-concept deployment using the configurations in this guide, then gradually expand based on your specific requirements and learning.