Join our Discord Server
Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Production-Ready LLM Infrastructure: Deploying Ollama on Kubernetes with Anthropic MCP Best Practices


Introduction: Building Enterprise-Grade LLM Infrastructure

The landscape of AI infrastructure is rapidly evolving, and organizations are increasingly looking to deploy Large Language Models (LLMs) in production environments. While cloud-based APIs offer convenience, they come with limitations: cost unpredictability, latency concerns, data privacy restrictions, and vendor lock-in.

Enter the powerful combination of Ollama, Kubernetes, and Anthropic’s Model Context Protocol (MCP) – a stack that enables you to build production-ready, self-hosted LLM infrastructure with enterprise-grade reliability, security, and scalability.

This comprehensive guide will walk you through implementing this stack, sharing battle-tested best practices learned from real-world deployments. Whether you’re a DevOps engineer planning your first LLM deployment or an architect designing multi-tenant AI infrastructure, this guide provides the practical knowledge you need.

Understanding the Technology Stack

Ollama: Simplified LLM Operations

Ollama is an open-source platform that simplifies running large language models locally. It provides:

  • Model Management: Easy installation, switching, and version control of LLMs
  • Optimized Performance: Hardware-specific optimizations for CPU and GPU inference
  • RESTful API: Clean HTTP interface for application integration
  • Memory Efficiency: Smart model loading and unloading based on usage
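The "smart loading and unloading" point above boils down to an LRU policy over a fixed budget of resident models. As a minimal sketch (the class and names here are illustrative, not Ollama's actual implementation), this mirrors what a setting like `OLLAMA_MAX_LOADED_MODELS` controls:

```python
from collections import OrderedDict

class ModelCache:
    """Toy LRU cache illustrating how a server might keep only the most
    recently used models resident, similar to OLLAMA_MAX_LOADED_MODELS."""
    def __init__(self, max_loaded=3):
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()  # model name -> placeholder for weights

    def request(self, model):
        if model in self.loaded:
            self.loaded.move_to_end(model)       # mark as most recently used
        else:
            if len(self.loaded) >= self.max_loaded:
                self.loaded.popitem(last=False)  # unload least recently used
            self.loaded[model] = object()
        return list(self.loaded)

cache = ModelCache(max_loaded=2)
cache.request("llama3.1:8b")
cache.request("codellama:7b")
cache.request("llama3.1:8b")          # refresh llama3.1
print(cache.request("mistral:7b"))    # evicts codellama:7b
```

Requesting a third model under a two-model budget evicts whichever model was used least recently, keeping memory bounded without manual unloads.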

Kubernetes: Enterprise Orchestration

Kubernetes brings production-ready capabilities to LLM workloads:

  • Auto-scaling: Dynamic resource allocation based on demand
  • High Availability: Multi-node deployment with automatic failover
  • Resource Management: GPU scheduling and memory isolation
  • Rolling Updates: Zero-downtime deployments and model updates

Anthropic MCP: Intelligent Context Management

The Model Context Protocol (MCP) enables:

  • Standardized Communication: Universal protocol for AI model interactions
  • Context Preservation: Maintaining conversation state across requests
  • Tool Integration: Seamless connection to external systems and APIs
  • Security: Built-in authentication and authorization mechanisms
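Context preservation, concretely, means keying the model server's opaque context blob by conversation so the next turn resumes instead of starting cold. A minimal sketch (class and method names are hypothetical, not part of the MCP spec):

```python
class ContextStore:
    """Sketch of MCP-style context preservation: each conversation id maps to
    the opaque context blob the model server returned for its last turn."""
    def __init__(self):
        self._contexts = {}

    def build_payload(self, conversation_id, model, prompt):
        # Attach any saved context so the model resumes the conversation
        return {
            "model": model,
            "prompt": prompt,
            "context": self._contexts.get(conversation_id, ""),
        }

    def record(self, conversation_id, returned_context):
        # Save the context blob from the server's response for the next turn
        self._contexts[conversation_id] = returned_context

store = ContextStore()
first = store.build_payload("conv-42", "llama3.1:8b", "Hi")
store.record("conv-42", "[state after turn 1]")
second = store.build_payload("conv-42", "llama3.1:8b", "And then?")
print(first["context"])    # '' -- a fresh conversation starts empty
print(second["context"])   # '[state after turn 1]'
```

The MCP server implementation later in this guide uses the same pattern: a cache keyed by conversation id, updated from each response.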

Prerequisites and Infrastructure Planning

Hardware Requirements

Exact sizing depends on the models you serve, but the manifests in this guide assume roughly:

  • Nodes: at least 2 amd64 worker nodes with NVIDIA GPUs (labeled node-type: gpu-enabled)
  • Memory: 8Gi requested / 32Gi limit per Ollama pod
  • CPU: 4 cores requested / 16 cores limit per pod
  • GPU: 1 NVIDIA GPU per replica
  • Storage: ~600Gi of fast NVMe (500Gi for models, 100Gi for cache)

Software Prerequisites

# Verify your environment
kubectl version --client
helm version
docker --version
nvidia-docker --version  # For GPU support (legacy; newer clusters use the NVIDIA Container Toolkit / GPU Operator)

# Check Kubernetes cluster resources
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

Step 1: Setting Up the Kubernetes Namespace and RBAC

Let’s start with proper namespace isolation and security:

# ollama-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ai-infrastructure
  labels:
    name: ai-infrastructure
    security-level: standard
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ollama-service-account
  namespace: ai-infrastructure
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ai-infrastructure
  name: ollama-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ollama-rolebinding
  namespace: ai-infrastructure
subjects:
- kind: ServiceAccount
  name: ollama-service-account
  namespace: ai-infrastructure
roleRef:
  kind: Role
  name: ollama-role
  apiGroup: rbac.authorization.k8s.io

Step 2: Configuring Persistent Storage with Performance Optimization

# ollama-storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-nvme
provisioner: kubernetes.io/no-provisioner  # static local PVs; swap in your CSI driver for dynamic provisioning
parameters:
  type: nvme-ssd         # illustrative -- no-provisioner ignores parameters
  replication-type: none
  fsType: ext4
mountOptions:
  - noatime
  - nodiratime
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models-pvc
  namespace: ai-infrastructure
  labels:
    app: ollama
    component: storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-nvme
  resources:
    requests:
      storage: 500Gi  # Adjust based on models you plan to run
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-cache-pvc
  namespace: ai-infrastructure
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-nvme
  resources:
    requests:
      storage: 100Gi  # For temporary files and caching
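To sanity-check the 500Gi request, a back-of-the-envelope estimate works: a 4-bit-quantized model weighs roughly parameters × 0.5 bytes plus overhead. The constants below are rough assumptions, not official Ollama sizes:

```python
def model_size_gb(params_billion, bits_per_weight=4, overhead=1.2):
    """Rough on-disk size estimate for a quantized model: parameters x
    bits/8, plus ~20% for tokenizer, metadata, and format overhead.
    Ballpark figures only -- actual Ollama artifacts vary by quantization."""
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8) * overhead
    return bytes_total / 1e9

models = {"llama3.1:8b": 8, "codellama:7b": 7, "llama3.1:70b": 70}
for name, params in models.items():
    print(f"{name}: ~{model_size_gb(params):.1f} GB")

total = sum(model_size_gb(p) for p in models.values())
print(f"total: ~{total:.0f} GB -> size the PVC with 2-3x headroom")
```

Even a 70B model at Q4 fits comfortably in 500Gi alongside several smaller models, but leave headroom for pulled-but-unused versions and re-pulls during upgrades.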

Step 3: Ollama Deployment with GPU Support and Resource Management

# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-infrastructure
  labels:
    app: ollama
    version: v1.0.0
spec:
  replicas: 2  # Start with 2 for HA; note the RWO PVCs below can only attach to one node -- use ReadWriteMany storage or a StatefulSet with volumeClaimTemplates when replicas > 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Ensure zero downtime
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
        version: v1.0.0
      annotations:
        # Assumes a metrics exporter/sidecar on 8080; Ollama does not expose
        # Prometheus metrics natively
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: ollama-service-account
      nodeSelector:
        kubernetes.io/arch: amd64
        node-type: gpu-enabled  # Use GPU-enabled nodes
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: ollama
        image: ollama/ollama:latest  # pin a specific tag in production
        ports:
        - containerPort: 11434
          name: api
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
        env:
        - name: OLLAMA_HOST
          value: "0.0.0.0"
        - name: OLLAMA_ORIGINS
          value: "*"  # restrict to known origins in production
        - name: OLLAMA_NUM_PARALLEL
          value: "4"  # Adjust based on your hardware
        - name: OLLAMA_MAX_LOADED_MODELS
          value: "3"  # Memory management
        - name: OLLAMA_FLASH_ATTENTION
          value: "1"  # Enable optimizations
        - name: CUDA_VISIBLE_DEVICES
          value: "0"  # GPU assignment
        resources:
          requests:
            memory: "8Gi"
            cpu: "4"
            nvidia.com/gpu: 1
          limits:
            memory: "32Gi"
            cpu: "16"
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-models
          mountPath: /root/.ollama
        - name: ollama-cache
          mountPath: /tmp/ollama
        - name: config
          mountPath: /etc/ollama
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 2
        startupProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30  # Allow up to 5 minutes for startup
      - name: model-downloader
        image: curlimages/curl:latest
        command: ["/bin/sh"]
        args:
        - -c
        - |
          # Download essential models on startup
          while ! curl -f http://localhost:11434/api/tags; do
            echo "Waiting for Ollama to be ready..."
            sleep 5
          done
          
          # Download models in background
          curl -X POST http://localhost:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "llama3.1:8b"}' &
          
          curl -X POST http://localhost:11434/api/pull \
            -H "Content-Type: application/json" \
            -d '{"name": "codellama:7b"}' &
          
          # Keep container running
          sleep infinity
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: ollama-models
        persistentVolumeClaim:
          claimName: ollama-models-pvc
      - name: ollama-cache
        persistentVolumeClaim:
          claimName: ollama-cache-pvc
      - name: config
        configMap:
          name: ollama-config
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ollama
              topologyKey: kubernetes.io/hostname
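The probe settings above deserve a sanity check: a startup probe gives the container `initialDelaySeconds + periodSeconds × failureThreshold` to come up before Kubernetes restarts it, which is why the manifest's comment says "up to 5 minutes":

```python
def max_startup_seconds(initial_delay, period, failure_threshold):
    """Upper bound on container start time before Kubernetes gives up:
    the startup probe fires every `period` seconds after `initial_delay`
    and tolerates `failure_threshold` consecutive failures."""
    return initial_delay + period * failure_threshold

# Values from the startupProbe in the deployment above
budget = max_startup_seconds(initial_delay=30, period=10, failure_threshold=30)
print(f"startup budget: {budget}s (~{budget // 60} minutes)")
```

If you serve larger models whose first load exceeds this budget, raise `failureThreshold` rather than `initialDelaySeconds` so healthy pods still pass quickly.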

Step 4: Service Configuration with Load Balancing

# ollama-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ai-infrastructure
  labels:
    app: ollama
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "tcp"
spec:
  selector:
    app: ollama
  ports:
  - name: api
    port: 11434
    targetPort: 11434
    protocol: TCP
  - name: metrics
    port: 8080
    targetPort: 8080
    protocol: TCP
  type: LoadBalancer
  sessionAffinity: ClientIP  # For MCP context preservation
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-headless
  namespace: ai-infrastructure
  labels:
    app: ollama
spec:
  clusterIP: None
  selector:
    app: ollama
  ports:
  - name: api
    port: 11434
    targetPort: 11434

Step 5: Implementing Horizontal Pod Autoscaler

# ollama-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai-infrastructure
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
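The HPA above uses the standard Kubernetes scaling formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A quick sketch of how the numbers play out with the 70% CPU target:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=10):
    """Standard HPA formula: desired = ceil(current * current/target),
    clamped to the configured bounds (values mirror the manifest above)."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(2, 140, 70))   # load doubled -> 4 replicas
print(desired_replicas(4, 35, 70))    # load halved -> back to the minimum
print(desired_replicas(8, 200, 70))   # spike -> capped at maxReplicas
```

The long stabilization windows (300s up, 600s down) matter for LLM workloads: model loading is slow, so you want the HPA to ride out short bursts rather than thrash pods that each take minutes to warm up.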

Step 6: Anthropic MCP Integration and Configuration

# mcp-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-config
  namespace: ai-infrastructure
data:
  mcp-server.json: |
    {
      "servers": {
        "ollama-cluster": {
          "command": "node",
          "args": ["/app/mcp-server.js"],
          "env": {
            "OLLAMA_HOST": "ollama-service.ai-infrastructure.svc.cluster.local:11434",
            "MCP_LOG_LEVEL": "info",
            "CONTEXT_WINDOW_SIZE": "8192",
            "MAX_CONCURRENT_REQUESTS": "10"
          }
        }
      },
      "mcpServers": {
        "ollama-cluster": {
          "command": "node",
          "args": ["/app/mcp-server.js"]
        }
      }
    }
  mcp-server.js: |
    // Illustrative sketch: the Server/addHandler API below is a simplified
    // stand-in; the official SDK package is @modelcontextprotocol/sdk.
    const { Server } = require('@anthropic-ai/mcp-server');
    const axios = require('axios');
    
    class OllamaMCPServer extends Server {
      constructor() {
        super();
        this.ollamaUrl = process.env.OLLAMA_HOST || 'localhost:11434';
        this.contextCache = new Map();
        this.setupHandlers();
      }
      
      setupHandlers() {
        // Handle model list requests
        this.addHandler('tools/list', async () => {
          const response = await axios.get(`http://${this.ollamaUrl}/api/tags`);
          return {
            tools: response.data.models.map(model => ({
              name: model.name,
              description: `LLM: ${model.name}`,
              inputSchema: {
                type: "object",
                properties: {
                  prompt: { type: "string" },
                  context: { type: "string" },
                  temperature: { type: "number", default: 0.7 }
                }
              }
            }))
          };
        });
        
        // Handle model inference
        this.addHandler('tools/call', async (request) => {
          const { name, arguments: args } = request.params;
          
          // Preserve context for conversation continuity
          const contextKey = `${request.meta?.conversationId || 'default'}`;
          let context = this.contextCache.get(contextKey) || '';
          
          const payload = {
            model: name,
            prompt: args.prompt,
            context: context,
            options: {
              temperature: args.temperature || 0.7,
              num_ctx: parseInt(process.env.CONTEXT_WINDOW_SIZE) || 8192
            }
          };
          
          try {
            const response = await axios.post(
              `http://${this.ollamaUrl}/api/generate`,
              payload
            );
            
            // Update context cache
            if (response.data.context) {
              this.contextCache.set(contextKey, response.data.context);
            }
            
            return {
              content: [{
                type: "text",
                text: response.data.response
              }]
            };
          } catch (error) {
            throw new Error(`Ollama request failed: ${error.message}`);
          }
        });
      }
    }
    
    const server = new OllamaMCPServer();
    server.connect({ stdio: true });
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
  namespace: ai-infrastructure
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp-server
        image: node:18-alpine
        command: ["node", "/app/mcp-server.js"]
        ports:
        - containerPort: 3000
        env:
        - name: OLLAMA_HOST
          value: "ollama-service.ai-infrastructure.svc.cluster.local:11434"
        - name: MCP_LOG_LEVEL
          value: "info"
        volumeMounts:
        - name: mcp-config
          mountPath: /app
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: mcp-config
        configMap:
          name: mcp-config

Step 7: Monitoring and Observability Setup

# monitoring.yaml
apiVersion: monitoring.coreos.com/v1  # ServiceMonitor requires the Prometheus Operator CRDs
kind: ServiceMonitor
metadata:
  name: ollama-metrics
  namespace: ai-infrastructure
  labels:
    app: ollama
spec:
  selector:
    matchLabels:
      app: ollama
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard
  namespace: ai-infrastructure
data:
  ollama-dashboard.json: |
    {
      "dashboard": {
        "title": "Ollama LLM Infrastructure",
        "panels": [
          {
            "title": "Model Inference Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(ollama_requests_total[5m])",
                "legendFormat": "{{model}}"
              }
            ]
          },
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "nvidia_gpu_utilization_gpu",
                "legendFormat": "GPU {{gpu}}"
              }
            ]
          },
          {
            "title": "Model Load Time",
            "type": "graph",
            "targets": [
              {
                "expr": "ollama_model_load_duration_seconds",
                "legendFormat": "{{model}}"
              }
            ]
          },
          {
            "title": "Context Cache Hit Rate",
            "type": "stat",
            "targets": [
              {
                "expr": "rate(mcp_context_hits_total[5m]) / rate(mcp_context_requests_total[5m])",
                "legendFormat": "Cache Hit Rate"
              }
            ]
          }
        ]
      }
    }

Production Best Practices

Security Hardening

# security-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-network-policy
  namespace: ai-infrastructure
spec:
  podSelector:
    matchLabels:
      app: ollama
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: application-tier
    - podSelector:
        matchLabels:
          app: mcp-server
    ports:
    - protocol: TCP
      port: 11434
  egress:
  - to: []
    ports:
    - protocol: TCP
      port: 443  # HTTPS only
    - protocol: TCP
      port: 53   # DNS
    - protocol: UDP
      port: 53   # DNS

Resource Quotas and Limits

# resource-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-infrastructure-quota
  namespace: ai-infrastructure
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "500Gi"
    requests.nvidia.com/gpu: "20"
    limits.cpu: "200"
    limits.memory: "1000Gi"
    limits.nvidia.com/gpu: "20"
    persistentvolumeclaims: "10"
    requests.storage: "5Ti"
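A quota like this is worth checking against your planned replica counts before the scheduler does it for you. A small sketch (the helper and dict names are illustrative) using the per-pod requests from the Ollama deployment:

```python
def fits_quota(replicas, per_pod, quota):
    """Check whether `replicas` pods with the given per-pod requests fit
    inside the namespace ResourceQuota (values mirror the manifests above)."""
    return all(replicas * per_pod[key] <= quota[key] for key in per_pod)

ollama_pod = {"cpu": 4, "memory_gi": 8, "gpu": 1}   # per-pod requests
quota = {"cpu": 100, "memory_gi": 500, "gpu": 20}   # namespace hard limits

print(fits_quota(10, ollama_pod, quota))  # HPA max of 10 fits comfortably
print(fits_quota(25, ollama_pod, quota))  # 25 GPUs would exceed the quota
```

Here the HPA's maxReplicas of 10 consumes at most 40 CPU, 80Gi, and 10 GPUs, leaving room in the quota for the MCP server and other workloads in the namespace.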

Backup and Disaster Recovery

#!/bin/bash
# backup-models.sh - Model backup automation

NAMESPACE="ai-infrastructure"
BACKUP_BUCKET="s3://your-backup-bucket/ollama-models"
DATE=$(date +%Y%m%d-%H%M%S)

# Create model snapshot
kubectl exec -n $NAMESPACE deployment/ollama -c ollama -- \
  tar czf /tmp/models-backup-$DATE.tar.gz /root/.ollama/models

# Upload to cloud storage (assumes the AWS CLI is available in the container;
# the stock ollama image does not ship it -- bake it in or use a sidecar)
kubectl exec -n $NAMESPACE deployment/ollama -c ollama -- \
  aws s3 cp /tmp/models-backup-$DATE.tar.gz $BACKUP_BUCKET/

# Cleanup local backup
kubectl exec -n $NAMESPACE deployment/ollama -c ollama -- \
  rm /tmp/models-backup-$DATE.tar.gz

echo "Backup completed: models-backup-$DATE.tar.gz"
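A backup job like this also needs a retention policy, or the bucket grows without bound. A sketch of pruning by the timestamp embedded in the archive name (the function is illustrative; wire it to your storage API's list/delete calls):

```python
from datetime import datetime

def prune_backups(names, keep=7):
    """Given archive names like models-backup-YYYYmmdd-HHMMSS.tar.gz,
    return (kept, deletable), keeping only the `keep` most recent."""
    def stamp(name):
        raw = name.removeprefix("models-backup-").removesuffix(".tar.gz")
        return datetime.strptime(raw, "%Y%m%d-%H%M%S")
    ordered = sorted(names, key=stamp, reverse=True)  # newest first
    return ordered[:keep], ordered[keep:]

backups = [f"models-backup-2025010{d}-020000.tar.gz" for d in range(1, 10)]
kept, deletable = prune_backups(backups, keep=7)
print(len(kept), len(deletable))  # 7 kept, 2 to delete
print(deletable)                  # the two oldest archives
```

Sorting on the parsed timestamp rather than the raw string avoids surprises if the naming scheme ever changes width or separators.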

Performance Optimization Strategies

1. Model Preloading and Warm-up

# model-warmup.sh
curl -X POST http://ollama-service:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Hello",
    "keep_alive": "24h"
  }'
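The `keep_alive` value controls how long the model stays resident after the request. Ollama accepts duration strings; the helper below parses only the simple single-unit form used here, as an illustration of what "24h" buys you:

```python
def keep_alive_seconds(value):
    """Parse a simple single-unit duration string ("24h", "30m", "45s")
    into seconds -- how long the model stays loaded after the request.
    (Ollama itself accepts richer duration formats than this sketch.)"""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]

print(keep_alive_seconds("24h"))  # 86400 -> model stays warm all day
print(keep_alive_seconds("30m"))  # 1800
```

A long keep_alive trades memory for latency: the first request after an unload pays the full model-load cost, so for interactive workloads keeping hot models pinned is usually worth the RAM.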

2. Load Balancing Configuration

# Advanced load balancing with session affinity
apiVersion: v1
kind: Service
metadata:
  name: ollama-sticky-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=3600
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

3. GPU Memory Management

env:
- name: CUDA_MPS_PIPE_DIRECTORY
  value: "/tmp/nvidia-mps"
- name: CUDA_MPS_LOG_DIRECTORY
  value: "/tmp/nvidia-log"
- name: OLLAMA_GPU_MEMORY_FRACTION  # illustrative knob; verify which GPU tuning vars your Ollama version supports
  value: "0.8"  # Use 80% of GPU memory

Troubleshooting Common Issues

Model Loading Failures

# Debug model loading
kubectl logs -n ai-infrastructure deployment/ollama -c ollama --tail=100

# Check GPU availability
kubectl exec -n ai-infrastructure deployment/ollama -- nvidia-smi

# Verify model integrity
kubectl exec -n ai-infrastructure deployment/ollama -- \
  ollama list

Performance Bottlenecks

# Monitor resource usage
kubectl top pods -n ai-infrastructure

# Check network latency
kubectl exec -n ai-infrastructure deployment/ollama -- \
  curl -w "@curl-format.txt" -o /dev/null -s http://ollama-service:11434/api/tags

Scaling and Cost Optimization

Cluster Autoscaling Configuration

# cluster-autoscaler.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # use the release matching your cluster's minor version (k8s.gcr.io is deprecated)
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ai-cluster
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m

Cost Monitoring Dashboard

# Install cost monitoring (Kubecost) via Helm
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace

# Access cost dashboard
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090:9090

Conclusion: Building Sustainable LLM Infrastructure

Deploying Ollama on Kubernetes with Anthropic MCP integration provides a powerful foundation for enterprise LLM workloads. This architecture offers several key advantages:

  • Cost Predictability: Fixed infrastructure costs vs. per-token pricing
  • Data Privacy: Complete control over sensitive information
  • Performance Optimization: Hardware-specific tuning and caching
  • Scalability: Dynamic resource allocation based on demand
  • Vendor Independence: Flexibility to switch models and providers

Key Takeaways for DevOps Teams

  1. Start Small, Scale Smart: Begin with a minimal deployment and scale based on actual usage patterns
  2. Monitor Everything: Implement comprehensive observability from day one
  3. Security First: Apply defense-in-depth principles throughout your infrastructure
  4. Automate Operations: Use GitOps for configuration management and automated deployments
  5. Plan for Growth: Design your architecture to handle 10x scale from the beginning

Next Steps

  • Implement CI/CD pipelines for model updates
  • Set up federated learning across multiple clusters
  • Explore fine-tuning workflows for domain-specific models
  • Integrate with MLOps platforms for model lifecycle management

Ready to transform your organization’s AI capabilities? Start with a proof-of-concept deployment using the configurations in this guide, then gradually expand based on your specific requirements and learning.

Have Queries? Join https://launchpass.com/collabnix
