
Building a Multi-Tenant LLM Platform on Kubernetes: Complete Guide


As Large Language Models (LLMs) become essential infrastructure for modern applications, organizations face the challenge of serving multiple teams or customers from a shared platform. Building a multi-tenant LLM platform on Kubernetes enables efficient resource utilization, strong isolation, and scalable inference serving. In this comprehensive guide, we’ll architect and deploy a production-ready multi-tenant LLM platform with proper isolation, resource management, and security.

Understanding Multi-Tenancy Requirements for LLM Platforms

Multi-tenant LLM platforms must address several critical requirements that differ from traditional microservices:

  • Resource Isolation: LLM inference is GPU-intensive, requiring strict resource quotas to prevent noisy neighbor problems
  • Data Privacy: Tenant requests and responses must remain isolated with zero cross-contamination
  • Cost Attribution: Accurate tracking of GPU usage, token consumption, and API calls per tenant
  • Performance SLAs: Different tenants may require varying latency guarantees and throughput levels
  • Model Versioning: Supporting different model versions or fine-tuned variants per tenant

Architecture Overview

Our multi-tenant LLM platform leverages Kubernetes native features combined with specialized inference servers. The architecture consists of:

  • Namespace-based isolation: Each tenant gets a dedicated namespace with resource quotas
  • vLLM inference servers: High-performance LLM serving with PagedAttention optimization
  • Istio service mesh: Traffic management, mTLS, and request routing
  • NVIDIA GPU Operator: GPU resource management and time-slicing
  • Kyverno policies: Automated policy enforcement for security and compliance

Prerequisites and Cluster Setup

Before deploying the platform, ensure your Kubernetes cluster meets these requirements:

# Verify Kubernetes version (1.26+); the --short flag was removed in newer kubectl releases
kubectl version

# Check for GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true
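
The operator takes a few minutes to roll out its driver, device plugin, and monitoring components. A quick way to confirm the GPUs are schedulable afterwards:

# Wait for all operator components to become Ready
kubectl get pods -n gpu-operator

# Confirm GPUs are advertised as allocatable resources on the nodes
kubectl get nodes -o json | jq '.items[].status.allocatable."nvidia.com/gpu"'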

Setting Up Tenant Namespaces with Resource Quotas

Each tenant requires an isolated namespace with enforced resource limits. Here’s a comprehensive tenant provisioning configuration:

apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
  labels:
    tenant: acme
    istio-injection: enabled
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "32"
    requests.memory: 128Gi
    # Extended resources such as nvidia.com/gpu only accept the requests. prefix in a ResourceQuota
    requests.nvidia.com/gpu: "2"
    limits.cpu: "48"
    limits.memory: 192Gi
    persistentvolumeclaims: "5"
    services.loadbalancers: "1"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-acme-limits
  namespace: tenant-acme
spec:
  limits:
  - max:
      cpu: "16"
      memory: 64Gi
      nvidia.com/gpu: "1"
    min:
      cpu: "100m"
      memory: 128Mi
    type: Container
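
Assuming the manifests above are saved to a file such as tenant-acme.yaml (the filename is just an example), provisioning and verifying the tenant looks like this:

# Create the namespace, quota, and limit range
kubectl apply -f tenant-acme.yaml

# Confirm the quota and default limits are enforced
kubectl describe resourcequota tenant-acme-quota -n tenant-acme
kubectl describe limitrange tenant-acme-limits -n tenant-acme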

Deploying vLLM Inference Servers

vLLM provides high-throughput LLM serving with advanced features like continuous batching and PagedAttention. Here’s a production-ready deployment configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: tenant-acme
  labels:
    app: llm-inference
    tenant: acme
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
        tenant: acme
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.3.0
        command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        args:
        - --model
        - meta-llama/Llama-2-7b-chat-hf
        - --tensor-parallel-size
        - "1"
        - --max-model-len
        - "4096"
        - --gpu-memory-utilization
        - "0.9"
        - --trust-remote-code
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: VLLM_WORKER_MULTIPROC_METHOD
          value: spawn
        resources:
          requests:
            cpu: "8"
            memory: 32Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "12"
            memory: 48Gi
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8000
          name: http
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
  name: llm-inference
  namespace: tenant-acme
  labels:
    app: llm-inference
spec:
  selector:
    app: llm-inference
  ports:
  - port: 8000
    targetPort: 8000
    name: http
  type: ClusterIP
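
Once the pods report Ready, you can sanity-check the OpenAI-compatible endpoint from a workstation with cluster access. The commands below assume the Deployment and Service above were applied unchanged:

# Forward the tenant's inference Service to localhost
kubectl -n tenant-acme port-forward svc/llm-inference 8000:8000 &

# List the model(s) being served
curl -s http://localhost:8000/v1/models

# Send a small test chat completion
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'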

Implementing Network Isolation with Istio

Istio provides sophisticated traffic management and security policies. Configure authorization policies to ensure strict tenant isolation:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-acme
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["tenant-acme", "istio-system"]
    to:
    - operation:
        methods: ["GET", "POST"]
        paths: ["/v1/*"]
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-routing
  namespace: tenant-acme
spec:
  hosts:
  - llm-inference.tenant-acme.svc.cluster.local
  http:
  - match:
    - headers:
        x-tenant-id:
          exact: acme
    route:
    - destination:
        host: llm-inference.tenant-acme.svc.cluster.local
        port:
          number: 8000
    timeout: 300s
    retries:
      attempts: 2
      perTryTimeout: 150s
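
The AuthorizationPolicy above enforces isolation at layer 7. As defense in depth, you can also add a Kubernetes NetworkPolicy per tenant namespace so that only the tenant's own workloads and the Istio control plane can reach the inference pods. The sketch below relies on the tenant label applied to the namespace earlier and on the standard kubernetes.io/metadata.name label:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-acme-isolation
  namespace: tenant-acme
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          tenant: acme
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: istio-system

This is also what the kubectl get networkpolicies check in the troubleshooting section expects to find.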

Building a Tenant Management API

A centralized API simplifies tenant provisioning and management. Here’s a Python-based implementation using the Kubernetes client:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

class TenantManager:
    def __init__(self):
        # Use the in-cluster service account when running inside Kubernetes,
        # otherwise fall back to the local kubeconfig (useful for the usage example below)
        try:
            config.load_incluster_config()
        except config.ConfigException:
            config.load_kube_config()
        self.core_v1 = client.CoreV1Api()
        self.apps_v1 = client.AppsV1Api()
        
    def create_tenant(self, tenant_id, gpu_quota=2, memory_quota="128Gi"):
        """Create a new tenant with dedicated namespace and resources"""
        namespace_name = f"tenant-{tenant_id}"
        
        # Create namespace
        namespace = client.V1Namespace(
            metadata=client.V1ObjectMeta(
                name=namespace_name,
                labels={
                    "tenant": tenant_id,
                    "istio-injection": "enabled"
                }
            )
        )
        
        try:
            self.core_v1.create_namespace(namespace)
            print(f"Created namespace: {namespace_name}")
        except ApiException as e:
            if e.status != 409:
                raise
            print(f"Namespace {namespace_name} already exists")
        
        # Create resource quota
        quota = client.V1ResourceQuota(
            metadata=client.V1ObjectMeta(name=f"{namespace_name}-quota"),
            spec=client.V1ResourceQuotaSpec(
                # Extended resources such as nvidia.com/gpu only accept the requests. prefix in a ResourceQuota
                hard={
                    "requests.nvidia.com/gpu": str(gpu_quota),
                    "requests.memory": memory_quota
                }
            )
        )
        
        try:
            self.core_v1.create_namespaced_resource_quota(
                namespace=namespace_name,
                body=quota
            )
            print(f"Created resource quota for {namespace_name}")
        except ApiException as e:
            print(f"Error creating quota: {e}")
        
        return namespace_name
    
    def deploy_llm_inference(self, tenant_id, model_name, replicas=2):
        """Deploy LLM inference server for tenant"""
        namespace = f"tenant-{tenant_id}"
        
        deployment = client.V1Deployment(
            metadata=client.V1ObjectMeta(name="llm-inference"),
            spec=client.V1DeploymentSpec(
                replicas=replicas,
                selector=client.V1LabelSelector(
                    match_labels={"app": "llm-inference"}
                ),
                template=client.V1PodTemplateSpec(
                    metadata=client.V1ObjectMeta(
                        labels={"app": "llm-inference", "tenant": tenant_id}
                    ),
                    spec=client.V1PodSpec(
                        containers=[
                            client.V1Container(
                                name="vllm",
                                image="vllm/vllm-openai:v0.3.0",
                                command=["python3", "-m", "vllm.entrypoints.openai.api_server"],
                                args=["--model", model_name, "--tensor-parallel-size", "1"],
                                resources=client.V1ResourceRequirements(
                                    requests={"nvidia.com/gpu": "1", "memory": "32Gi"},
                                    limits={"nvidia.com/gpu": "1", "memory": "48Gi"}
                                )
                            )
                        ]
                    )
                )
            )
        )
        
        try:
            self.apps_v1.create_namespaced_deployment(
                namespace=namespace,
                body=deployment
            )
            print(f"Deployed LLM inference for {tenant_id}")
        except ApiException as e:
            print(f"Error deploying inference: {e}")

# Usage
manager = TenantManager()
manager.create_tenant("acme", gpu_quota=2)
manager.deploy_llm_inference("acme", "meta-llama/Llama-2-7b-chat-hf")
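
The TenantManager class can be exposed as a small HTTP service so that tenant onboarding becomes a single API call. The sketch below uses FastAPI purely as an example framework; the endpoint name and request model are illustrative, not part of the platform above:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
manager = TenantManager()

class TenantRequest(BaseModel):
    tenant_id: str
    gpu_quota: int = 2
    model_name: str = "meta-llama/Llama-2-7b-chat-hf"

@app.post("/tenants")
def create_tenant(req: TenantRequest):
    # Provision namespace and quota, then deploy the tenant's inference server
    namespace = manager.create_tenant(req.tenant_id, gpu_quota=req.gpu_quota)
    manager.deploy_llm_inference(req.tenant_id, req.model_name)
    return {"namespace": namespace, "model": req.model_name}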

Implementing Cost Tracking and Monitoring

Accurate cost attribution is crucial for multi-tenant platforms. Deploy Prometheus with a scrape configuration that keeps the vLLM and GPU metrics and relabels each series with its tenant:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: llm-inference
      - source_labels: [__meta_kubernetes_namespace]
        target_label: tenant
        regex: tenant-(.*)
        replacement: $1
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(vllm_.*|nvidia_gpu_.*)'
        action: keep
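
With the relabeling above, every scraped series carries a tenant label, so per-tenant usage rolls up with simple PromQL. The vLLM metric names below are placeholders to verify against your server's /metrics output, since they vary between releases:

# Successful requests per tenant over the last hour
sum by (tenant) (increase(vllm_request_success_total[1h]))

# Generated tokens per tenant over the last 24 hours (a proxy for cost attribution)
sum by (tenant) (increase(vllm_generation_tokens_total[24h]))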

Security Best Practices

Implement comprehensive security measures using Kyverno policies:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-tenant-labels
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: check-tenant-label
    match:
      any:
      - resources:
          kinds:
          - Pod
          - Deployment
          namespaces:
          - tenant-*
    validate:
      message: "All resources must have a tenant label"
      pattern:
        metadata:
          labels:
            tenant: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Images must come from approved registries"
      pattern:
        spec:
          containers:
          - image: "docker.io/* | ghcr.io/* | vllm/*"

Horizontal Pod Autoscaling for LLM Workloads

Configure HPA based on custom metrics for intelligent scaling. GPU utilization is not exposed through the standard Resource metrics pipeline (which only reports CPU and memory), so the example below scales on a per-pod queue-depth metric; serving that metric to the HPA requires a custom metrics adapter such as the Prometheus Adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: tenant-acme
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # The Resource metric type only covers cpu and memory from the metrics server,
  # so GPU-backed inference is scaled on a custom per-pod queue-depth metric
  # served through a metrics adapter (see the adapter sketch below)
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_size
      target:
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
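
The Pods metric above only works if a custom metrics adapter publishes it through the custom metrics API. A minimal Prometheus Adapter rule for the queue-depth metric might look like the following; the series name mirrors the one referenced in the HPA and should be adjusted to whatever your vLLM version actually exports:

rules:
- seriesQuery: 'vllm_request_queue_size{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "vllm_request_queue_size"
    as: "vllm_request_queue_size"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'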

Troubleshooting Common Issues

Here are solutions to frequent challenges when running multi-tenant LLM platforms:

GPU Memory Exhaustion

# Check GPU memory usage per pod
kubectl exec -n tenant-acme <pod-name> -- nvidia-smi

# Reduce model memory footprint
# Add to vLLM args: --gpu-memory-utilization 0.8 --max-model-len 2048

# Enable tensor parallelism for larger models
# --tensor-parallel-size 2 (requires 2 GPUs)

Slow Inference Performance

# Check for CPU throttling
kubectl top pods -n tenant-acme

# Verify shared memory allocation
kubectl describe pod -n tenant-acme <pod-name> | grep -A 5 "shm"

# Increase batch size for better throughput
# Add to vLLM: --max-num-batched-tokens 8192

Cross-Tenant Isolation Violations

# Verify network policies
kubectl get networkpolicies -n tenant-acme

# Check Istio authorization
kubectl get authorizationpolicies -n tenant-acme

# Test isolation
kubectl run test --rm -i --tty --image=curlimages/curl -n tenant-acme -- \
  curl -v http://llm-inference.tenant-beta.svc.cluster.local:8000/health
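
If isolation is working, the cross-namespace request above should be rejected by the Istio AuthorizationPolicy or the tenant NetworkPolicy rather than returning a healthy response.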

Performance Optimization Tips

  • Use GPU time-slicing: For development environments, enable NVIDIA MIG or time-slicing to share GPUs across multiple tenants (a sample time-slicing configuration is sketched after this list)
  • Implement request queuing: Deploy a queue system (Redis/RabbitMQ) to handle burst traffic and prevent pod overload
  • Enable model caching: Use persistent volumes to cache model weights and reduce startup time
  • Optimize batch sizes: Tune vLLM’s continuous batching parameters based on your latency requirements
  • Use node affinity: Pin tenants to specific GPU node types for consistent performance
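
For the time-slicing tip above, the NVIDIA GPU Operator reads its sharing configuration from a ConfigMap referenced by its ClusterPolicy. A minimal sketch, with four virtual GPUs per physical GPU chosen purely for illustration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

After creating the ConfigMap, point the operator's device plugin at it (the ClusterPolicy installed by the Helm chart is typically named cluster-policy):

kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'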

Conclusion

Building a production-ready multi-tenant LLM platform on Kubernetes requires careful attention to isolation, resource management, and security. By leveraging Kubernetes-native features like namespaces, resource quotas, and service meshes, combined with specialized tools like vLLM and GPU operators, you can create a scalable platform that serves multiple tenants efficiently.

The architecture presented in this guide provides strong isolation guarantees, flexible resource allocation, and comprehensive monitoring capabilities. As you scale your platform, continue to monitor GPU utilization, adjust resource quotas based on actual usage patterns, and implement automated policies to maintain security and compliance across all tenants.

Start with a single tenant deployment to validate your configuration, then gradually onboard additional tenants while monitoring performance metrics and cost attribution. With proper planning and implementation, your multi-tenant LLM platform will provide reliable, secure, and cost-effective AI inference services at scale.
