As Large Language Models (LLMs) become essential infrastructure for modern applications, organizations face the challenge of serving multiple teams or customers from a shared platform. Building a multi-tenant LLM platform on Kubernetes enables efficient resource utilization, strong isolation, and scalable inference serving. In this guide, we’ll architect and deploy a production-ready multi-tenant platform, covering tenant isolation, resource management, cost attribution, and security.
Understanding Multi-Tenancy Requirements for LLM Platforms
Multi-tenant LLM platforms must address several critical requirements that differ from traditional microservices:
- Resource Isolation: LLM inference is GPU-intensive, requiring strict resource quotas to prevent noisy neighbor problems
- Data Privacy: Tenant requests and responses must remain isolated with zero cross-contamination
- Cost Attribution: Accurate tracking of GPU usage, token consumption, and API calls per tenant
- Performance SLAs: Different tenants may require varying latency guarantees and throughput levels
- Model Versioning: Supporting different model versions or fine-tuned variants per tenant
Architecture Overview
Our multi-tenant LLM platform leverages Kubernetes native features combined with specialized inference servers. The architecture consists of:
- Namespace-based isolation: Each tenant gets a dedicated namespace with resource quotas
- vLLM inference servers: High-performance LLM serving with PagedAttention optimization
- Istio service mesh: Traffic management, mTLS, and request routing
- NVIDIA GPU Operator: GPU resource management and time-slicing
- Kyverno policies: Automated policy enforcement for security and compliance
Prerequisites and Cluster Setup
Before deploying the platform, ensure your Kubernetes cluster meets these requirements:
# Verify Kubernetes version (1.26+)
kubectl version
# Check for GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'
# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true
Setting Up Tenant Namespaces with Resource Quotas
Each tenant requires an isolated namespace with enforced resource limits. Here’s a comprehensive tenant provisioning configuration:
apiVersion: v1
kind: Namespace
metadata:
name: tenant-acme
labels:
tenant: acme
istio-injection: enabled
---
apiVersion: v1
kind: ResourceQuota
metadata:
name: tenant-acme-quota
namespace: tenant-acme
spec:
hard:
requests.cpu: "32"
requests.memory: 128Gi
requests.nvidia.com/gpu: "2"
limits.cpu: "48"
limits.memory: 192Gi
limits.nvidia.com/gpu: "2"
persistentvolumeclaims: "5"
services.loadbalancers: "1"
---
apiVersion: v1
kind: LimitRange
metadata:
name: tenant-acme-limits
namespace: tenant-acme
spec:
limits:
- max:
cpu: "16"
memory: 64Gi
nvidia.com/gpu: "1"
min:
cpu: "100m"
memory: 128Mi
type: Container
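Resource quotas keep tenants inside their compute budget, but they do not stop a pod in one namespace from calling another tenant’s services directly. A baseline NetworkPolicy that only admits traffic from the tenant’s own namespace and from istio-system closes that gap. This is a minimal sketch, assuming your CNI plugin enforces NetworkPolicy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-acme-default-deny
  namespace: tenant-acme
spec:
  podSelector: {}          # applies to every pod in the tenant namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Traffic from within the tenant namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: tenant-acme
        # Traffic from the Istio ingress gateway and control plane namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
        # Add your monitoring namespace here if Prometheus scrapes these pods directly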
Deploying vLLM Inference Servers
vLLM provides high-throughput LLM serving with advanced features like continuous batching and PagedAttention. Here’s a production-ready deployment configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-inference
namespace: tenant-acme
labels:
app: llm-inference
tenant: acme
spec:
replicas: 2
selector:
matchLabels:
app: llm-inference
template:
metadata:
labels:
app: llm-inference
tenant: acme
spec:
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
containers:
- name: vllm
image: vllm/vllm-openai:v0.3.0
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
args:
- --model
- meta-llama/Llama-2-7b-chat-hf
- --tensor-parallel-size
- "1"
- --max-model-len
- "4096"
- --gpu-memory-utilization
- "0.9"
- --trust-remote-code
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
- name: VLLM_WORKER_MULTIPROC_METHOD
value: spawn
resources:
requests:
cpu: "8"
memory: 32Gi
nvidia.com/gpu: "1"
limits:
cpu: "12"
memory: 48Gi
nvidia.com/gpu: "1"
ports:
- containerPort: 8000
name: http
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 30
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
volumeMounts:
- name: shm
mountPath: /dev/shm
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
name: llm-inference
namespace: tenant-acme
labels:
app: llm-inference
spec:
selector:
app: llm-inference
ports:
- port: 8000
targetPort: 8000
name: http
type: ClusterIP
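Model weights for a 7B-parameter model run to several gigabytes, so every pod restart pays a download penalty. The model caching tip later in this guide can be implemented with a PersistentVolumeClaim mounted at the container’s Hugging Face cache path (/root/.cache/huggingface by default in the vLLM image). A sketch, where the claim name model-cache and the storage class nfs-models are placeholders for your environment:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: tenant-acme
spec:
  accessModes:
    - ReadWriteMany              # shared by both inference replicas
  storageClassName: nfs-models   # placeholder: any RWX-capable storage class
  resources:
    requests:
      storage: 100Gi
The deployment above would then add a corresponding volume and volumeMount alongside the existing /dev/shm mount.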
Implementing Network Isolation with Istio
Istio provides sophisticated traffic management and security policies. Configure authorization policies to ensure strict tenant isolation:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: tenant-isolation
namespace: tenant-acme
spec:
action: ALLOW
rules:
- from:
- source:
namespaces: ["tenant-acme", "istio-system"]
to:
- operation:
methods: ["GET", "POST"]
paths: ["/v1/*"]
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: llm-routing
namespace: tenant-acme
spec:
hosts:
- llm-inference.tenant-acme.svc.cluster.local
http:
- match:
- headers:
x-tenant-id:
exact: acme
route:
- destination:
host: llm-inference.tenant-acme.svc.cluster.local
port:
number: 8000
timeout: 300s
retries:
attempts: 2
perTryTimeout: 150s
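The architecture also calls for mTLS between workloads, which Istio permits but does not require by default. A namespace-scoped PeerAuthentication in STRICT mode rejects any plaintext traffic to the inference pods:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: tenant-acme
spec:
  mtls:
    mode: STRICT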
Building a Tenant Management API
A centralized API simplifies tenant provisioning and management. Here’s a Python-based implementation using the Kubernetes client:
from kubernetes import client, config
from kubernetes.client.rest import ApiException
class TenantManager:
def __init__(self):
        try:
            config.load_incluster_config()
        except config.ConfigException:
            # Fall back to the local kubeconfig when running outside the cluster
            config.load_kube_config()
self.core_v1 = client.CoreV1Api()
self.apps_v1 = client.AppsV1Api()
def create_tenant(self, tenant_id, gpu_quota=2, memory_quota="128Gi"):
"""Create a new tenant with dedicated namespace and resources"""
namespace_name = f"tenant-{tenant_id}"
# Create namespace
namespace = client.V1Namespace(
metadata=client.V1ObjectMeta(
name=namespace_name,
labels={
"tenant": tenant_id,
"istio-injection": "enabled"
}
)
)
try:
self.core_v1.create_namespace(namespace)
print(f"Created namespace: {namespace_name}")
except ApiException as e:
if e.status != 409:
raise
print(f"Namespace {namespace_name} already exists")
# Create resource quota
quota = client.V1ResourceQuota(
metadata=client.V1ObjectMeta(name=f"{namespace_name}-quota"),
spec=client.V1ResourceQuotaSpec(
hard={
"requests.nvidia.com/gpu": str(gpu_quota),
"requests.memory": memory_quota,
"limits.nvidia.com/gpu": str(gpu_quota)
}
)
)
try:
self.core_v1.create_namespaced_resource_quota(
namespace=namespace_name,
body=quota
)
print(f"Created resource quota for {namespace_name}")
except ApiException as e:
print(f"Error creating quota: {e}")
return namespace_name
def deploy_llm_inference(self, tenant_id, model_name, replicas=2):
"""Deploy LLM inference server for tenant"""
namespace = f"tenant-{tenant_id}"
deployment = client.V1Deployment(
metadata=client.V1ObjectMeta(name="llm-inference"),
spec=client.V1DeploymentSpec(
replicas=replicas,
selector=client.V1LabelSelector(
match_labels={"app": "llm-inference"}
),
template=client.V1PodTemplateSpec(
metadata=client.V1ObjectMeta(
labels={"app": "llm-inference", "tenant": tenant_id}
),
spec=client.V1PodSpec(
containers=[
client.V1Container(
name="vllm",
image="vllm/vllm-openai:v0.3.0",
command=["python3", "-m", "vllm.entrypoints.openai.api_server"],
args=["--model", model_name, "--tensor-parallel-size", "1"],
resources=client.V1ResourceRequirements(
requests={"nvidia.com/gpu": "1", "memory": "32Gi"},
limits={"nvidia.com/gpu": "1", "memory": "48Gi"}
)
)
]
)
)
)
)
try:
self.apps_v1.create_namespaced_deployment(
namespace=namespace,
body=deployment
)
print(f"Deployed LLM inference for {tenant_id}")
except ApiException as e:
print(f"Error deploying inference: {e}")
# Usage
manager = TenantManager()
manager.create_tenant("acme", gpu_quota=2)
manager.deploy_llm_inference("acme", "meta-llama/Llama-2-7b-chat-hf")
Implementing Cost Tracking and Monitoring
Accurate cost attribution is crucial for multi-tenant platforms. Deploy Prometheus with custom metrics for tracking:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: llm-inference
- source_labels: [__meta_kubernetes_namespace]
target_label: tenant
regex: tenant-(.*)
replacement: $1
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
metric_relabel_configs:
- source_labels: [__name__]
regex: '(vllm_.*|nvidia_gpu_.*)'
action: keep
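With the tenant label attached at scrape time, cost attribution reduces to aggregation. The recording rules below sketch per-tenant token throughput, which can be multiplied by your per-token or per-GPU-hour rates in dashboards. The metric names follow vLLM’s Prometheus exporter but differ between vLLM releases (some versions use a vllm: prefix), so verify them against the server’s /metrics endpoint; the file would be referenced from prometheus.yml via a rule_files entry:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-recording-rules
  namespace: monitoring
data:
  tenant-cost.rules.yml: |
    groups:
      - name: tenant-cost-attribution
        rules:
          # Output tokens per second, aggregated per tenant
          - record: tenant:generation_tokens:rate5m
            expr: sum by (tenant) (rate(vllm_generation_tokens_total[5m]))
          # Prompt tokens per second, aggregated per tenant
          - record: tenant:prompt_tokens:rate5m
            expr: sum by (tenant) (rate(vllm_prompt_tokens_total[5m]))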
Security Best Practices
Implement comprehensive security measures using Kyverno policies:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-tenant-labels
spec:
validationFailureAction: enforce
background: true
rules:
- name: check-tenant-label
match:
any:
- resources:
kinds:
- Pod
- Deployment
namespaces:
- tenant-*
validate:
message: "All resources must have a tenant label"
pattern:
metadata:
labels:
tenant: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: restrict-image-registries
spec:
validationFailureAction: enforce
rules:
- name: validate-registries
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Images must come from approved registries"
pattern:
spec:
containers:
- image: "docker.io/* | ghcr.io/* | vllm/*"
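Registry restrictions limit supply-chain risk, but tenant workloads should also be prevented from escalating privileges to the node. The policy below, modeled on Kyverno’s standard disallow-privileged-containers sample, blocks privileged containers in tenant namespaces; adjust it if a workload legitimately needs elevated capabilities:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: enforce
  rules:
    - name: deny-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - tenant-*
      validate:
        message: "Privileged containers are not allowed in tenant namespaces"
        pattern:
          spec:
            containers:
              # securityContext.privileged may be omitted, but must not be true
              - =(securityContext):
                  =(privileged): "false"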
Horizontal Pod Autoscaling for LLM Workloads
Configure the HorizontalPodAutoscaler to scale on workload-specific signals rather than CPU alone. Note that the default metrics pipeline (metrics-server) only reports CPU and memory, so the GPU-utilization and vLLM queue-depth metrics below must be exposed through a custom metrics adapter such as prometheus-adapter before the HPA can act on them:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-inference-hpa
namespace: tenant-acme
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-inference
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: nvidia.com/gpu
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: vllm_request_queue_size
target:
type: AverageValue
averageValue: "10"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 30
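For the vllm_request_queue_size Pods metric above to resolve, an adapter has to publish it on the custom.metrics.k8s.io API. The snippet below is a sketch of a prometheus-adapter rule, assuming the adapter is installed in the monitoring namespace, that the metric name matches what your vLLM version actually exports, and that the scraped series carry namespace and pod labels (extend the relabel_configs shown earlier if they do not):
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'vllm_request_queue_size{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "vllm_request_queue_size"
        metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'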
Troubleshooting Common Issues
Here are solutions to frequent challenges when running multi-tenant LLM platforms:
GPU Memory Exhaustion
# Check GPU memory usage per pod
kubectl exec -n tenant-acme <pod-name> -- nvidia-smi
# Reduce model memory footprint
# Add to vLLM args: --gpu-memory-utilization 0.8 --max-model-len 2048
# Enable tensor parallelism for larger models
# --tensor-parallel-size 2 (requires 2 GPUs)
Slow Inference Performance
# Check for CPU throttling
kubectl top pods -n tenant-acme
# Verify shared memory allocation
kubectl describe pod -n tenant-acme <pod-name> | grep -A 5 "shm"
# Increase batch size for better throughput
# Add to vLLM: --max-num-batched-tokens 8192
Cross-Tenant Isolation Violations
# Verify network policies
kubectl get networkpolicies -n tenant-acme
# Check Istio authorization
kubectl get authorizationpolicies -n tenant-acme
# Test isolation
kubectl run test --rm -i --tty --image=curlimages/curl -n tenant-acme -- \
curl -v http://llm-inference.tenant-beta.svc.cluster.local:8000/health
Performance Optimization Tips
- Use GPU time-slicing: For development environments, enable NVIDIA MIG or time-slicing to share GPUs across multiple tenants (see the ConfigMap sketch after this list)
- Implement request queuing: Deploy a queue system (Redis/RabbitMQ) to handle burst traffic and prevent pod overload
- Enable model caching: Use persistent volumes to cache model weights and reduce startup time
- Optimize batch sizes: Tune vLLM’s continuous batching parameters based on your latency requirements
- Use node affinity: Pin tenants to specific GPU node types for consistent performance
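For the time-slicing tip above, the GPU Operator reads its sharing configuration from a ConfigMap that the ClusterPolicy’s devicePlugin.config section must then reference by name. A sketch that advertises each physical GPU as four schedulable replicas:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
Time-sliced replicas share GPU memory and provide no fault isolation between processes, so this approach is best reserved for development tenants rather than production SLAs.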
Conclusion
Building a production-ready multi-tenant LLM platform on Kubernetes requires careful attention to isolation, resource management, and security. By leveraging Kubernetes-native features like namespaces, resource quotas, and service meshes, combined with specialized tools like vLLM and GPU operators, you can create a scalable platform that serves multiple tenants efficiently.
The architecture presented in this guide provides strong isolation guarantees, flexible resource allocation, and comprehensive monitoring capabilities. As you scale your platform, continue to monitor GPU utilization, adjust resource quotas based on actual usage patterns, and implement automated policies to maintain security and compliance across all tenants.
Start with a single tenant deployment to validate your configuration, then gradually onboard additional tenants while monitoring performance metrics and cost attribution. With proper planning and implementation, your multi-tenant LLM platform will provide reliable, secure, and cost-effective AI inference services at scale.