As Large Language Models (LLMs) become integral to production applications, managing their API traffic efficiently has become critical. LLM gateways serve as the control plane between your applications and LLM providers like OpenAI, Anthropic, or self-hosted models. In this guide, we’ll explore advanced gateway patterns, focusing on rate limiting and load balancing strategies that keep LLM traffic reliable, cost-efficient, and performant.
Why LLM Gateways Matter in Production
Unlike traditional REST APIs, LLM endpoints present unique challenges:
- Variable response times: Token generation can take seconds to minutes
- Cost per request: Each API call incurs significant costs based on token usage
- Rate limits: Provider-imposed constraints on requests per minute (RPM) and tokens per minute (TPM)
- Model availability: Different models have varying capacity and regional availability
An LLM gateway acts as a unified interface that abstracts these complexities, providing intelligent routing, rate limiting, and load balancing capabilities.
Understanding Rate Limiting Patterns for LLM APIs
Rate limiting for LLMs requires sophisticated approaches beyond simple request counting. You need to account for both request frequency and token consumption.
Token-Aware Rate Limiting
Traditional rate limiters count requests, but LLM providers limit both requests and tokens. A token-aware rate limiter tracks both dimensions:
import time
from collections import deque
from threading import Lock


class TokenAwareRateLimiter:
    def __init__(self, max_rpm=3500, max_tpm=90000, window_seconds=60):
        self.max_rpm = max_rpm
        self.max_tpm = max_tpm
        self.window_seconds = window_seconds
        self.requests = deque()
        self.tokens = deque()
        self.lock = Lock()

    def can_proceed(self, estimated_tokens):
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds

            # Remove entries that have fallen out of the sliding window
            while self.requests and self.requests[0] < cutoff:
                self.requests.popleft()
            while self.tokens and self.tokens[0][0] < cutoff:
                self.tokens.popleft()

            # Check both limits
            current_rpm = len(self.requests)
            current_tpm = sum(t[1] for t in self.tokens)

            if current_rpm >= self.max_rpm:
                return False, "RPM limit exceeded"
            if current_tpm + estimated_tokens > self.max_tpm:
                return False, "TPM limit exceeded"

            # Record the request
            self.requests.append(now)
            self.tokens.append((now, estimated_tokens))
            return True, "OK"


# Usage
limiter = TokenAwareRateLimiter(max_rpm=3500, max_tpm=90000)
can_proceed, message = limiter.can_proceed(estimated_tokens=1500)
if can_proceed:
    # Make LLM API call
    pass
else:
    print(f"Rate limit: {message}")
Implementing Rate Limiting with Kong Gateway
Kong provides a robust platform for LLM gateway patterns. Here’s how to configure request-level rate limiting with the rate-limiting-advanced plugin; because the plugin counts requests rather than tokens, token-aware limits like the Python example above are layered on in application code:
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: llm-rate-limiting
  namespace: llm-gateway
plugin: rate-limiting-advanced
config:
  limit:
    - 3500
  window_size:
    - 60
  window_type: sliding
  retry_after_jitter_max: 0
  enforce_consumer_groups: false
  consumer_groups: []
  dictionary_name: kong_rate_limiting_counters
  sync_rate: -1
  namespace: llm-gateway
  strategy: redis
  redis:
    host: redis-service
    port: 6379
    timeout: 2000
    database: 0
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: llm-request-transformer
  namespace: llm-gateway
plugin: request-transformer-advanced
config:
  add:
    headers:
      - X-RateLimit-Limit:3500
      - X-RateLimit-Window:60
Apply the plugins to your LLM service:
kubectl apply -f llm-rate-limiting-plugin.yaml
kubectl annotate service openai-service \
konghq.com/plugins=llm-rate-limiting,llm-request-transformer
Load Balancing Strategies for LLM Gateways
Load balancing LLM traffic requires strategies that account for model-specific characteristics, provider quotas, and cost optimization.
Multi-Provider Load Balancing
Distributing traffic across multiple LLM providers (OpenAI, Anthropic, Azure OpenAI) increases reliability and helps avoid rate limits:
import random
import httpx
from typing import List, Dict


class LLMLoadBalancer:
    def __init__(self, providers: List[Dict]):
        self.providers = providers
        self.health_status = {p['name']: True for p in providers}

    async def weighted_round_robin(self, prompt: str, max_tokens: int):
        # Filter out providers that have been marked unhealthy
        available = [
            p for p in self.providers
            if self.health_status[p['name']]
        ]
        if not available:
            raise Exception("No healthy providers available")

        # Weighted random selection based on configured priority
        weights = [p.get('weight', 1) for p in available]
        provider = random.choices(available, weights=weights)[0]

        return await self._call_provider(provider, prompt, max_tokens)

    async def _call_provider(self, provider: Dict, prompt: str, max_tokens: int):
        # Note: this assumes an OpenAI-compatible endpoint. In practice,
        # Azure OpenAI expects an api-key header and Anthropic expects
        # x-api-key plus anthropic-version, so per-provider adapters are needed.
        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    provider['endpoint'],
                    headers={
                        'Authorization': f"Bearer {provider['api_key']}",
                        'Content-Type': 'application/json'
                    },
                    json={
                        'model': provider['model'],
                        'messages': [{'role': 'user', 'content': prompt}],
                        'max_tokens': max_tokens
                    },
                    timeout=30.0
                )
                return response.json()
        except Exception as e:
            # Mark the provider unhealthy; pair this with periodic health
            # checks (or the circuit breaker shown later) so it can recover.
            self.health_status[provider['name']] = False
            raise e


# Configuration
providers = [
    {
        'name': 'openai-primary',
        'endpoint': 'https://api.openai.com/v1/chat/completions',
        'api_key': 'sk-...',
        'model': 'gpt-4',
        'weight': 3
    },
    {
        'name': 'azure-openai-backup',
        'endpoint': 'https://your-resource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2023-05-15',
        'api_key': 'azure-key',
        'model': 'gpt-4',
        'weight': 2
    },
    {
        'name': 'anthropic-fallback',
        'endpoint': 'https://api.anthropic.com/v1/messages',
        'api_key': 'sk-ant-...',
        'model': 'claude-3-opus-20240229',
        'weight': 1
    }
]

balancer = LLMLoadBalancer(providers)
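Since weighted_round_robin is a coroutine, it has to be awaited from an async context. A minimal invocation sketch follows; the prompt and token budget are arbitrary placeholder values:

import asyncio

async def main():
    # Pick a provider by weight and forward the request
    result = await balancer.weighted_round_robin(
        prompt="Summarize the latest deployment incident.",
        max_tokens=256
    )
    print(result)

asyncio.run(main())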
Kubernetes-Native Load Balancing with Nginx Ingress
For self-hosted LLM deployments, configure Nginx Ingress for intelligent load balancing:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-gateway-ingress
  namespace: llm-inference
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/load-balance: "ewma"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/limit-rps: "10"
spec:
  ingressClassName: nginx
  rules:
    - host: llm-gateway.example.com
      http:
        paths:
          - path: /v1/completions
            pathType: Prefix
            backend:
              service:
                name: llm-inference-service
                port:
                  number: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-service
  namespace: llm-inference
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto"
spec:
  type: ClusterIP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
  selector:
    app: llm-inference
Building a Complete LLM Gateway with Envoy
Envoy provides fine-grained control over LLM traffic with advanced rate limiting and load balancing capabilities:
static_resources:
  listeners:
    - name: llm_listener
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 10000
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: llm_gateway
                route_config:
                  name: llm_route
                  virtual_hosts:
                    - name: llm_service
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/v1/"
                          route:
                            cluster: llm_cluster
                            timeout: 300s
                            retry_policy:
                              retry_on: "5xx"
                              num_retries: 2
                              per_try_timeout: 150s
                            rate_limits:
                              - actions:
                                  - request_headers:
                                      header_name: "x-api-key"
                                      descriptor_key: "api_key"
                http_filters:
                  - name: envoy.filters.http.ratelimit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
                      domain: llm_gateway
                      failure_mode_deny: false
                      rate_limit_service:
                        grpc_service:
                          envoy_grpc:
                            cluster_name: rate_limit_cluster
                        transport_api_version: V3
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: llm_cluster
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST
      load_assignment:
        cluster_name: llm_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: openai-api.example.com
                      port_value: 443
              - endpoint:
                  address:
                    socket_address:
                      address: azure-openai.example.com
                      port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    - name: rate_limit_cluster
      type: STRICT_DNS
      connect_timeout: 1s
      lb_policy: ROUND_ROBIN
      protocol_selection: USE_CONFIGURED_PROTOCOL
      http2_protocol_options: {}
      load_assignment:
        cluster_name: rate_limit_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: redis-ratelimit
                      port_value: 8081
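For the rate limit filter above to enforce anything, the external rate limit service needs descriptor rules that match the route’s actions. Assuming the envoyproxy/ratelimit reference service, a configuration for the llm_gateway domain could look like the sketch below; the 3500 requests-per-minute-per-key figure mirrors the Kong example and is an assumption, not a provider requirement:

domain: llm_gateway
descriptors:
  # One bucket per distinct x-api-key value, matching the
  # request_headers action defined on the Envoy route
  - key: api_key
    rate_limit:
      unit: minute
      requests_per_unit: 3500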
Implementing Circuit Breaking for LLM Resilience
Circuit breakers prevent cascading failures when LLM providers experience issues:
from enum import Enum
from datetime import datetime, timedelta


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_seconds=60, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.success_threshold = success_threshold
        self.failure_count = 0
        self.success_count = 0
        self.state = CircuitState.CLOSED
        self.last_failure_time = None

    async def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            # After the cooldown period, allow trial calls in half-open state
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout_seconds):
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN


# Usage with LLM calls (make_llm_request is whatever coroutine performs the
# actual provider call in your application)
circuit_breaker = CircuitBreaker(failure_threshold=5, timeout_seconds=60)

async def call_llm_with_protection(prompt):
    try:
        return await circuit_breaker.call(make_llm_request, prompt)
    except Exception as e:
        print(f"Circuit breaker prevented call: {e}")
        return None
Monitoring and Observability
Effective monitoring is crucial for LLM gateway operations. Deploy Prometheus metrics collection:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'llm-gateway'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - llm-gateway
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: llm-gateway
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: '(llm_requests_total|llm_request_duration_seconds.*|llm_tokens_used_total|llm_rate_limit_exceeded_total)'
            action: keep
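The scrape filter above only keeps metrics the gateway actually exposes, so the gateway process has to publish them. Below is a minimal sketch using prometheus_client; the metric names match the keep-regex in the ConfigMap, while the label sets and port are assumptions to adapt to your own gateway:

from prometheus_client import Counter, Histogram, start_http_server

# Metric names match the keep-regex in the Prometheus ConfigMap above
LLM_REQUESTS = Counter(
    'llm_requests_total', 'LLM requests processed', ['provider', 'model', 'status']
)
LLM_LATENCY = Histogram(
    'llm_request_duration_seconds', 'End-to-end LLM request latency', ['provider', 'model']
)
LLM_TOKENS = Counter(
    'llm_tokens_used_total', 'Tokens consumed', ['provider', 'model', 'kind']
)
LLM_RATE_LIMITED = Counter(
    'llm_rate_limit_exceeded_total', 'Requests rejected by the rate limiter', ['reason']
)

# Expose /metrics for Prometheus to scrape (port is an arbitrary choice)
start_http_server(9090)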
Best Practices and Troubleshooting
Best Practices
- Implement token estimation: Use tiktoken or a similar library to estimate tokens before making requests
- Use exponential backoff: When rate limits are hit, back off exponentially with jitter (a sketch covering both backoff and token estimation follows this list)
- Cache responses: Implement semantic caching for similar queries to reduce API calls
- Monitor costs: Track token usage per user/tenant to prevent cost overruns
- Set request timeouts: LLM requests can hang; always set reasonable timeouts (30-120 seconds)
- Implement graceful degradation: Have fallback responses when all providers are unavailable
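As referenced in the list above, token estimation and backoff pair naturally: estimate a prompt’s size before submitting it, and back off with jitter when the provider still pushes back. The sketch below assumes tiktoken for estimation and a generic call_llm coroutine standing in for your actual provider call:

import asyncio
import random

import tiktoken


def estimate_tokens(text: str, model: str = "gpt-4") -> int:
    # tiktoken maps a model name to its tokenizer; fall back to a
    # general-purpose encoding for models it does not recognize
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


async def call_with_backoff(call_llm, prompt, max_retries=5, base_delay=1.0):
    # Exponential backoff with full jitter; in practice, catch the
    # provider's specific rate-limit (429) error type
    for attempt in range(max_retries):
        try:
            return await call_llm(prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = random.uniform(0, base_delay * (2 ** attempt))
            await asyncio.sleep(delay)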
Common Issues and Solutions
Issue: Rate limits exceeded despite gateway limits
Solution: Account for both request and token limits. First confirm whether RPM or TPM is the binding constraint, then adjust the gateway configuration; a token-bucket sketch that enforces both dimensions follows the commands below:
kubectl logs -n llm-gateway deployment/llm-gateway | grep "rate_limit"
# Check if TPM or RPM is the limiting factor
# Adjust limits in ConfigMap
kubectl edit configmap llm-gateway-config -n llm-gateway
# Update max_tpm and max_rpm values
kubectl rollout restart deployment/llm-gateway -n llm-gateway
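If you want smoother enforcement than the sliding window shown earlier, the same dual limit can be expressed as two token buckets, one refilled in requests and one in tokens. A minimal in-process sketch, with illustrative limits:

import time


class DualTokenBucket:
    def __init__(self, max_rpm=3500, max_tpm=90000):
        # Refill rates derived from the per-minute quotas
        self.request_rate = max_rpm / 60.0
        self.token_rate = max_tpm / 60.0
        self.request_bucket = float(max_rpm)
        self.token_bucket = float(max_tpm)
        self.max_rpm = max_rpm
        self.max_tpm = max_tpm
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.request_bucket = min(self.max_rpm, self.request_bucket + elapsed * self.request_rate)
        self.token_bucket = min(self.max_tpm, self.token_bucket + elapsed * self.token_rate)
        self.last_refill = now

    def try_acquire(self, estimated_tokens):
        # Admit a request only if both buckets have capacity
        self._refill()
        if self.request_bucket < 1 or self.token_bucket < estimated_tokens:
            return False
        self.request_bucket -= 1
        self.token_bucket -= estimated_tokens
        return True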
Issue: Uneven load distribution
Solution: Verify service mesh configuration and check for session affinity:
kubectl describe service llm-inference-service -n llm-inference | grep -A 5 "Session Affinity"
# If session affinity is causing issues, disable it:
kubectl patch service llm-inference-service -n llm-inference \
-p '{"spec":{"sessionAffinity":"None"}}'
Issue: High latency spikes
Solution: Implement connection pooling and increase timeout values:
import httpx
from httpx import Limits

# Configure connection pooling
limits = Limits(
    max_keepalive_connections=100,
    max_connections=200,
    keepalive_expiry=30.0
)

client = httpx.AsyncClient(
    limits=limits,
    timeout=httpx.Timeout(120.0, connect=10.0)
)
Conclusion
Implementing robust LLM gateway patterns for rate limiting and load balancing is essential for production-grade AI applications. By combining token-aware rate limiting, intelligent load balancing across multiple providers, circuit breaking, and comprehensive monitoring, you can build resilient systems that handle LLM traffic efficiently while optimizing costs and maintaining reliability.
Start with simple rate limiting and gradually add sophistication as your traffic grows. Monitor your metrics closely, and adjust your thresholds based on real-world usage patterns. The patterns and code examples provided here offer a solid foundation for building production-ready LLM gateways that scale with your needs.