
LLM Gateway Patterns: Rate Limiting and Load Balancing Guide

As Large Language Models (LLMs) become integral to production applications, managing their API traffic efficiently has become critical. LLM gateways serve as the control plane between your applications and LLM providers like OpenAI, Anthropic, or self-hosted models. In this guide, we’ll explore gateway patterns for rate limiting and load balancing that improve reliability, keep costs under control, and deliver predictable performance.

Why LLM Gateways Matter in Production

Unlike traditional REST APIs, LLM endpoints present unique challenges:

  • Variable response times: Token generation can take seconds to minutes
  • Cost per request: Each API call incurs significant costs based on token usage
  • Rate limits: Provider-imposed constraints on requests per minute (RPM) and tokens per minute (TPM)
  • Model availability: Different models have varying capacity and regional availability

An LLM gateway acts as a unified interface that abstracts these complexities, providing intelligent routing, rate limiting, and load balancing capabilities.
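
Conceptually, every request flows through the same control points before it reaches a provider. The sketch below is illustrative only: the class and method names mirror the components built later in this guide (rate limiter, load balancer, circuit breaker) rather than any specific library.

class LLMGateway:
    """Illustrative request path through an LLM gateway (not a real library)."""
    def __init__(self, rate_limiter, load_balancer, circuit_breaker):
        self.rate_limiter = rate_limiter
        self.load_balancer = load_balancer
        self.circuit_breaker = circuit_breaker

    async def handle(self, prompt: str, estimated_tokens: int, max_tokens: int):
        # 1. Admission control: stay within RPM/TPM budgets
        allowed, reason = self.rate_limiter.can_proceed(estimated_tokens)
        if not allowed:
            raise RuntimeError(f"Rejected by rate limiter: {reason}")

        # 2. Routing and resilience: pick a provider and guard the call
        return await self.circuit_breaker.call(
            self.load_balancer.weighted_round_robin, prompt, max_tokens
        )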

Understanding Rate Limiting Patterns for LLM APIs

Rate limiting for LLMs requires sophisticated approaches beyond simple request counting. You need to account for both request frequency and token consumption.

Token-Aware Rate Limiting

Traditional rate limiters count requests, but LLM providers limit both requests and tokens. A token-aware rate limiter tracks both dimensions:

import time
from collections import deque
from threading import Lock

class TokenAwareRateLimiter:
    def __init__(self, max_rpm=3500, max_tpm=90000, window_seconds=60):
        self.max_rpm = max_rpm
        self.max_tpm = max_tpm
        self.window_seconds = window_seconds
        self.requests = deque()
        self.tokens = deque()
        self.lock = Lock()
    
    def can_proceed(self, estimated_tokens):
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds
            
            # Remove old entries
            while self.requests and self.requests[0] < cutoff:
                self.requests.popleft()
            while self.tokens and self.tokens[0][0] < cutoff:
                self.tokens.popleft()
            
            # Check limits
            current_rpm = len(self.requests)
            current_tpm = sum(t[1] for t in self.tokens)
            
            if current_rpm >= self.max_rpm:
                return False, "RPM limit exceeded"
            if current_tpm + estimated_tokens > self.max_tpm:
                return False, "TPM limit exceeded"
            
            # Record request
            self.requests.append(now)
            self.tokens.append((now, estimated_tokens))
            return True, "OK"

# Usage
limiter = TokenAwareRateLimiter(max_rpm=3500, max_tpm=90000)
can_proceed, message = limiter.can_proceed(estimated_tokens=1500)

if can_proceed:
    # Make LLM API call
    pass
else:
    print(f"Rate limit: {message}")

Implementing Rate Limiting with Kong Gateway

Kong provides a robust platform for LLM gateway patterns. Its rate-limiting-advanced plugin enforces request-rate limits at the edge (token-level budgets still need application-side logic like the limiter above). Here’s a configuration for a sliding 60-second window of 3,500 requests, backed by Redis:

apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: llm-rate-limiting
  namespace: llm-gateway
plugin: rate-limiting-advanced
config:
  limit:
    - 3500
  window_size:
    - 60
  window_type: sliding
  retry_after_jitter_max: 0
  enforce_consumer_groups: false
  consumer_groups: []
  dictionary_name: kong_rate_limiting_counters
  sync_rate: -1
  namespace: llm-gateway
  strategy: redis
  redis:
    host: redis-service
    port: 6379
    timeout: 2000
    database: 0
---
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: llm-request-transformer
  namespace: llm-gateway
plugin: request-transformer-advanced
config:
  add:
    headers:
      - X-RateLimit-Limit:3500
      - X-RateLimit-Window:60

Apply the plugin to your LLM service:

kubectl apply -f llm-rate-limiting-plugin.yaml
kubectl annotate service openai-service \
  konghq.com/plugins=llm-rate-limiting,llm-request-transformer

Load Balancing Strategies for LLM Gateways

Load balancing LLM traffic requires strategies that account for model-specific characteristics, provider quotas, and cost optimization.

Multi-Provider Load Balancing

Distributing traffic across multiple LLM providers (OpenAI, Anthropic, Azure OpenAI) increases reliability and helps avoid rate limits:

import random
import httpx
from typing import List, Dict

class LLMLoadBalancer:
    def __init__(self, providers: List[Dict]):
        self.providers = providers
        self.health_status = {p['name']: True for p in providers}
    
    async def weighted_round_robin(self, prompt: str, max_tokens: int):
        # Weighted random selection across healthy providers (a simple
        # approximation of strict weighted round-robin)
        available = [
            p for p in self.providers 
            if self.health_status[p['name']]
        ]
        
        if not available:
            raise Exception("No healthy providers available")
        
        # Weighted selection based on priority
        weights = [p.get('weight', 1) for p in available]
        provider = random.choices(available, weights=weights)[0]
        
        return await self._call_provider(provider, prompt, max_tokens)
    
    async def _call_provider(self, provider: Dict, prompt: str, max_tokens: int):
        # Assumes an OpenAI-compatible chat completions API; Azure OpenAI
        # (api-key header) and Anthropic (x-api-key plus anthropic-version)
        # use different headers and payload shapes in practice
        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    provider['endpoint'],
                    headers={
                        'Authorization': f"Bearer {provider['api_key']}",
                        'Content-Type': 'application/json'
                    },
                    json={
                        'model': provider['model'],
                        'messages': [{'role': 'user', 'content': prompt}],
                        'max_tokens': max_tokens
                    },
                    timeout=30.0
                )
                return response.json()
        except Exception:
            # Mark the provider unhealthy; restoring it requires a separate
            # health-check loop, omitted here for brevity
            self.health_status[provider['name']] = False
            raise

# Configuration
providers = [
    {
        'name': 'openai-primary',
        'endpoint': 'https://api.openai.com/v1/chat/completions',
        'api_key': 'sk-...',
        'model': 'gpt-4',
        'weight': 3
    },
    {
        'name': 'azure-openai-backup',
        'endpoint': 'https://your-resource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2023-05-15',
        'api_key': 'azure-key',
        'model': 'gpt-4',
        'weight': 2
    },
    {
        'name': 'anthropic-fallback',
        'endpoint': 'https://api.anthropic.com/v1/messages',
        'api_key': 'sk-ant-...',
        'model': 'claude-3-opus-20240229',
        'weight': 1
    }
]

balancer = LLMLoadBalancer(providers)
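
A minimal usage sketch, assuming an asyncio entrypoint (the API keys above are placeholders):

import asyncio

async def main():
    result = await balancer.weighted_round_robin(
        prompt="Summarize the benefits of an LLM gateway.",
        max_tokens=200
    )
    print(result)

asyncio.run(main())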

Kubernetes-Native Load Balancing with Nginx Ingress

For self-hosted LLM deployments, configure Nginx Ingress for intelligent load balancing:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-gateway-ingress
  namespace: llm-inference
  annotations:
    nginx.ingress.kubernetes.io/load-balance: "ewma"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/limit-rps: "10"
spec:
  ingressClassName: nginx
  rules:
  - host: llm-gateway.example.com
    http:
      paths:
      - path: /v1/completions
        pathType: Prefix
        backend:
          service:
            name: llm-inference-service
            port:
              number: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: llm-inference-service
  namespace: llm-inference
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto"
spec:
  type: ClusterIP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
  selector:
    app: llm-inference

Building a Complete LLM Gateway with Envoy

Envoy provides fine-grained control over LLM traffic with advanced rate limiting and load balancing capabilities:

static_resources:
  listeners:
  - name: llm_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: llm_gateway
          route_config:
            name: llm_route
            virtual_hosts:
            - name: llm_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/v1/"
                route:
                  cluster: llm_cluster
                  timeout: 300s
                  retry_policy:
                    retry_on: "5xx"
                    num_retries: 2
                    per_try_timeout: 150s
              rate_limits:
              - actions:
                - request_headers:
                    header_name: "x-api-key"
                    descriptor_key: "api_key"
          http_filters:
          - name: envoy.filters.http.ratelimit
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
              domain: llm_gateway
              failure_mode_deny: false
              rate_limit_service:
                grpc_service:
                  envoy_grpc:
                    cluster_name: rate_limit_cluster
                transport_api_version: V3
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  
  clusters:
  - name: llm_cluster
    connect_timeout: 5s
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST
    load_assignment:
      cluster_name: llm_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: openai-api.example.com
                port_value: 443
        - endpoint:
            address:
              socket_address:
                address: azure-openai.example.com
                port_value: 443
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
  
  - name: rate_limit_cluster
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    protocol_selection: USE_CONFIGURED_PROTOCOL
    http2_protocol_options: {}
    load_assignment:
      cluster_name: rate_limit_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: redis-ratelimit
                port_value: 8081

Implementing Circuit Breaking for LLM Resilience

Circuit breakers prevent cascading failures when LLM providers experience issues:

import asyncio
from enum import Enum
from datetime import datetime, timedelta

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_seconds=60, success_threshold=2):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.success_threshold = success_threshold
        self.failure_count = 0
        self.success_count = 0
        self.state = CircuitState.CLOSED
        self.last_failure_time = None
    
    async def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout_seconds):
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                raise Exception("Circuit breaker is OPEN")
        
        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e
    
    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0
    
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage with LLM calls
circuit_breaker = CircuitBreaker(failure_threshold=5, timeout_seconds=60)

async def call_llm_with_protection(prompt):
    try:
        return await circuit_breaker.call(make_llm_request, prompt)
    except Exception as e:
        print(f"Circuit breaker prevented call: {e}")
        return None
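
The circuit breaker pairs naturally with the multi-provider balancer from earlier: give each provider its own breaker so one failing provider is isolated while the others keep serving traffic. A hedged sketch reusing the classes defined above (resilient_completion is an illustrative helper, not part of either class):

import random

# One breaker per provider so a failing provider is isolated from the others
breakers = {p['name']: CircuitBreaker(failure_threshold=5, timeout_seconds=60)
            for p in providers}

async def resilient_completion(balancer, prompt, max_tokens):
    # Try providers in weighted random order; move on if a breaker is open
    # or the call fails, and give up only when every provider has been tried
    remaining = list(balancer.providers)
    while remaining:
        weights = [p.get('weight', 1) for p in remaining]
        provider = random.choices(remaining, weights=weights)[0]
        remaining.remove(provider)
        try:
            return await breakers[provider['name']].call(
                balancer._call_provider, provider, prompt, max_tokens
            )
        except Exception:
            continue
    raise Exception("No provider could serve the request")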

Monitoring and Observability

Effective monitoring is crucial for LLM gateway operations. Deploy Prometheus metrics collection:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    
    scrape_configs:
    - job_name: 'llm-gateway'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - llm-gateway
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: llm-gateway
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      
      metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(llm_requests_total|llm_request_duration_seconds.*|llm_tokens_used_total|llm_rate_limit_exceeded_total)'
        action: keep
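
The keep regex above assumes the gateway actually exports those four metrics. A sketch of how they might be emitted from a Python gateway process with the prometheus_client library (the instrumentation points are illustrative):

from prometheus_client import Counter, Histogram, start_http_server

# Metric names match the keep regex in the Prometheus config above
LLM_REQUESTS = Counter('llm_requests_total', 'Total LLM requests',
                       ['provider', 'model', 'status'])
LLM_LATENCY = Histogram('llm_request_duration_seconds', 'LLM request latency',
                        ['provider', 'model'])
LLM_TOKENS = Counter('llm_tokens_used_total', 'Tokens consumed',
                     ['provider', 'model'])
RATE_LIMITED = Counter('llm_rate_limit_exceeded_total',
                       'Requests rejected by the rate limiter', ['reason'])

# Expose /metrics for the Prometheus scrape job
start_http_server(8000)

# Example instrumentation around a provider call:
# with LLM_LATENCY.labels('openai-primary', 'gpt-4').time():
#     response = await balancer.weighted_round_robin(prompt, max_tokens)
# LLM_REQUESTS.labels('openai-primary', 'gpt-4', 'success').inc()
# LLM_TOKENS.labels('openai-primary', 'gpt-4').inc(response['usage']['total_tokens'])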

Best Practices and Troubleshooting

Best Practices

  • Implement token estimation: Use tiktoken or similar libraries to estimate tokens before making requests
  • Use exponential backoff: When rate limits are hit, implement exponential backoff with jitter (both practices are sketched after this list)
  • Cache responses: Implement semantic caching for similar queries to reduce API calls
  • Monitor costs: Track token usage per user/tenant to prevent cost overruns
  • Set request timeouts: LLM requests can hang; always set reasonable timeouts (30-120 seconds)
  • Implement graceful degradation: Have fallback responses when all providers are unavailable
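
The first two practices plug directly into the TokenAwareRateLimiter from earlier. A sketch that estimates prompt tokens with tiktoken and retries with capped exponential backoff and full jitter when the limiter (or the provider) refuses the request (make_request is a placeholder for your actual API call):

import random
import time
import tiktoken

def estimate_tokens(text, model="gpt-4"):
    """Rough prompt-token estimate; the response budget is added separately."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def call_with_backoff(limiter, make_request, prompt,
                      max_retries=5, base_delay=1.0, max_delay=60.0):
    estimated = estimate_tokens(prompt) + 500  # reserve headroom for the completion
    for attempt in range(max_retries):
        ok, _reason = limiter.can_proceed(estimated)
        if ok:
            try:
                return make_request(prompt)
            except Exception:
                pass  # treat provider-side 429/5xx like a local rejection
        # Exponential backoff with full jitter
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    raise RuntimeError("Exhausted retries (rate limited or provider errors)")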

Common Issues and Solutions

Issue: Rate limits exceeded despite gateway limits

Solution: Account for both request and token limits. First check whether RPM or TPM is the actual bottleneck and adjust the gateway configuration; if the sliding window is still too bursty, a token-bucket limiter (sketched after the commands below) smooths consumption:

kubectl logs -n llm-gateway deployment/llm-gateway | grep "rate_limit"
# Check if TPM or RPM is the limiting factor

# Adjust limits in ConfigMap
kubectl edit configmap llm-gateway-config -n llm-gateway
# Update max_tpm and max_rpm values

kubectl rollout restart deployment/llm-gateway -n llm-gateway
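
If the sliding window proves too bursty, a token bucket smooths consumption by refilling the TPM budget continuously instead of resetting it per window. A minimal sketch (capacity and refill rate here assume the 90,000 TPM budget used earlier):

import time
from threading import Lock

class TokenBucket:
    """Token bucket sized to the TPM budget, refilled continuously."""
    def __init__(self, capacity=90000, refill_per_second=90000 / 60):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.available = float(capacity)
        self.last_refill = time.monotonic()
        self.lock = Lock()

    def try_consume(self, tokens):
        with self.lock:
            now = time.monotonic()
            self.available = min(
                self.capacity,
                self.available + (now - self.last_refill) * self.refill_per_second
            )
            self.last_refill = now
            if tokens <= self.available:
                self.available -= tokens
                return True
            return False

# bucket = TokenBucket()
# if bucket.try_consume(estimated_tokens):
#     ...make the API call...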

Issue: Uneven load distribution

Solution: Verify service mesh configuration and check for session affinity:

kubectl describe service llm-inference-service -n llm-inference | grep -A 5 "Session Affinity"

# If session affinity is causing issues, disable it:
kubectl patch service llm-inference-service -n llm-inference \
  -p '{"spec":{"sessionAffinity":"None"}}'

Issue: High latency spikes

Solution: Implement connection pooling and increase timeout values:

import httpx
from httpx import Limits

# Configure connection pooling
limits = Limits(
    max_keepalive_connections=100,
    max_connections=200,
    keepalive_expiry=30.0
)

client = httpx.AsyncClient(
    limits=limits,
    timeout=httpx.Timeout(120.0, connect=10.0)
)
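
Note that the LLMLoadBalancer example earlier opens a new AsyncClient per request, which bypasses this pool. Creating the client once and reusing it keeps connections warm; a small illustrative variant that posts through the pooled client defined above:

# Post through the single pooled client defined above instead of opening a
# new AsyncClient per request (as the earlier balancer example does)
async def call_provider_pooled(provider, payload):
    response = await client.post(
        provider['endpoint'],
        headers={'Authorization': f"Bearer {provider['api_key']}"},
        json=payload
    )
    response.raise_for_status()
    return response.json()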

Conclusion

Implementing robust LLM gateway patterns for rate limiting and load balancing is essential for production-grade AI applications. By combining token-aware rate limiting, intelligent load balancing across multiple providers, circuit breaking, and comprehensive monitoring, you can build resilient systems that handle LLM traffic efficiently while optimizing costs and maintaining reliability.

Start with simple rate limiting and gradually add sophistication as your traffic grows. Monitor your metrics closely, and adjust your thresholds based on real-world usage patterns. The patterns and code examples provided here offer a solid foundation for building production-ready LLM gateways that scale with your needs.

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.