Benchmarking LLMs: A Comprehensive Performance Guide
Introduction
In the rapidly evolving landscape of AI development, understanding how your models perform locally is crucial for building responsive, production-ready applications. Just as NVIDIA’s Jetson AI Lab provides comprehensive benchmarking for edge AI devices, Docker Model Runner brings similar capabilities to your local development environment—whether you’re on a MacBook with Apple Silicon, a Windows PC with NVIDIA GPU, or a Linux workstation.
This tutorial will walk you through benchmarking Large Language Models (LLMs) using Docker Model Runner, focusing on critical performance metrics that matter for real-world applications. We’ll measure speed, throughput, and resource utilization—not model quality or accuracy—to help you answer questions like:
- How long will users wait before seeing their first response token?
- How many concurrent users can my local setup handle?
- Which quantization offers the best speed-to-quality trade-off?
- How does my hardware perform under different loads?
What You’ll Learn
By the end of this tutorial, you’ll be able to:
- Set up Docker Model Runner for performance testing
- Benchmark single-user and multi-user scenarios
- Measure critical performance metrics (TTFT, throughput, latency)
- Compare different models and quantization levels
- Optimize your local LLM deployment for production
Prerequisites
Before you begin, ensure you have:
- Docker Desktop 4.40+ (macOS on Apple Silicon) or Docker Engine (Linux with NVIDIA GPU)
- Hardware Requirements:
- Minimum 16GB RAM (32GB+ recommended)
- For best performance: VRAM + RAM ≥ model size
- Apple Silicon (M1/M2/M3/M4) or NVIDIA GPU
- Basic familiarity with Docker and command-line tools
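The "VRAM + RAM ≥ model size" rule of thumb can be sanity-checked with quick arithmetic: a quantized model needs roughly parameters × bits-per-weight ÷ 8 bytes for weights, plus headroom for the KV cache and runtime. The sketch below is an approximation, not an official formula; the ~4.5 bits/weight for Q4_K_M and the 1.2× overhead factor are assumptions:

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: float = 4.5,
                             overhead_factor: float = 1.2) -> float:
    """Rough memory footprint of a quantized model.

    bits_per_weight ~4.5 approximates Q4_K_M; overhead_factor is a loose
    allowance for KV cache and runtime buffers (both are assumptions).
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead_factor / (1024 ** 3)

# A 3B model at ~4.5 bits/weight lands around the ~2GB figure used below
print(f"{estimate_model_memory_gb(3):.1f} GB")
```

If the estimate exceeds your combined VRAM + RAM, pick a smaller model or a lower-bit quantization before benchmarking.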
Understanding Performance Metrics
Before diving into benchmarking, let’s define the three key metrics we’ll focus on:
1. Time to First Token (TTFT)
What it measures: How long a user waits before the model starts generating a response.
Why it matters: This is the perceived latency from the user’s perspective. Lower TTFT means your application feels snappier and more responsive.
Target values:
- Excellent: < 500ms
- Good: 500ms – 1000ms
- Acceptable: 1000ms – 2000ms
- Poor: > 2000ms
2. Time Per Output Token (TPOT)
What it measures: The average time to generate each token after the first one.
Why it matters: This determines how fast text streams to the user. Lower TPOT means faster reading experience.
Target values:
- Excellent: < 50ms (20+ tokens/sec)
- Good: 50ms – 100ms (10-20 tokens/sec)
- Acceptable: 100ms – 200ms (5-10 tokens/sec)
- Poor: > 200ms (< 5 tokens/sec)
3. Throughput
What it measures: Total tokens generated per second across all requests.
Why it matters: This indicates how many users your system can serve simultaneously.
Target values:
- Single user: 10-30 tokens/sec
- Multi-user (8 concurrent): 40-100 tokens/sec total
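All three metrics fall out of the same three timestamps per request: start, first-token arrival, and completion. As a minimal sketch (function and variable names here are illustrative, not taken from the benchmark script used later in this tutorial):

```python
def compute_metrics(start: float, first_token: float, end: float, tokens: int):
    """Derive TTFT, TPOT, and per-request throughput from timestamps.

    TTFT covers start -> first token; TPOT averages the remaining
    (tokens - 1) tokens over the generation window; throughput counts
    all tokens over the full request duration.
    """
    ttft = first_token - start
    tpot = (end - first_token) / (tokens - 1) if tokens > 1 else 0.0
    throughput = tokens / (end - start)
    return ttft, tpot, throughput

# 129 tokens: first after 0.4s, done at 6.8s
ttft, tpot, tput = compute_metrics(start=0.0, first_token=0.4, end=6.8, tokens=129)
print(f"TTFT={ttft*1000:.0f}ms TPOT={tpot*1000:.0f}ms {tput:.1f} tok/s")
# → TTFT=400ms TPOT=50ms 19.0 tok/s
```

Note that per-request throughput (19 tok/s here) and aggregate throughput across concurrent requests are different numbers; the multi-user benchmarks later measure the latter against wall-clock time.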
Part 1: Setting Up Docker Model Runner
Enable Docker Model Runner
First, enable Docker Model Runner in Docker Desktop or Docker Engine:
Via Docker Desktop GUI:
- Open Docker Desktop
- Navigate to Settings → Features in development
- Enable “Docker Model Runner”
- Click “Apply & Restart”
Via CLI (Linux):
docker model version
Expected output:
Docker Model Runner version v1.0.4
Docker Engine Kind: Docker Engine
Choose Your Model
For this tutorial, we’ll use Llama 3.2 3B with 4-bit quantization (Q4_K_M), which offers an excellent balance between performance and quality. This model:
- Requires ~2GB RAM
- Runs efficiently on most modern hardware
- Provides good inference speed
- Maintains reasonable quality for testing

# Pull the model from Docker Hub
docker model pull ai/llama3.2:3B-Q4_K_M
# Verify the model is available
docker model list
Alternative models to consider:
- ai/smollm3:Q4_K_M – Smaller and faster
- ai/llama3.2:8B-Q4_K_M – Larger, more capable
- ai/qwen3:8B-Q4_K_M – Excellent for tool calling
- ai/phi4:Q4_K_M – Microsoft's efficient model
Part 2: Setting Up the Benchmark Environment
Install Python Dependencies
We’ll use Python to interact with Docker Model Runner’s OpenAI-compatible API:
# Create a virtual environment
python3 -m venv dmr-bench
source dmr-bench/bin/activate # On Windows: dmr-bench\Scripts\activate
# Install required packages
pip install openai requests python-dotenv pandas matplotlib
Create the Benchmark Script
Create a file called benchmark_dmr.py:
#!/usr/bin/env python3
"""
Docker Model Runner Benchmark Script
Measures TTFT, TPOT, and throughput for local LLMs
"""
import asyncio
import statistics
import time
from typing import Dict

from openai import OpenAI

# Configure Docker Model Runner client
client = OpenAI(
    base_url="http://localhost:12434/v1",
    api_key="dmr-local"  # Not used but required by the OpenAI SDK
)

class BenchmarkMetrics:
    def __init__(self):
        self.ttft_list = []
        self.tpot_list = []
        self.total_tokens = 0
        self.total_time = 0
        self.request_count = 0

    def add_request(self, ttft: float, tpot: float, tokens: int, duration: float):
        self.ttft_list.append(ttft)
        self.tpot_list.append(tpot)
        self.total_tokens += tokens
        self.total_time += duration
        self.request_count += 1

    def get_summary(self) -> Dict:
        return {
            "avg_ttft_ms": statistics.mean(self.ttft_list) * 1000,
            "p50_ttft_ms": statistics.median(self.ttft_list) * 1000,
            "p95_ttft_ms": statistics.quantiles(self.ttft_list, n=20)[18] * 1000
                           if len(self.ttft_list) > 1 else 0,
            "avg_tpot_ms": statistics.mean(self.tpot_list) * 1000,
            "tokens_per_sec": self.total_tokens / self.total_time if self.total_time > 0 else 0,
            "total_requests": self.request_count,
            "total_tokens": self.total_tokens,
            "total_duration_sec": self.total_time
        }
def generate_test_prompt(input_tokens: int = 2048) -> str:
    """Generate a synthetic prompt of approximately the specified length."""
    # Average English word is ~5 characters; tokens are ~4 characters,
    # so roughly 4/5 of a word per token
    words_needed = input_tokens * 4 // 5  # Rough approximation
    base_text = (
        "You are an expert AI assistant helping with software development. "
        "Please analyze the following code and provide detailed recommendations for "
        "optimization, security improvements, and best practices. "
    )
    # Pad with realistic code-like text (~6 words per repetition)
    padding = "function example() { console.log('test'); } " * (words_needed // 6)
    return base_text + padding
async def benchmark_single_request(
    model: str,
    prompt: str,
    max_tokens: int = 128
) -> Dict:
    """Benchmark a single inference request.

    The OpenAI client is synchronous, so the blocking stream runs in a
    worker thread; otherwise asyncio.gather would serialize requests and
    the concurrency benchmarks would be meaningless.
    """
    def _timed_request() -> Dict:
        start_time = time.time()
        first_token_time = None
        tokens_generated = 0
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            stream=True,
            temperature=0.7
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_time is None:
                    first_token_time = time.time()
                tokens_generated += 1  # Each streamed delta is roughly one token
        end_time = time.time()
        ttft = first_token_time - start_time if first_token_time else 0
        total_duration = end_time - start_time
        generation_time = end_time - first_token_time if first_token_time else total_duration
        # TPOT excludes the first token, so divide by (tokens - 1)
        tpot = generation_time / (tokens_generated - 1) if tokens_generated > 1 else 0
        return {
            "ttft": ttft,
            "tpot": tpot,
            "tokens": tokens_generated,
            "duration": total_duration,
            "success": True
        }

    try:
        return await asyncio.to_thread(_timed_request)
    except Exception as e:
        print(f"Error during request: {e}")
        return {"success": False, "error": str(e)}
async def run_benchmark(
    model: str,
    num_requests: int = 10,
    concurrency: int = 1,
    input_tokens: int = 2048,
    output_tokens: int = 128
):
    """Run benchmark with specified concurrency"""
    print(f"\n{'='*70}")
    print("Benchmark Configuration:")
    print(f"  Model: {model}")
    print(f"  Requests: {num_requests}")
    print(f"  Concurrency: {concurrency}")
    print(f"  Input tokens: ~{input_tokens}")
    print(f"  Output tokens: {output_tokens}")
    print(f"{'='*70}\n")
    metrics = BenchmarkMetrics()
    prompt = generate_test_prompt(input_tokens)

    # Warm-up request (not counted)
    print("Running warm-up request...")
    await benchmark_single_request(model, prompt, output_tokens)
    print("Warm-up complete. Starting benchmark...\n")

    # Each batch holds `concurrency` in-flight requests; a final smaller
    # batch picks up any remainder
    batches = [concurrency] * (num_requests // concurrency)
    if num_requests % concurrency:
        batches.append(num_requests % concurrency)

    bench_start = time.time()
    for batch_num, batch_size in enumerate(batches, start=1):
        print(f"Running batch {batch_num}/{len(batches)} ({batch_size} requests)...")
        tasks = [
            benchmark_single_request(model, prompt, output_tokens)
            for _ in range(batch_size)
        ]
        results = await asyncio.gather(*tasks)
        for result in results:
            if result.get("success"):
                metrics.add_request(
                    result["ttft"],
                    result["tpot"],
                    result["tokens"],
                    result["duration"]
                )
        print(f"  Completed {metrics.request_count}/{num_requests} requests")

    wall_time = time.time() - bench_start
    summary = metrics.get_summary()
    # Aggregate throughput must use wall-clock time; summing per-request
    # durations would understate it whenever requests overlap
    if wall_time > 0:
        summary["tokens_per_sec"] = metrics.total_tokens / wall_time
        summary["total_duration_sec"] = wall_time
    return summary
def print_results(results: Dict, scenario: str):
    """Pretty print benchmark results"""
    print(f"\n{'='*70}")
    print(f"Results: {scenario}")
    print(f"{'='*70}")
    print("Time to First Token (TTFT):")
    print(f"  Average: {results['avg_ttft_ms']:.2f}ms")
    print(f"  Median (P50): {results['p50_ttft_ms']:.2f}ms")
    print(f"  95th Percentile (P95): {results['p95_ttft_ms']:.2f}ms")
    print("\nTime Per Output Token (TPOT):")
    print(f"  Average: {results['avg_tpot_ms']:.2f}ms")
    print("\nThroughput:")
    print(f"  Tokens/sec: {results['tokens_per_sec']:.2f}")
    print(f"  Total tokens: {results['total_tokens']}")
    print(f"  Total requests: {results['total_requests']}")
    print(f"  Total duration: {results['total_duration_sec']:.2f}s")
    print(f"{'='*70}\n")
async def main():
    """Main benchmark execution"""
    model = "ai/llama3.2:3B-Q4_K_M"

    # Verify model is running
    print("Verifying model availability...")
    try:
        client.models.list()
        print("✓ Model Runner is accessible at http://localhost:12434")
    except Exception as e:
        print("✗ Error: Cannot connect to Model Runner. Is it running?")
        print(f"  Error: {e}")
        return

    # Benchmark 1: Single User Performance
    print("\n" + "="*70)
    print("BENCHMARK 1: Single User Performance (Concurrency = 1)")
    print("="*70)
    results_single = await run_benchmark(
        model=model,
        num_requests=10,
        concurrency=1,
        input_tokens=2048,
        output_tokens=128
    )
    print_results(results_single, "Single User (Concurrency = 1)")

    # Benchmark 2: Multi-User Performance
    print("\n" + "="*70)
    print("BENCHMARK 2: Multi-User Performance (Concurrency = 4)")
    print("="*70)
    results_multi = await run_benchmark(
        model=model,
        num_requests=20,
        concurrency=4,
        input_tokens=2048,
        output_tokens=128
    )
    print_results(results_multi, "Multi-User (Concurrency = 4)")

    # Comparison
    print("\n" + "="*70)
    print("PERFORMANCE COMPARISON")
    print("="*70)
    print(f"{'Metric':<30} {'Single User':<20} {'Multi-User':<20}")
    print("-" * 70)
    print(f"{'Avg TTFT':<30} {results_single['avg_ttft_ms']:>10.2f}ms {results_multi['avg_ttft_ms']:>10.2f}ms")
    print(f"{'Avg TPOT':<30} {results_single['avg_tpot_ms']:>10.2f}ms {results_multi['avg_tpot_ms']:>10.2f}ms")
    print(f"{'Throughput (tokens/sec)':<30} {results_single['tokens_per_sec']:>10.2f} {results_multi['tokens_per_sec']:>10.2f}")
    print("="*70 + "\n")

if __name__ == "__main__":
    asyncio.run(main())
Part 3: Running Your First Benchmark
Step 1: Start the Model
First, ensure your model is loaded and ready:
# Start the model (loads it into memory; the API server is already running)
docker model run ai/llama3.2:3B-Q4_K_M
# In another terminal, verify it's running
curl http://localhost:12434/v1/models
Result:
{"object":"list","data":[{"id":"ai/llama3.2:latest","object":"model","created":1742916473,"owned_by":"docker"}]}
Step 2: Run the Benchmark
Execute the benchmark script:
python3 benchmark_dmr.py
Expected Output
You should see output similar to this:
======================================================================
BENCHMARK 1: Single User Performance (Concurrency = 1)
======================================================================
Running warm-up request...
Warm-up complete. Starting benchmark...
Running batch 1/1 (10 requests)...
  Completed 10/10 requests

======================================================================
Results: Single User (Concurrency = 1)
======================================================================
Time to First Token (TTFT):
  Average: 247.32ms
  Median (P50): 241.18ms
  95th Percentile (P95): 312.45ms

Time Per Output Token (TPOT):
  Average: 45.67ms

Throughput:
  Tokens/sec: 21.89
  Total tokens: 1280
  Total requests: 10
  Total duration: 58.47s
======================================================================
Part 4: Understanding Your Results
Interpreting TTFT (Time to First Token)
What the numbers mean:
- < 500ms: Excellent – feels instant to users
- 500ms – 1s: Good – barely noticeable delay
- 1s – 2s: Acceptable – users will notice but tolerate
- > 2s: Poor – users may perceive it as slow
Factors affecting TTFT:
- Model size: Larger models take longer to process the prompt
- Input length: Longer prompts increase TTFT linearly
- Hardware: Better CPU/GPU reduces TTFT
- Quantization: higher-bit quantizations (Q8) are slower than lower-bit ones (Q4)
Optimization tips:
# Use smaller quantization for faster TTFT
docker model pull ai/llama3.2:3B-Q4_K_S # Faster but lower quality
# Or use a smaller model entirely
docker model pull ai/smollm3:Q4_K_M # Much faster, less capable
Interpreting TPOT (Time Per Output Token)
What the numbers mean:
- < 50ms (20+ tokens/sec): Excellent – text streams smoothly
- 50-100ms (10-20 tokens/sec): Good – readable streaming
- 100-200ms (5-10 tokens/sec): Acceptable – slow but usable
- > 200ms (< 5 tokens/sec): Poor – painfully slow
Optimization tips:
# For M-series Macs, Metal acceleration is automatic
# For NVIDIA GPUs on Linux, ensure GPU access:
docker run --gpus all ...
# Reduce context window if not needed
# Modify your code to use smaller max_tokens
Interpreting Throughput
Throughput = Total tokens / Total time
What it tells you:
- Single user: Your best-case scenario
- Multi-user: Real-world performance under load
Typical patterns:
Single user: 20 tokens/sec ← Maximum possible for one request
Multi-user: 60 tokens/sec ← 4 concurrent requests, 15 tokens/sec each
Part 5: Advanced Benchmarking Scenarios
Scenario 1: Comparing Quantization Levels
Different quantizations trade quality for speed. Let’s benchmark them:
async def compare_quantizations():
    """Compare different quantization levels"""
    import subprocess
    quantizations = [
        "ai/llama3.2:3B-Q4_K_S",  # Smallest, fastest
        "ai/llama3.2:3B-Q4_K_M",  # Balanced
        "ai/llama3.2:3B-Q6_K",    # Higher quality
        "ai/llama3.2:3B-Q8_0",    # Highest quality
    ]
    results = {}
    for model in quantizations:
        print(f"\nBenchmarking {model}...")
        # Pull model if not available
        subprocess.run(["docker", "model", "pull", model], check=True)
        # Run benchmark
        result = await run_benchmark(
            model=model,
            num_requests=5,
            concurrency=1,
            input_tokens=1024,
            output_tokens=100
        )
        results[model] = result
    # Print comparison
    print("\n" + "="*90)
    print("QUANTIZATION COMPARISON")
    print("="*90)
    print(f"{'Model':<35} {'TTFT (ms)':<15} {'TPOT (ms)':<15} {'Throughput':<15}")
    print("-"*90)
    for model, metrics in results.items():
        quant = model.split(':')[1]
        print(f"{quant:<35} {metrics['avg_ttft_ms']:<15.2f} {metrics['avg_tpot_ms']:<15.2f} {metrics['tokens_per_sec']:<15.2f}")
    print("="*90)

# Add to main()
# await compare_quantizations()
Scenario 2: Stress Testing with High Concurrency
Simulate heavy load to find your system’s limits:
async def stress_test():
    """Find the breaking point of your system"""
    model = "ai/llama3.2:3B-Q4_K_M"
    concurrency_levels = [1, 2, 4, 8, 16, 32]
    print("\n" + "="*70)
    print("STRESS TEST: Finding System Limits")
    print("="*70)
    for concurrency in concurrency_levels:
        print(f"\nTesting concurrency level: {concurrency}")
        result = await run_benchmark(
            model=model,
            num_requests=concurrency * 2,  # 2 requests per concurrent user
            concurrency=concurrency,
            input_tokens=1024,
            output_tokens=100
        )
        print(f"  Throughput: {result['tokens_per_sec']:.2f} tokens/sec")
        print(f"  Avg TTFT: {result['avg_ttft_ms']:.2f}ms")
        # Stop if performance degrades significantly
        if result['avg_ttft_ms'] > 5000:  # 5 second TTFT threshold
            print(f"\n⚠ Performance degraded significantly at concurrency={concurrency}")
            print(f"  Recommended max concurrency: {concurrency // 2}")
            break

# Add to main()
# await stress_test()
Scenario 3: Different Context Lengths
Test how input length affects performance:
async def benchmark_context_lengths():
    """Test performance across different input sizes"""
    model = "ai/llama3.2:3B-Q4_K_M"
    context_lengths = [512, 1024, 2048, 4096, 8192]
    print("\n" + "="*70)
    print("CONTEXT LENGTH IMPACT")
    print("="*70)
    results = {}
    for length in context_lengths:
        print(f"\nTesting input length: {length} tokens")
        result = await run_benchmark(
            model=model,
            num_requests=5,
            concurrency=1,
            input_tokens=length,
            output_tokens=128
        )
        results[length] = result
    # Print comparison
    print("\n" + "="*80)
    print(f"{'Context Length':<20} {'TTFT (ms)':<15} {'TPOT (ms)':<15} {'Throughput':<15}")
    print("-"*80)
    for length, metrics in results.items():
        print(f"{length:<20} {metrics['avg_ttft_ms']:<15.2f} {metrics['avg_tpot_ms']:<15.2f} {metrics['tokens_per_sec']:<15.2f}")
    print("="*80)

# Add to main()
# await benchmark_context_lengths()
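Once the sweep finishes, it helps to persist the summaries so you can chart TTFT growth later with the pandas/matplotlib packages installed earlier. A small standard-library sketch; it assumes the `results` dict has the same shape as in benchmark_context_lengths (context length mapped to a run_benchmark summary), and the filename is arbitrary:

```python
import csv

def save_context_results(results: dict, path: str = "context_benchmark.csv"):
    """Write one row per context length with the headline metrics."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["context_length", "avg_ttft_ms", "avg_tpot_ms", "tokens_per_sec"])
        for length, m in sorted(results.items()):
            writer.writerow([length,
                             round(m["avg_ttft_ms"], 2),
                             round(m["avg_tpot_ms"], 2),
                             round(m["tokens_per_sec"], 2)])

# Illustrative numbers only, to show the expected shape
save_context_results({
    512: {"avg_ttft_ms": 180.0, "avg_tpot_ms": 45.0, "tokens_per_sec": 21.0},
    2048: {"avg_ttft_ms": 640.0, "avg_tpot_ms": 48.0, "tokens_per_sec": 19.5},
})
```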
Part 6: Optimizing Performance
Hardware Optimization
For Apple Silicon (M1/M2/M3/M4):
# Ensure you're using the latest Docker Desktop
# Metal acceleration is automatic for Apple Silicon
# Monitor resource usage
docker stats
# Check Model Runner logs
tail -f ~/Library/Containers/com.docker.docker/Data/log/host/dmr.log
For NVIDIA GPUs on Linux:
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
# Verify GPU access
docker run --rm --gpus all ubuntu nvidia-smi
Model Selection Guidelines
For fastest response (TTFT < 500ms):
- Use Q4_K_S quantization
- Choose models < 3B parameters
- Examples:
smollm3:Q4_K_M, llama3.2:1B-Q4_K_M
For best quality at reasonable speed:
- Use Q4_K_M or Q5_K_M quantization
- Choose 7B-8B parameter models
- Examples:
llama3.2:8B-Q4_K_M, qwen3:8B-Q5_K_M
For production deployment:
- Test with expected load (use concurrency benchmarks)
- Monitor resource usage (CPU, RAM, GPU)
- Set up appropriate rate limiting
- Consider model caching strategies
Configuration Tuning
Create a configuration file model_config.json:
{
  "model": "ai/llama3.2:3B-Q4_K_M",
  "inference_params": {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 2048,
    "stream": true
  },
  "performance": {
    "n_ctx": 4096,
    "n_batch": 512,
    "n_threads": 8,
    "n_gpu_layers": -1
  }
}
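A small loader keeps benchmark and application code in sync with this file. Note the split: only inference_params map onto the OpenAI-compatible API, while the performance block is engine-side tuning that Docker Model Runner may or may not expose in your version (an assumption worth verifying):

```python
import json

def load_model_config(path: str = "model_config.json") -> tuple:
    """Return (model_name, inference_params) from the config file.

    The 'performance' section is intentionally not returned: those keys
    are not OpenAI chat-completion parameters.
    """
    with open(path) as f:
        cfg = json.load(f)
    return cfg["model"], cfg.get("inference_params", {})

# Usage sketch:
# model, params = load_model_config()
# client.chat.completions.create(
#     model=model,
#     messages=[{"role": "user", "content": "Hello"}],
#     **params)
```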
Part 7: Real-World Application Patterns
Pattern 1: RAG (Retrieval-Augmented Generation)
RAG applications have unique characteristics:
- Long input contexts (2K-8K tokens with retrieved documents)
- Shorter outputs (200-500 tokens)
- Latency sensitive (users waiting for answers)
Recommended configuration:
# Benchmark for RAG workload
await run_benchmark(
    model="ai/llama3.2:8B-Q4_K_M",  # Larger model for better comprehension
    num_requests=20,
    concurrency=2,        # Typical user load
    input_tokens=4096,    # Context + retrieved docs
    output_tokens=256     # Concise answers
)
Expected performance:
- TTFT: 800-1500ms (acceptable for search-like experience)
- TPOT: 60-100ms
- Throughput: 10-15 tokens/sec per user
Pattern 2: Code Generation
Code generation has different requirements:
- Medium input (1K-2K tokens of context)
- Longer output (500-2000 tokens of code)
- Quality matters more than speed
Recommended configuration:
await run_benchmark(
    model="ai/qwen3:8B-Q5_K_M",  # Better for code
    num_requests=10,
    concurrency=1,         # Usually a single developer
    input_tokens=1024,
    output_tokens=1024     # Full function/class generation
)
Pattern 3: Chatbot / Conversational AI
Chatbots need:
- Variable input length (growing with conversation history)
- Quick responses (200-300 tokens)
- Low latency (feels like texting)
Recommended configuration:
await run_benchmark(
    model="ai/llama3.2:3B-Q4_K_M",  # Fast and capable
    num_requests=50,
    concurrency=8,        # Multiple concurrent chats
    input_tokens=512,     # Recent conversation
    output_tokens=200     # Brief responses
)
Part 8: Monitoring and Observability
Creating a Performance Dashboard
Install monitoring tools:
pip install psutil prometheus-client flask
Create monitor_dmr.py:
import psutil
import requests
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, generate_latest

app = Flask(__name__)

# Metrics (increment these from your request-handling path)
request_counter = Counter('dmr_requests_total', 'Total requests')
ttft_histogram = Histogram('dmr_ttft_seconds', 'Time to first token')
tpot_histogram = Histogram('dmr_tpot_seconds', 'Time per output token')

@app.route('/metrics')
def metrics():
    return generate_latest()

@app.route('/health')
def health():
    # Check Model Runner availability
    try:
        response = requests.get('http://localhost:12434/v1/models', timeout=2)
        dmr_status = "healthy" if response.status_code == 200 else "unhealthy"
    except requests.RequestException:
        dmr_status = "unreachable"
    return jsonify({
        "status": "ok" if dmr_status == "healthy" else "degraded",
        "model_runner": dmr_status,
        "system": {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage('/').percent
        }
    })

if __name__ == '__main__':
    app.run(port=9090)
System Resource Monitoring
Create resource_monitor.py:
import subprocess
import sys
import time
from datetime import datetime

import psutil

def monitor_resources(duration_seconds=60):
    """Monitor system resources during benchmark"""
    print(f"\n{'='*70}")
    print(f"Resource Monitor - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Monitoring for {duration_seconds} seconds...")
    print(f"{'='*70}\n")
    print(f"{'Time':<10} {'CPU %':<10} {'RAM %':<10} {'RAM GB':<15} {'GPU %':<10}")
    print("-" * 70)
    start_time = time.time()
    samples = []
    while time.time() - start_time < duration_seconds:
        cpu_percent = psutil.cpu_percent(interval=1)
        ram = psutil.virtual_memory()
        ram_percent = ram.percent
        ram_gb = ram.used / (1024**3)
        # Try to get GPU utilization (NVIDIA only)
        try:
            result = subprocess.run(
                ['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
                capture_output=True,
                text=True
            )
            gpu_percent = result.stdout.strip() or "N/A"
        except (FileNotFoundError, OSError):
            gpu_percent = "N/A"
        elapsed = int(time.time() - start_time)
        print(f"{elapsed:<10} {cpu_percent:<10.1f} {ram_percent:<10.1f} {ram_gb:<15.2f} {gpu_percent:<10}")
        samples.append({
            'cpu': cpu_percent,
            'ram': ram_percent,
            'ram_gb': ram_gb,
            'time': elapsed
        })
        time.sleep(1)
    # Summary
    print(f"\n{'='*70}")
    print("Resource Usage Summary")
    print(f"{'='*70}")
    avg_cpu = sum(s['cpu'] for s in samples) / len(samples)
    max_cpu = max(s['cpu'] for s in samples)
    avg_ram = sum(s['ram'] for s in samples) / len(samples)
    max_ram_gb = max(s['ram_gb'] for s in samples)
    print("CPU Usage:")
    print(f"  Average: {avg_cpu:.1f}%")
    print(f"  Peak: {max_cpu:.1f}%")
    print("\nRAM Usage:")
    print(f"  Average: {avg_ram:.1f}%")
    print(f"  Peak: {max_ram_gb:.2f} GB")
    print(f"{'='*70}\n")

if __name__ == "__main__":
    duration = int(sys.argv[1]) if len(sys.argv) > 1 else 60
    monitor_resources(duration)
Usage:
# Run monitoring in one terminal
python resource_monitor.py 120
# Run benchmark in another terminal
python benchmark_dmr.py
Part 9: Troubleshooting Common Issues
Issue 1: Slow TTFT (> 2 seconds)
Possible causes:
- Large input context
- Model not fully loaded
- Insufficient RAM
Solutions:
# Check if model is loaded
docker model list
# Reduce context size
# In your code, use shorter prompts
# Use smaller quantization
docker model pull ai/llama3.2:3B-Q4_K_S
# Check available RAM
free -h # Linux
vm_stat # macOS
Issue 2: Low Throughput (< 5 tokens/sec)
Possible causes:
- Running on CPU instead of GPU/Neural Engine
- Background processes consuming resources
- Model too large for available memory
Solutions:
# For macOS: Ensure Metal acceleration is working
# Check Docker Desktop logs
tail -f ~/Library/Containers/com.docker.docker/Data/log/host/dmr.log
# For Linux: Verify GPU access
docker run --rm --gpus all ubuntu nvidia-smi
# Close unnecessary applications
# Use smaller model
docker model pull ai/smollm3:Q4_K_M
Issue 3: Model Runner Connection Errors
Symptoms:
Error: Cannot connect to Model Runner
Solutions:
# Check if Model Runner is enabled
docker model --help
# Restart Docker Desktop
# macOS: Cmd+Q, then reopen
# Linux: sudo systemctl restart docker
# Verify port 12434 is available
lsof -i :12434 # Should show Model Runner process
netstat -an | grep 12434
# Check Model Runner service status
docker model status
Issue 4: Out of Memory Errors
Symptoms:
- Model fails to load
- System becomes unresponsive
- Benchmark crashes
Solutions:
# Check current memory usage
docker stats
# Use smaller quantization or model
# Q4_K_S < Q4_K_M < Q5_K_M < Q6_K < Q8_0
# Reduce batch size in benchmark
# Modify concurrency parameter
# Clear Docker cache
docker system prune -a
# Increase Docker Desktop memory limit
# Settings → Resources → Memory
Part 10: Production Deployment Checklist
Before deploying your LLM application to production, use these benchmarks to validate:
Performance Requirements
- TTFT meets SLA: < 2 seconds for 95th percentile
- Throughput sufficient: Can handle peak concurrent users
- Resource usage stable: No memory leaks over extended runs
- Error rate acceptable: < 1% failed requests
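The first two checklist items can be evaluated mechanically against a run_benchmark summary. The helper below is a sketch; its thresholds mirror the checklist and are assumptions to adapt to your own SLA:

```python
def meets_sla(summary: dict,
              max_p95_ttft_ms: float = 2000.0,
              min_tokens_per_sec: float = 25.0) -> dict:
    """Evaluate a benchmark summary against simple SLA thresholds.

    Expects the p95_ttft_ms and tokens_per_sec keys produced by the
    benchmark script's get_summary().
    """
    checks = {
        "p95_ttft": summary["p95_ttft_ms"] < max_p95_ttft_ms,
        "throughput": summary["tokens_per_sec"] > min_tokens_per_sec,
    }
    checks["pass"] = all(checks.values())
    return checks

print(meets_sla({"p95_ttft_ms": 1450.0, "tokens_per_sec": 31.2}))
# → {'p95_ttft': True, 'throughput': True, 'pass': True}
```

Wiring this into CI turns the checklist into a regression gate: fail the build when a model or configuration change pushes a metric past its threshold.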
Benchmark Validation
async def production_readiness_check():
    """Validate production readiness with comprehensive benchmarks"""
    model = "ai/llama3.2:3B-Q4_K_M"
    print("\n" + "="*70)
    print("PRODUCTION READINESS CHECK")
    print("="*70)

    # Test 1: Peak load handling
    print("\n1. Peak Load Test (20 concurrent users)")
    peak_result = await run_benchmark(
        model=model,
        num_requests=100,
        concurrency=20,
        input_tokens=2048,
        output_tokens=256
    )
    peak_ttft_ok = peak_result['p95_ttft_ms'] < 2000
    print(f"  P95 TTFT: {peak_result['p95_ttft_ms']:.2f}ms - {'✓ PASS' if peak_ttft_ok else '✗ FAIL'}")

    # Test 2: Sustained load
    print("\n2. Sustained Load Test (300 requests, 5 concurrent)")
    sustained_result = await run_benchmark(
        model=model,
        num_requests=300,
        concurrency=5,
        input_tokens=1024,
        output_tokens=200
    )
    sustained_throughput_ok = sustained_result['tokens_per_sec'] > 25
    print(f"  Throughput: {sustained_result['tokens_per_sec']:.2f} tok/s - {'✓ PASS' if sustained_throughput_ok else '✗ FAIL'}")

    # Test 3: Edge cases
    print("\n3. Edge Case Test (very long context)")
    edge_result = await run_benchmark(
        model=model,
        num_requests=10,
        concurrency=1,
        input_tokens=8192,
        output_tokens=100
    )
    edge_ttft_ok = edge_result['avg_ttft_ms'] < 5000
    print(f"  Long context TTFT: {edge_result['avg_ttft_ms']:.2f}ms - {'✓ PASS' if edge_ttft_ok else '✗ FAIL'}")

    # Overall assessment
    all_passed = peak_ttft_ok and sustained_throughput_ok and edge_ttft_ok
    print("\n" + "="*70)
    if all_passed:
        print("✓ PRODUCTION READY")
        print("Your configuration meets performance requirements.")
    else:
        print("✗ NOT READY FOR PRODUCTION")
        print("Some benchmarks failed. Consider:")
        if not peak_ttft_ok:
            print("  - Use smaller model or better hardware for peak load")
        if not sustained_throughput_ok:
            print("  - Increase concurrency limit or add load balancing")
        if not edge_ttft_ok:
            print("  - Optimize long-context handling or set context limits")
    print("="*70)
Monitoring Setup
# Set up Prometheus monitoring
cat > prometheus.yml << EOF
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'dmr_monitoring'
static_configs:
- targets: ['localhost:9090']
EOF
# Run Prometheus
docker run -d \
--name prometheus \
-p 9091:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
# Set up Grafana
docker run -d \
--name grafana \
-p 3000:3000 \
grafana/grafana
Conclusion
You’ve now learned how to comprehensively benchmark LLMs running on Docker Model Runner. Key takeaways:
- Measure what matters: TTFT, TPOT, and throughput are the critical metrics for user experience
- Test realistic scenarios: Single-user, multi-user, and stress tests reveal different aspects of performance
- Optimize for your use case: RAG, code generation, and chat have different performance profiles
- Monitor in production: Use Prometheus and custom dashboards to track performance over time
- Validate before deploying: Run production readiness checks to ensure your system can handle real-world load
Next Steps
- Experiment with different models: Try models from Docker Hub and HuggingFace
- Build a demo application: Use benchmarks to inform your architecture decisions
- Integrate with CI/CD: Add performance regression tests to your pipeline
- Join the community: Share your findings on Collabnix Discord
Additional Resources
- Docker Model Runner Documentation
- Docker Model Runner GitHub
- Collabnix AI/ML Tutorials
- Docker Hub AI Models
- NVIDIA Jetson AI Lab (inspiration for this tutorial)