What is Docker Model Runner?
Docker Model Runner is a Docker Desktop feature that lets developers run AI models locally with minimal setup. Available in Docker Desktop 4.40 and later, it brings large language model (LLM) inference directly into your containerized development workflow.
Key Benefits
- ✅ No extra infrastructure – Runs natively on your machine
- ✅ OpenAI-compatible API – Drop-in replacement for OpenAI calls
- ✅ GPU acceleration – Optimized for Apple Silicon and NVIDIA GPUs
- ✅ OCI artifacts – Package GGUF model files as OCI artifacts and publish them to any container registry
- ✅ Host-based execution – Maximum performance, no VM overhead
🚀 Quick Setup Guide
Prerequisites
- Docker Desktop 4.40+ (4.41+ for Windows GPU support)
- macOS: Apple Silicon (M1 or later) for optimal performance
- Windows: NVIDIA GPU (for GPU acceleration)
- Linux: Docker Engine with the Docker Model CLI plugin (docker-model-plugin, see below)
Enable Docker Model Runner
Docker Desktop (GUI)
- Open Docker Desktop Settings
- Navigate to Features in development → Beta
- Enable “Docker Model Runner”
- Apply & Restart
Docker Desktop (CLI)
# Enable Model Runner
docker desktop enable model-runner
# Enable with TCP support (for host access)
docker desktop enable model-runner --tcp 12434
# Check status
docker desktop status
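Once TCP access is enabled, you can sanity-check the endpoint from the host. The sketch below is a minimal check using the OpenAI Python client (shown in more detail later); it assumes the default TCP port 12434 from the command above.

import openai

# Point the client at the host TCP endpoint enabled above (port 12434 assumed)
client = openai.OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed",  # local inference does not require a real key
)

# Any models you have pulled should appear in this listing
for model in client.models.list():
    print(model.id)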
Docker Engine (Linux)
# Install the Model Runner plugin (Debian/Ubuntu)
sudo apt-get update
sudo apt-get install docker-model-plugin
📋 Essential Commands
Model Management
Pull Models
# Pull a model from Docker Hub (defaults to the latest tag)
docker model pull ai/smollm2
List Models
# List all local models
docker model ls
Remove Models
# Remove specific model
docker model rm ai/smollm2
Running Models
One-Shot and Interactive Mode
# One-shot inference with a prompt
docker model run ai/smollm2 "Explain Docker in one sentence"
# Omit the prompt to start an interactive chat session
docker model run ai/smollm2
Model Information
# Inspect model details
docker model inspect ai/smollm2
🔗 API Integration
OpenAI-Compatible Endpoints
From Containers
# Base URL for container access
http://model-runner.docker.internal/engines/llama.cpp/v1/
From Host (with TCP enabled)
# Base URL for host access
http://localhost:12434/engines/llama.cpp/v1/
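Because the only difference between the two environments is the base URL, it helps to make it configurable. A minimal sketch, assuming a hypothetical MODEL_RUNNER_BASE_URL environment variable that you define yourself:

import os
import openai

# MODEL_RUNNER_BASE_URL is a name chosen for this example, not something
# Docker sets for you; fall back to the in-container URL if it is unset.
base_url = os.environ.get(
    "MODEL_RUNNER_BASE_URL",
    "http://model-runner.docker.internal/engines/llama.cpp/v1",
)

client = openai.OpenAI(base_url=base_url, api_key="not-needed")

On the host you would set MODEL_RUNNER_BASE_URL=http://localhost:12434/engines/llama.cpp/v1 before running the script; inside a container the default applies.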
Chat Completions API
cURL Example
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/smollm2",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful coding assistant."
      },
      {
        "role": "user",
        "content": "Write a Docker Compose file for a web app"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
Python Example
import openai

# Configure client for local Model Runner
client = openai.OpenAI(
    base_url="http://model-runner.docker.internal/engines/llama.cpp/v1",
    api_key="not-needed"  # Local inference doesn't need API key
)

# Chat completion
response = client.chat.completions.create(
    model="ai/smollm2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain containerization benefits"}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)
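For longer replies you may want tokens as they are generated rather than one final payload. The sketch below is the standard OpenAI client streaming pattern, reusing the client configured above; it assumes the llama.cpp engine behind Model Runner honors the stream parameter, which you should verify for your version.

# Stream the response token-by-token, reusing the `client` configured above
stream = client.chat.completions.create(
    model="ai/smollm2",
    messages=[{"role": "user", "content": "Summarize Docker networking in two sentences"}],
    stream=True,
)

for chunk in stream:
    # Some chunks (including the final one) may carry no content delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()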
Node.js Example
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://model-runner.docker.internal/engines/llama.cpp/v1',
  apiKey: 'not-needed'
});

async function chatWithModel() {
  const completion = await openai.chat.completions.create({
    model: 'ai/smollm2',
    messages: [
      { role: 'system', content: 'You are a DevOps expert.' },
      { role: 'user', content: 'Best practices for Docker in production?' }
    ],
    temperature: 0.8,
    max_tokens: 300
  });
  console.log(completion.choices[0].message.content);
}

// Invoke the helper so the example actually runs
chatWithModel().catch(console.error);
🐳 Docker Compose Integration
services:
  chat:
    image: my-chat-app
    depends_on:
      - ai_runner
  ai_runner:
    provider:
      type: model
      options:
        model: ai/smollm2
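With a model provider service like the one above, recent Compose versions can inject the runner's endpoint and model name into dependent services as environment variables. The exact variable names depend on your Compose version and service name, so treat AI_RUNNER_URL and AI_RUNNER_MODEL below as assumptions to verify against your setup. A minimal sketch of how my-chat-app could consume them:

import os
import openai

# Assumption: Compose exposes the provider's endpoint and model name to the
# dependent "chat" service as AI_RUNNER_URL / AI_RUNNER_MODEL; check your
# Compose version's documentation for the exact names.
base_url = os.environ["AI_RUNNER_URL"]
model_name = os.environ["AI_RUNNER_MODEL"]

client = openai.OpenAI(base_url=base_url, api_key="not-needed")
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Hello from Compose!"}],
)
print(response.choices[0].message.content)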
🐳 REST API Endpoints
Model Management Endpoints:
POST /models/create
GET /models
GET /models/{namespace}/{name}
DELETE /models/{namespace}/{name}
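These management routes are plain HTTP. Here is a minimal sketch using only the Python standard library, assuming host TCP access is enabled and the management endpoints are served on the same port (12434) as the OpenAI-compatible ones; verify the exact paths against your installation.

import json
import urllib.request

# Assumption: with host TCP access enabled, the management API is reachable
# on the same port as the OpenAI-compatible endpoints.
with urllib.request.urlopen("http://localhost:12434/models") as resp:
    models = json.load(resp)

# The exact response shape isn't documented here, so just pretty-print it
print(json.dumps(models, indent=2))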
OpenAI Endpoints:
GET /engines/llama.cpp/v1/models
GET /engines/llama.cpp/v1/models/{namespace}/{name}
POST /engines/llama.cpp/v1/chat/completions
POST /engines/llama.cpp/v1/completions
POST /engines/llama.cpp/v1/embeddings
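The embeddings route follows the OpenAI spec, so the same client pattern applies. A minimal sketch; the model name is a placeholder for an embedding-capable model you have pulled locally (ai/smollm2 is a chat model).

import openai

# Reuse the host TCP endpoint; adjust the base URL if calling from a container
client = openai.OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed",
)

# Placeholder model name: substitute an embedding model you have pulled
embedding_response = client.embeddings.create(
    model="ai/your-embedding-model",
    input=["Docker Model Runner runs models locally."],
)
print(len(embedding_response.data[0].embedding))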