Ever tried building a GenAI application and hit a wall? 🧱
I know I have. You start with excitement about implementing that cool chatbot or content generator, but then reality hits. You’re either sending sensitive data to third-party APIs with usage limits and costs that quickly add up 💸, or you’re wrestling with complex local setups that eat up your development time ⏱️. And let’s not forget the frustration of testing iteratively when each model call requires an internet connection or depletes your API credits.
This is where Docker’s new Model Runner enters the picture, and it might just be the solution we’ve all been waiting for. 🎉
Where Does Model Runner Fit in Your Workflow?

If you’re already using Docker (and who isn’t these days?😉), Model Runner slots perfectly into your existing development environment.
Introduced in Docker Desktop 4.40+ for Mac, with Windows support coming in April 2025, it brings LLM inference capabilities right into the Docker ecosystem you already know and trust. 💙
Think of it as extending Docker’s container philosophy to AI models. Just as containers made application deployment consistent and portable, Model Runner does the same for AI model inference. With the new docker model CLI, you can pull, run, and manage models as easily as you do containers. 🐳
Solving Real Development Headaches

Model Runner addresses several pain points that have been slowing down GenAI development:
- 🔒 “How do I keep my data private during development?” Model Runner runs everything locally – your prompts and data never leave your machine.
- 💰 “These API costs are killing my budget.” With locally running models, you can test and iterate to your heart’s content without worrying about per-token pricing.
- 🛠️ “Setting up local inference is too complex.” The docker model CLI makes it as simple as docker model pull ai/llama3.2:1B-Q8_0 and you’re ready to go.
- 💻 “My laptop can’t handle these models.” Model Runner is optimized for Apple Silicon’s GPU capabilities, making even 1B+ parameter models run smoothly on your MacBook.
- 🔄 “I need to switch models frequently during development.” With Model Runner, switching between different models is just a command away, without changing your application code.
🧩 Integrating with Your GenAI Stack
The beauty of Model Runner is how easily it integrates with the rest of your GenAI development stack. It offers OpenAI-compatible endpoints, meaning you can use existing libraries and code patterns you might already be familiar with. 🤯
You have multiple connection options:
- 📦 From within containers via http://model-runner.docker.internal/
- 🔌 From the host via the Docker Socket
- 🌐 From the host via TCP when host support is enabled
This flexibility means you can build your frontend in React, your backend in Go, Python, or Node.js, and have everything communicate seamlessly with the locally running LLM. You can even enable a host-side TCP endpoint to connect from non-containerized applications. Amazing, right? ✨
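To make that concrete, here's a minimal sketch of a containerized Python service calling Model Runner through the internal DNS name. It assumes you've installed the openai package and already pulled ai/llama3.2:1B-Q8_0 (the model used later in this post); the API key is just a placeholder since local inference doesn't bill per token:
# Minimal sketch: chat completion from inside a container via Model Runner's internal DNS name.
# Assumes `pip install openai` and that ai/llama3.2:1B-Q8_0 has been pulled with `docker model pull`.
from openai import OpenAI

client = OpenAI(
    base_url="http://model-runner.docker.internal/engines/llama.cpp/v1/",
    api_key="modelrunner",  # placeholder; the client requires a value, but local inference doesn't bill against it
)

response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[{"role": "user", "content": "Give me a one-line definition of a Docker container."}],
)
print(response.choices[0].message.content)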
🤔 How Is Model Runner Different from Ollama?
While Docker Model Runner and Ollama both enable local LLM inference, they serve different purposes and workflows. Here’s how Model Runner stands apart:
🐳 1. Native Docker Integration
Unlike Ollama, which operates as a standalone tool, Model Runner is fully integrated into the Docker ecosystem. The docker model CLI treats AI models as first-class citizens alongside containers, images, and volumes. This means Docker users can manage their models using familiar commands and patterns, with no need to learn a separate toolset or workflow.
🏭 2. Production-Ready Architecture
Ollama excels at quick personal experimentation, but Model Runner is designed with production workflows in mind from the ground up:
- OCI Artifact Storage: Models are stored as standardized OCI artifacts in Docker Hub or other compatible registries, enabling proper versioning and distribution through existing CI/CD pipelines.
- Multiple Connection Methods: Model Runner offers flexible integration options via Docker socket, DNS resolution, or TCP connections.
- Better Resource Management: By running as a host-level process with direct GPU access, Model Runner achieves better performance optimization than containerized solutions.
🏢 3. Enterprise Features
Model Runner addresses enterprise needs that Ollama doesn’t focus on:
- Registry Integration: Works seamlessly with private corporate registries that support OCI artifacts
- Standardized Model Distribution: Teams can publish and share models using the same infrastructure as their container images
- Optimized Storage: Better handling of large, uncompressible model weights without unnecessary layer compression
🔄 4. OpenAI-Compatible API
Model Runner implements OpenAI-compatible endpoints, making it straightforward to integrate with existing codebases or switch between cloud APIs and local inference without major code changes.
This compatibility layer makes migration between development and production environments much smoother.🧈
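As a rough sketch of what that migration looks like in practice (the environment variable names below mirror the backend.env file used later in this post and are otherwise just an assumption), the only thing that changes between local and cloud inference is configuration, not code:
import os
from openai import OpenAI

# The same client serves both environments; only BASE_URL, API_KEY, and MODEL change.
client = OpenAI(
    base_url=os.getenv("BASE_URL", "http://model-runner.docker.internal/engines/llama.cpp/v1/"),
    api_key=os.getenv("API_KEY", "modelrunner"),
)
MODEL = os.getenv("MODEL", "ai/llama3.2:1B-Q8_0")
# Requests are then made exactly as they would be against the cloud API.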
🐳 5. Docker-First Development Experience
For teams already invested in Docker-based workflows, Model Runner provides a more cohesive development experience:
- Models are pulled from the same registries as container images
- Familiar Docker commands for lifecycle management
- Seamless integration with Docker Compose for multi-service applications
While Ollama remains an excellent choice for individual developers wanting to quickly experiment with LLMs, Docker Model Runner is positioned as the more robust solution for teams building production-ready GenAI applications within the Docker ecosystem.
⚙️ How Does Model Runner Work?
Docker Model Runner uses a fundamentally different approach to running LLMs compared to traditional containerization methods. Here’s a deep dive into how it actually works:
🖥️ Host-Native Inference Engine
Unlike standard Docker containers that package everything together, Model Runner doesn’t run the AI model in a container. Instead:
- Docker Desktop runs the inference engine (currently llama.cpp) directly on your host machine
- This native process runs outside the container environment
- The model weights are loaded by this host process when needed
This architectural choice significantly improves performance by eliminating containerization overhead for resource-intensive AI workloads. 🚀
🔥 Direct GPU Access
One of the key advantages of the host-native approach is optimized GPU utilization:
- On Apple Silicon Macs, Model Runner directly accesses the Metal API
- This provides maximum GPU acceleration without virtualization layers
- You can observe this direct GPU usage in Activity Monitor when queries are processed
- The inference process appears as a separate GPU process during model operation
💾 Model Loading and Storage
Models are handled as OCI artifacts rather than container images:
- When you run docker model pull, the model files are downloaded from Docker Hub
- These models are cached locally on your host machine's storage
- The host-level inference engine dynamically loads these model files into memory when needed
- When you’re done, the model can be unloaded from memory but remain cached on disk
This approach is particularly beneficial for AI models because:
- Model weights are largely uncompressible, making traditional Docker image compression inefficient
- It avoids having both compressed and uncompressed versions of model weights on disk
- It provides faster deployment by eliminating the container build/run cycle
🔌 Connection Architecture
Model Runner exposes its functionality through multiple interfaces:
- 🔄 Docker Socket: /var/run/docker.sock
  - Traditional Docker communication method
  - Used by containers and host applications that access Docker
- 🌐 Internal DNS Resolution: model-runner.docker.internal
  - Accessible from within Docker containers
  - Provides a stable hostname for containerized applications
- 🔌 Optional TCP Port (default: 12434)
  - When enabled, accepts connections directly on the host
  - Useful for non-containerized applications
All these interfaces implement OpenAI-compatible API endpoints, making them familiar for developers who have worked with cloud AI services.
🧠 API Implementation
The Model Runner implements several key OpenAI-compatible endpoints:
- /engines/{backend}/v1/models – List available models
- /engines/{backend}/v1/models/{namespace}/{name} – Get model details
- /engines/{backend}/v1/chat/completions – Generate chat completions
- /engines/{backend}/v1/completions – Generate text completions
- /engines/{backend}/v1/embeddings – Generate embeddings
These endpoints follow OpenAI’s API patterns, allowing developers to use existing client libraries designed for OpenAI.
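If you'd rather hit the raw HTTP endpoints than use a client library, here's a hedged sketch with Python's requests package. It assumes host-side TCP support is enabled on the default port 12434 (covered in the getting-started steps below) and that ai/llama3.2:1B-Q8_0 has been pulled:
import requests

BASE_URL = "http://localhost:12434"  # host-side TCP endpoint; use the Docker-internal DNS name from inside a container

# List the models the llama.cpp backend knows about
print(requests.get(f"{BASE_URL}/engines/llama.cpp/v1/models").json())

# Plain text completion against the same backend
resp = requests.post(
    f"{BASE_URL}/engines/llama.cpp/v1/completions",
    json={
        "model": "ai/llama3.2:1B-Q8_0",
        "prompt": "Docker Model Runner is",
        "max_tokens": 40,
    },
)
print(resp.json()["choices"][0]["text"])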
📊 Logging and Monitoring
The inference engine logs detailed information about each run:
- Logs are stored at ~/Library/Containers/com.docker.docker/Data/log/host/inference-llama.cpp.log
- They include token processing statistics, timing information, and error messages
- These logs are valuable for debugging and performance optimization
By leveraging host-native processes with direct GPU access while maintaining Docker’s management capabilities, Model Runner delivers the best of both worlds: the performance of native AI inference with the convenience and standardization of the Docker ecosystem.
💻 Supported Platforms
Docker Model Runner currently has limited platform support as it’s an experimental feature, with planned expansions coming soon:
Currently Supported:
- macOS with Apple Silicon (M-series chips)
- Requires Docker Desktop 4.40 or newer
- Optimized for Apple’s Metal API for GPU acceleration
- Works with M1, M2, M3, and M4 chips
- Uses llama.cpp as the inference engine
🚀 Getting Started
Let’s walk through setting up Docker Model Runner and building your first GenAI application step by step.
Step 1: Set Up Docker Desktop with Model Runner
- Install the latest version of Docker Desktop 4.40+ from the Docker website.
- Launch Docker Desktop and go to Settings (gear icon).
- Navigate to the Features section.
- Locate “Docker Model Runner” and ensure it’s enabled.
- Optionally enable “Host-side TCP support” if you want to access models outside of Docker containers.
- Click Apply & Restart to save your changes.
Step 2: Verify the Model Runner CLI Is Enabled
Open your terminal and run docker model --help to confirm the CLI is available:
$ docker model --help
Usage: docker model COMMAND
Docker Model Runner
Commands:
inspect Display detailed information on one model
list List the available models that can be run with the Docker Model Runner
pull Download a model
rm Remove a model downloaded from Docker Hub
run Run a model with the Docker Model Runner
status Check if the Docker Model Runner is running
version Show the Docker Model Runner version
Run 'docker model COMMAND --help' for more information on a command.
Run docker model status to verify that Model Runner is active.
$ docker model status
You should see a message confirming: “Docker Model Runner is running”.
Step 3: Find and Download Your First Model
First, list the models available locally (on a fresh install the list will be empty):
docker model ls
MODEL PARAMETERS QUANTIZATION ARCHITECTURE FORMAT MODEL ID CREATED SIZE
Then download a model with:
docker model pull ai/llama3.2:1B-Q8_0
This downloads the Llama 3.2 1B model (roughly 1.2 billion parameters) quantized to 8 bits (Q8_0), small enough to run comfortably on a laptop while still being useful for local inference.
Verify the model was downloaded by running docker model ls again.
Step 4: Test Your Model With a Simple Prompt
Run a quick test with the following command, which drops you into an interactive chat session with the model (you can also pass a prompt as an extra argument for a one-shot response):
$ docker model run ai/llama3.2:1B-Q8_0
Step 5: Create a GenAI Application with Model Runner
Now let’s build a simple web application that uses our model:
Clone the example repository:
git clone https://github.com/dockersamples/genai-app-demo
cd genai-app-demo
Review the key configuration files:
- backend.env – Contains environment variables for the API
- docker-compose.yml – Defines our application services
Set the required environment variables in backend.env:
BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
API_KEY=${API_KEY:-modelrunner}
Start the application:
docker compose up -d
Access your GenAI application at http://localhost:3000
Try asking questions in the chat interface and watch as your local model processes them!
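If you'd like to script against the model directly instead of going through the web UI, here's a small streaming sketch using the openai package. It assumes you enabled host-side TCP support in Step 1; otherwise run it inside a container on the same Compose network and use the BASE_URL from backend.env:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1/",  # host-side TCP endpoint (default port)
    api_key="modelrunner",  # placeholder; local inference doesn't check billing credentials
)

# Stream tokens as they are generated, just as you would against a cloud API
stream = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[{"role": "user", "content": "Explain OCI artifacts in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()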
Step 6: Monitor Performance
- On your Mac, press Command + Spacebar to open Spotlight.
- Type “Activity Monitor” and open it.
- Navigate to the GPU tab.
- Watch the GPU activity spike when you send prompts to your model.
Step 7: Experiment with TCP Connection Method
For applications that need to connect outside of Docker:
- Go back to Docker Desktop Settings.
- Ensure “Enable host-side TCP support” is checked under Model Runner.
- Note the port number (default is 12434).
- Update your backend.env to use this connection method:
BASE_URL=http://host.docker.internal:12434/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
API_KEY=${API_KEY:-modelrunner}
Restart your application:
docker compose down
docker compose up -d
- Test the application again at http://localhost:3000 (the sketch below shows a quick way to verify the TCP endpoint directly from the host).
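Before pointing your app at the TCP endpoint, a quick host-side smoke test can confirm it's reachable; here's a tiny sketch assuming the default port 12434:
import requests

# A 200 response with a model list means the host-side TCP endpoint is up
resp = requests.get("http://localhost:12434/engines/llama.cpp/v1/models", timeout=5)
print(resp.status_code, resp.json())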
Step 8: Clean Up When Done
- Stop your application:
docker compose down
- Remove the model if you’re done with it:
docker model rm ai/llama3.2:1B-Q8_0
Next Steps
Now that you’ve set up your first GenAI application with Docker Model Runner, you can:
- Try other available models like ai/gemma3, ai/mistral, or ai/phi4
- Explore the OpenAI-compatible API endpoints for more advanced integration
- Build your own frontend or backend that connects to Model Runner
- Check out additional demos from the Docker samples repository
With Docker Model Runner, you’ve brought powerful AI capabilities right into your development workflow without leaving the Docker ecosystem or sending data to external services. Happy building!