
Docker Model Runner: The Missing Piece for Your GenAI Development Workflow


Ever tried building a GenAI application and hit a wall? 🧱

I know I have. You start with excitement about implementing that cool chatbot or content generator, but then reality hits. You’re either sending sensitive data to third-party APIs with usage limits and costs that quickly add up 💸, or you’re wrestling with complex local setups that eat up your development time ⏱️. And let’s not forget the frustration of testing iteratively when each model call requires an internet connection or depletes your API credits.

This is where Docker’s new Model Runner enters the picture, and it might just be the solution we’ve all been waiting for. 🎉

Where Does Model Runner Fit in Your Workflow?

If you’re already using Docker (and who isn’t these days?😉), Model Runner slots perfectly into your existing development environment.

Introduced in Docker Desktop 4.40+ for Mac, with Windows support coming in April 2025, it brings LLM inference capabilities right into the Docker ecosystem you already know and trust. 💙

Think of it as extending Docker’s container philosophy to AI models. Just as containers made application deployment consistent and portable, Model Runner does the same for AI model inference. With the new docker model CLI, you can pull, run, and manage models as easily as you do containers. 🐳

Solving Real Development Headaches

Model Runner addresses several pain points that have been slowing down GenAI development:

  • 🔒 “How do I keep my data private during development?” Model Runner runs everything locally – your prompts and data never leave your machine.
  • 💰 “These API costs are killing my budget.” With locally running models, you can test and iterate to your heart’s content without worrying about per-token pricing.
  • 🛠️ “Setting up local inference is too complex.” The docker model CLI makes it as simple as docker model pull ai/llama3.2:1B-Q8_0 and you’re ready to go.
  • 💻 “My laptop can’t handle these models.” Model Runner is optimized for Apple Silicon’s GPU capabilities, making even 1B+ parameter models run smoothly on your MacBook.
  • 🔄 “I need to switch models frequently during development.” With Model Runner, switching between different models is just a command away, without changing your application code.

🧩 Integrating with Your GenAI Stack

The beauty of Model Runner is how easily it integrates with the rest of your GenAI development stack. It offers OpenAI-compatible endpoints, meaning you can use existing libraries and code patterns you might already be familiar with. 🤯

You have multiple connection options:

  • The Docker socket (/var/run/docker.sock)
  • An internal DNS name, model-runner.docker.internal, reachable from inside containers
  • An optional host-side TCP port (default 12434) for non-containerized applications

This flexibility means you can build your frontend in React, your backend in Go, Python, or Node.js, and have everything communicate seamlessly with the locally running LLM. Amazing, right? ✨
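
To make this concrete, here's a minimal sketch of calling the local model from Python, assuming the openai package is installed and the host-side TCP endpoint is enabled on its default port 12434:

# A minimal sketch (not from the original post): chat completion against
# Model Runner's OpenAI-compatible endpoint over the host-side TCP port.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1/",
    api_key="modelrunner",  # placeholder; matches the demo's backend.env default
)

response = client.chat.completions.create(
    model="ai/llama3.2:1B-Q8_0",
    messages=[{"role": "user", "content": "Explain Docker Model Runner in one sentence."}],
)
print(response.choices[0].message.content)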

🤔 How Is Model Runner Different from Ollama?

While Docker Model Runner and Ollama both enable local LLM inference, they serve different purposes and workflows. Here’s how Model Runner stands apart:

🐳 1. Native Docker Integration

Unlike Ollama, which operates as a standalone tool, Model Runner is fully integrated into the Docker ecosystem. The docker model CLI treats AI models as first-class citizens alongside containers, images, and volumes. This means Docker users can manage their models using familiar commands and patterns, with no need to learn a separate toolset or workflow.

🏭 2. Production-Ready Architecture

Ollama excels at quick personal experimentation, but Model Runner is designed with production workflows in mind from the ground up:

  • OCI Artifact Storage: Models are stored as standardized OCI artifacts in Docker Hub or other compatible registries, enabling proper versioning and distribution through existing CI/CD pipelines.
  • Multiple Connection Methods: Model Runner offers flexible integration options via Docker socket, DNS resolution, or TCP connections.
  • Better Resource Management: By running as a host-level process with direct GPU access, Model Runner achieves better performance optimization than containerized solutions.

🏢 3. Enterprise Features

Model Runner addresses enterprise needs that Ollama doesn’t focus on:

  • Registry Integration: Works seamlessly with private corporate registries that support OCI artifacts
  • Standardized Model Distribution: Teams can publish and share models using the same infrastructure as their container images
  • Optimized Storage: Better handling of large, uncompressible model weights without unnecessary layer compression

🔄 OpenAI-Compatible API

Model Runner implements OpenAI-compatible endpoints, making it straightforward to integrate with existing codebases or switch between cloud APIs and local inference without major code changes.

This compatibility layer makes migration between development and production environments much smoother. 🧈
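
To illustrate, here's a hedged sketch (assuming the openai Python package) of how little changes when you point the same client at the cloud versus the local endpoint:

# Sketch: only the configuration differs between cloud and local inference.
from openai import OpenAI

# Cloud: OpenAI's hosted API (requires a real key)
cloud_client = OpenAI(api_key="sk-your-real-key")

# Local: Docker Model Runner's host-side TCP endpoint (default port 12434)
local_client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1/",
    api_key="modelrunner",  # placeholder value; no cloud key needed
)

The rest of your request code stays identical, which is what makes moving between environments straightforward.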

🐳 Docker-First Development Experience

For teams already invested in Docker-based workflows, Model Runner provides a more cohesive development experience:

  • Models are pulled from the same registries as container images
  • Familiar Docker commands for lifecycle management
  • Seamless integration with Docker Compose for multi-service applications

While Ollama remains an excellent choice for individual developers wanting to quickly experiment with LLMs, Docker Model Runner is positioned as the more robust solution for teams building production-ready GenAI applications within the Docker ecosystem.

⚙️ How Does Model Runner Work?

Docker Model Runner uses a fundamentally different approach to running LLMs compared to traditional containerization methods. Here’s a deep dive into how it actually works:

🖥️ Host-Native Inference Engine

Unlike standard Docker containers that package everything together, Model Runner doesn’t run the AI model in a container. Instead:

  • Docker Desktop runs the inference engine (currently llama.cpp) directly on your host machine
  • This native process runs outside the container environment
  • The model weights are loaded by this host process when needed

This architectural choice significantly improves performance by eliminating containerization overhead for resource-intensive AI workloads. 🚀

🔥 Direct GPU Access

One of the key advantages of the host-native approach is optimized GPU utilization:

  • On Apple Silicon Macs, Model Runner directly accesses the Metal API
  • This provides maximum GPU acceleration without virtualization layers
  • You can observe this direct GPU usage in Activity Monitor when queries are processed
  • The inference process appears as a separate GPU process during model operation

💾 Model Loading and Storage

Models are handled as OCI artifacts rather than container images:

  • When you run docker model pull, the model files are downloaded from Docker Hub
  • These models are cached locally on your host machine’s storage
  • The host-level inference engine dynamically loads these model files into memory when needed
  • When you’re done, the model can be unloaded from memory but remain cached on disk

This approach is particularly beneficial for AI models because:

  • Model weights are largely uncompressible, making traditional Docker image compression inefficient
  • It avoids having both compressed and uncompressed versions of model weights on disk
  • It provides faster deployment by eliminating the container build/run cycle

🔌 Connection Architecture

Model Runner exposes its functionality through multiple interfaces:

  • 🔄 Docker Socket: /var/run/docker.sock
    • Traditional Docker communication method
    • Used by containers and host applications that access Docker
  • 🌐 Internal DNS Resolution: model-runner.docker.internal
    • Accessible from within Docker containers
    • Provides a stable hostname for containerized applications
  • 🔌 Optional TCP Port: (default: 12434)
    • When enabled, accepts connections directly on the host
    • Useful for non-containerized applications

All these interfaces implement OpenAI-compatible API endpoints, making them familiar for developers who have worked with cloud AI services.

🧠 API Implementation

The Model Runner implements several key OpenAI-compatible endpoints:

  • /engines/{backend}/v1/models – List available models
  • /engines/{backend}/v1/models/{namespace}/{name} – Get model details
  • /engines/{backend}/v1/chat/completions – Generate chat completions
  • /engines/{backend}/v1/completions – Generate text completions
  • /engines/{backend}/v1/embeddings – Generate embeddings

These endpoints follow OpenAI’s API patterns, allowing developers to use existing client libraries designed for OpenAI.
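
As a quick illustration (a sketch using the requests library and the host-side TCP endpoint on its default port 12434), you can exercise these endpoints directly:

# Sketch: hitting Model Runner's OpenAI-compatible endpoints directly.
import requests

base = "http://localhost:12434/engines/llama.cpp/v1"

# List the models available to the llama.cpp backend
print(requests.get(f"{base}/models").json())

# Request a plain text completion
resp = requests.post(
    f"{base}/completions",
    json={"model": "ai/llama3.2:1B-Q8_0", "prompt": "Docker is", "max_tokens": 32},
)
print(resp.json())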

📊 Logging and Monitoring

The inference engine logs detailed information about each run:

  • Logs are stored at ~/Library/Containers/com.docker.docker/Data/log/host/inference-llama.cpp.log
  • They include token processing statistics, timing information, and error messages
  • These logs are valuable for debugging and performance optimization
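
If you want to inspect these logs programmatically, a tiny sketch like this (using the macOS log path given above) prints the most recent entries:

# Sketch: peek at the latest inference log entries on macOS.
from pathlib import Path

log = Path.home() / "Library/Containers/com.docker.docker/Data/log/host/inference-llama.cpp.log"
print("\n".join(log.read_text().splitlines()[-20:]))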

By leveraging host-native processes with direct GPU access while maintaining Docker’s management capabilities, Model Runner delivers the best of both worlds: the performance of native AI inference with the convenience and standardization of the Docker ecosystem.

💻 Supported Platforms

Docker Model Runner currently has limited platform support as it’s an experimental feature, with planned expansions coming soon:

Currently Supported:

  • macOS with Apple Silicon (M-series chips)
  • Requires Docker Desktop 4.40 or newer
  • Optimized for Apple’s Metal API for GPU acceleration
  • Works with M1, M2, M3, and M4 chips
  • Uses llama.cpp as the inference engine

🚀 Getting Started

Let’s walk through setting up Docker Model Runner and building your first GenAI application step by step.

Step 1: Set Up Docker Desktop with Model Runner

  • Install the latest version of Docker Desktop 4.40+ from the Docker website.
  • Launch Docker Desktop and go to Settings (gear icon).
  • Navigate to the Features section.
  • Locate “Docker Model Runner” and ensure it’s enabled.
  • Optionally enable “Host-side TCP support” if you want to access models outside of Docker containers.
  • Click Apply & Restart to save your changes.

Step 2: Verify Model Runner CLI is enabled

Open your terminal or command prompt.

$ docker model --help
Usage:  docker model COMMAND

Docker Model Runner

Commands:
  inspect     Display detailed information on one model
  list        List the available models that can be run with the Docker Model Runner
  pull        Download a model
  rm          Remove a model downloaded from Docker Hub
  run         Run a model with the Docker Model Runner
  status      Check if the Docker Model Runner is running
  version     Show the Docker Model Runner version

Run 'docker model COMMAND --help' for more information on a command.

Run docker model status to verify that Model Runner is active.

$ docker model status

You should see a message confirming: “Docker Model Runner is running”.

Step 3: List the Available Models

docker model ls
MODEL  PARAMETERS  QUANTIZATION  ARCHITECTURE  FORMAT  MODEL ID  CREATED  SIZE

Step 4: Find and Download Your First Model

Download a model with:

docker model pull ai/llama3.2:1B-Q8_0

This will download a 1.2B parameter Llama 3.2 model optimized for inference.

Verify the model was downloaded by running docker model ls again.

Step 5: Test Your Model With a Simple Prompt

Run a quick test with:

$ docker model run ai/llama3.2:1B-Q8_0 

Step 6: Create a GenAI Application with Model Runner

Now let’s build a simple web application that uses our model:

Clone the example repository:

git clone https://github.com/dockersamples/genai-app-demo
cd genai-app-demo

Review the key configuration files:

backend.env - Contains environment variables for the API
docker-compose.yml - Defines our application services

Set the required environment variables in backend.env:

BASE_URL=http://model-runner.docker.internal/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
API_KEY=${API_KEY:-modelrunner}
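
For reference, a containerized backend (whatever language it is written in) can consume these variables with any OpenAI-compatible client; here's a minimal, hypothetical Python sketch:

# Hypothetical sketch of a backend reading backend.env-style variables.
# BASE_URL uses Model Runner's internal DNS name, which resolves from
# inside Docker containers.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["BASE_URL"],   # e.g. http://model-runner.docker.internal/engines/llama.cpp/v1/
    api_key=os.environ.get("API_KEY", "modelrunner"),
)

def answer(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=os.environ["MODEL"],     # e.g. ai/llama3.2:1B-Q8_0
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content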

Start the application:

docker compose up -d

Access your GenAI application at http://localhost:3000

Try asking questions in the chat interface and watch as your local model processes them!

Step 7: Monitor Performance

  • On your Mac, press Command + Spacebar to open Spotlight.
  • Type “Activity Monitor” and open it.
  • Navigate to the GPU tab.
  • Watch the GPU activity spike when you send prompts to your model.

Step 8: Experiment with the TCP Connection Method

For applications that need to connect outside of Docker:

  • Go back to Docker Desktop Settings.
  • Ensure “Enable host-side TCP support” is checked under Model Runner.
  • Note the port number (default is 12434).
  • Update your backend.env to use this connection method:
BASE_URL=http://host.docker.internal:12434/engines/llama.cpp/v1/
MODEL=ai/llama3.2:1B-Q8_0
API_KEY=${API_KEY:-modelrunner}

Restart your application:

docker compose down
docker compose up -d
Test the application again at http://localhost:3000.
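
If you want to sanity-check the TCP endpoint from the host itself (outside any container), a quick probe like this sketch, assuming the default port 12434, should return the model list:

# Sketch: verify the host-side TCP endpoint from outside Docker.
import requests

resp = requests.get("http://localhost:12434/engines/llama.cpp/v1/models")
resp.raise_for_status()
print(resp.json())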

Step 9: Clean Up When Done

  • Stop your application:
docker compose down

Remove the model if you’re done with it:

docker model rm ai/llama3.2:1B-Q8_0

Next Steps

Now that you’ve set up your first GenAI application with Docker Model Runner, you can:

  • Try other available models like ai/gemma3, ai/mistral, or ai/phi4
  • Explore the OpenAI-compatible API endpoints for more advanced integration
  • Build your own frontend or backend that connects to Model Runner
  • Check out additional demos from the Docker samples repository

With Docker Model Runner, you’ve brought powerful AI capabilities right into your development workflow without leaving the Docker ecosystem or sending data to external services. Happy building!

Have Queries? Join https://launchpass.com/collabnix

Ajeet Singh Raina is a former Docker Captain, Community Leader, and Distinguished Arm Ambassador. He is the founder of the Collabnix blogging site and has authored more than 700 blogs on Docker, Kubernetes, and cloud-native technology. He runs a community Slack with 9,800+ members and a Discord server with close to 2,600 members. You can follow him on Twitter (@ajeetsraina).
