Docker Model Runner

How Docker Model Runner works


With Docker Model Runner, the AI model does not run in a container. Instead, Docker Model Runner uses a host-installed inference server (currently llama.cpp) that runs natively on your Mac rather than containerizing the model. Support for additional inference engines (e.g. MLX) is planned for future releases.

1. Host-level process:

  • Docker Desktop runs llama.cpp directly on your host machine
  • This allows direct access to the hardware GPU acceleration on Apple Silicon

2. GPU acceleration:

  • By running directly on the host, the inference server can access Apple’s Metal API
  • This provides direct GPU acceleration without the overhead of containerization
  • You can see the GPU usage in Activity Monitor when queries are being processed

3. Model Loading Process:

  • When you run docker model pull, the model files are downloaded from Docker Hub
  • These models are cached locally on the host machine’s storage 
  • Models are dynamically loaded into memory by llama.cpp when needed
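The loading flow above can be sketched with the `docker model` CLI that ships with Docker Desktop; the model name `ai/smollm2` is an illustrative example, and exact subcommands may vary by Docker Desktop version:

```shell
# Download the model files (an OCI artifact) from Docker Hub into the host cache
docker model pull ai/smollm2

# List models cached locally on the host
docker model list

# Send a one-off prompt; llama.cpp loads the weights into memory on demand
docker model run ai/smollm2 "Say hello in one sentence."
```

These commands talk to the host-level llama.cpp process, so you should see GPU activity in Activity Monitor while the last command runs.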

Note: Unlike traditional Docker containers that package large AI models with the model runtime (which results in slow deployment), Model Runner separates the model from the runtime, allowing for faster deployment. 
Model Runner enables local LLM execution. It runs large language models (LLMs) directly on your machine rather than sending data to external API services. This means your data never leaves your infrastructure. All you need to do is pull the model from Docker Hub and start integrating it with your application.
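As a sketch of that integration, Model Runner exposes an OpenAI-compatible API. The host port (12434) and endpoint path below are assumptions that depend on your Docker Desktop configuration (TCP host access must be enabled in the Model Runner settings), and `ai/smollm2` is an illustrative model name:

```shell
# Chat completion against the local Model Runner endpoint -- no data leaves your machine
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ai/smollm2",
        "messages": [{"role": "user", "content": "Hello from my app"}]
      }'
```

Because the API shape is OpenAI-compatible, existing client libraries can typically be pointed at this base URL instead of an external service.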

Models are stored as OCI artifacts in Docker Hub. Because OCI is a standardized format, these models can also be hosted by any other Docker registry, including your company's internal artifact registries. It also means that anyone can build and push models to Docker Hub or another registry (tooling coming soon). Docker will seed that ecosystem by initially publishing the most popular models.

Working with models as OCI artifacts has a number of advantages over the traditional approach of packaging an AI runtime and the model together in a Docker image. For example, artifact layers are stored without compression, which suits model weights well because they are largely incompressible. The result is faster deployments and lower disk usage, since you don't need to keep both compressed and uncompressed copies of the model weights on disk.
