
Running Ollama with Nvidia GPU Acceleration: A Docker Compose Guide


Large Language Models (LLMs) are revolutionizing various fields, pushing the boundaries of what machines can achieve. However, their complexity demands ever-increasing processing power. This is where accelerators like Nvidia GPUs come into play, offering a significant boost for training and inference tasks.

In this blog post, we’ll guide you through running Ollama, a popular self-hosted LLM server, with Docker Compose while leveraging the raw power of your Nvidia GPU. We’ll walk through the configuration details so you get the most out of your LLM experience.

Prerequisites:

  • Docker and Docker Compose: Ensure Docker and Docker Compose are installed and running on your system. You can find installation instructions on the official Docker website: https://docs.docker.com/engine/install/
  • Nvidia GPU: Your system must have an Nvidia GPU installed and configured. Verify this by running nvidia-smi in your terminal (see the quick check after this list). If the command fails or returns an error, refer to Nvidia’s documentation for configuration guidance: https://docs.nvidia.com/
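
In addition to the host driver, Docker needs the NVIDIA Container Toolkit installed so it can expose the GPU to containers. Here is a quick sanity check for both (a minimal sketch; the plain ubuntu image is used only as an example, since the toolkit injects nvidia-smi into the container at runtime):

# On the host: confirm the driver sees the GPU
nvidia-smi

# Inside a throwaway container: confirm Docker + NVIDIA Container Toolkit can expose it
docker run --rm --gpus all ubuntu nvidia-smi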

Getting Started

Now, let’s explore the key components of the docker-compose.yml file that facilitates running Ollama with GPU acceleration:

Docker Compose Version

Older Compose files begin with a top-level version property (examples commonly show 3.8 or 3.9). Under the current Compose Specification, this field is obsolete and Docker Compose v2 simply ignores it, which is why the configuration below omits it. If you are still using the legacy docker-compose v1 binary, adding version: "3.8" at the top keeps the file compatible with it.
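
If you’re unsure which Compose you have, you can check from the terminal (both commands are standard, though only one may exist on your system depending on how Docker was installed):

# Compose v2, shipped as a Docker CLI plugin
docker compose version

# Legacy standalone binary, if installed
docker-compose --version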

Ollama Service Definition

The services section defines the ollama service, which encapsulates the Ollama container. Here’s a breakdown of its important properties:

  • image: This specifies the Docker image for Ollama. The default is ollama/ollama, but you can pin a specific version tag if needed (refer to Ollama’s documentation for available tags).
  • deploy: This section configures resource reservations for the Ollama container; this is where the magic happens for harnessing the GPU. A focused snippet of this stanza follows the list.
      • resources: Defines the resource requirements for the container.
      • reservations: Allows you to reserve specific devices for the container.
      • devices: Defines a device reservation. Within this nested configuration, we specify:
          • driver: Set to nvidia to indicate we’re requesting an Nvidia GPU.
          • capabilities: Lists the capabilities requested by Ollama. Here we specify “gpu” to signify we want to leverage the GPU for processing.
          • count: Determines how many Nvidia GPUs to reserve for Ollama. Use all to utilize all available GPUs, or specify a number if you have multiple GPUs and want to dedicate a subset to Ollama.
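
Taken in isolation, that nesting looks like the following snippet (just the deploy stanza of the ollama service; the complete file appears later in this post):

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities: ["gpu"]
          count: all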

Persistent Volume Definition

The volumes section defines a persistent volume named ollama. This volume ensures that data generated by Ollama, such as downloaded models and configuration, persists across container restarts. It’s mounted at the /root/.ollama directory inside the Ollama container.
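
After the stack has started, you can confirm where Docker keeps this data on the host. A hedged example (Compose prefixes the volume name with your project name, usually the directory containing the file, so adjust the name accordingly):

# List volumes and look for the one ending in "_ollama"
docker volume ls

# Inspect it to see the mountpoint on the host (replace <project> with your project name)
docker volume inspect <project>_ollama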

Putting it All Together

Here’s the complete docker-compose.yml configuration for running Ollama with Nvidia GPU acceleration using Docker Compose:

services:
  ollama:
    container_name: ollama
    image: ollama/ollama  # Replace with specific Ollama version if needed
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            capabilities: ["gpu"]
            count: all  # Adjust count for the number of GPUs you want to use
    volumes:
      - ollama:/root/.ollama
    restart: always

volumes:
  ollama:

Running Ollama with GPU Acceleration:

With the configuration file ready, save it as docker-compose.yml in your desired directory. Now, you can run the following command to start Ollama with GPU support:

docker-compose up -d

The -d flag runs the container in the background. If you’re on Docker Compose v2, the equivalent command is docker compose up -d.
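
Once the container is up, you can interact with Ollama through its CLI inside the container. A minimal sketch; llama3 is only an example model name, so substitute any model available in the Ollama library:

# Open an interactive session that pulls the model (if needed) and starts a chat
docker exec -it ollama ollama run llama3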

Verification:

After running the command, you can check Ollama’s logs to see if the Nvidia GPU is being utilized. Look for messages indicating “Nvidia GPU detected via cudart” or similar wording within the logs. This confirmation signifies successful GPU integration with Ollama.
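
A quick way to check from the host (the grep pattern is just a loose filter for GPU-related lines like the one mentioned above):

# Show the Ollama container logs
docker logs ollama

# Or filter for GPU-related messages
docker logs ollama 2>&1 | grep -i "gpu"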

Additional Considerations:

  • Refer to Ollama’s official documentation for any additional configuration or resource requirements based on your specific use case.
  • Adjust the count value in the devices section to match the number of Nvidia GPUs you want to dedicate to Ollama.
  • This example targets Nvidia GPUs and won’t work for AMD ROCm as-is. It’s meant to illustrate the concept of device reservations within a Compose file, which currently doesn’t offer an equivalent option for ROCm.

Have Queries? Join https://launchpass.com/collabnix

Ajeet Singh Raina is a former Docker Captain, Community Leader and Arm Ambassador. He is the founder of the Collabnix blogging site and has authored more than 570 blogs on Docker, Kubernetes and Cloud-Native technology. He runs a community Slack of 8,900+ members and a Discord server of close to 2,200 members. You can follow him on Twitter (@ajeetsraina).