
Running Ollama with Nvidia GPU Acceleration: A Docker Compose Guide

NVIDIA Jetson devices are powerful platforms designed for edge AI applications, offering excellent GPU acceleration capabilities to run compute-intensive tasks like language model inference.

With official support for NVIDIA Jetson devices, Ollama brings the ability to manage and serve Large Language Models (LLMs) locally, ensuring privacy, performance, and offline operation. By integrating Open WebUI, you can enhance your workflow with an intuitive web interface for managing these models.

It is important to note that the original NVIDIA Jetson Nano, with only 4GB of memory, is limited to the smaller end of the LLaMA family. Even a 7B model only becomes practical with optimizations such as quantization: at 4 bits per weight, a 7B model needs roughly 3.5GB for the weights alone, which already approaches the Nano's 4GB limit.

Even with 4-bit quantization, performance on the Jetson Nano will be constrained compared to more powerful hardware. In addition, some users have reported difficulty getting GPU acceleration to work with pre-built binaries on the Jetson Nano, so building from source may be necessary to achieve optimal performance.

This guide will walk you through setting up Ollama on your Jetson device, integrating it with Open WebUI, and configuring the system for optimal GPU utilization. Whether you’re a developer or an AI enthusiast, this setup allows you to harness the full potential of LLMs right on your Jetson device.

Hardware

  1. Jetson Orin Nano
  2. 5V/4A power supply
  3. 64GB SD card
  4. WiFi adapter
  5. Wireless keyboard
  6. Wireless mouse

Software

  • Download the Jetson SD card image from this link
  • Raspberry Pi Imager installed on your local system

Preparing Your Jetson Orin Nano

  1. Unzip the SD card image.
  2. Insert the SD card into your system.
  3. Launch the Raspberry Pi Imager tool and flash the image onto the SD card.

Prerequisite

  • Ensure that you have JetPack 6.0 installed on your Jetson Orin Nano. If you need to (re-)install it, download SDK Manager on a separate Windows or Linux machine and follow the tutorial on the official NVIDIA Developer site.

Step 1. Verify L4T Version

To check the L4T (Linux for Tegra) version on your NVIDIA Jetson device (e.g., Jetson Nano, Jetson Xavier), follow these steps:

Run the following command to retrieve your current L4T version.

head -n 1 /etc/nv_tegra_release
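
On a board flashed with JetPack 6.0, the first line typically looks something like the sketch below (the revision, GCID, and date fields vary by release); the L4T version is the R number combined with the revision, so R36 with REVISION 3.0 corresponds to L4T 36.3.0:

   # R36 (release), REVISION: 3.0, GCID: <build id>, BOARD: generic, EABI: aarch64, DATE: <build date>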

Here is the list of supported L4T versions:

  • 35.3.1
  • 35.4.1
  • 35.5.0
  • 36.3.0

If your L4T version does not match one of the supported versions listed above, you will need to re-flash your Jetson device using SDK Manager on another computer, as described in the prerequisite above.

Step 2. Keep apt up to date:

   sudo apt update && sudo apt upgrade

Step 3. Install the JetPack components:

   sudo apt install nvidia-jetpack
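
To confirm which JetPack release the meta-package pulled in, you can query apt (a quick sanity check; the exact version string depends on your L4T release):

   sudo apt show nvidia-jetpack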

Step 4. Configure Docker permissions

Add your user to the docker group and restart the Docker service to apply the change:

   sudo usermod -aG docker $USER && \
   newgrp docker && \
   sudo systemctl daemon-reload && \
   sudo systemctl restart docker
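
Before moving on, it is worth a quick check that Docker now works without sudo and that the NVIDIA container runtime shipped with JetPack is registered (the grep is just a convenience; the output should list an nvidia runtime):

   docker info | grep -i runtime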

Step 5. Install jetson-examples:

   pip3 install jetson-examples

Step 6. Reboot system

   sudo reboot

Step 7. Install Ollama

   reComputer run ollama

Optional: If you run the above command via ssh and encounter the error command not found: reComputer, this is usually because pip installs the reComputer entry point into ~/.local/bin, which is only added to your PATH by ~/.profile in login shells. You can resolve it by executing the following command:

   source ~/.profile

Step 8. Run a model

One of the smallest LLaMA-style models you can download is TinyLlama, a compact 1.1-billion-parameter model built on the Llama 2 architecture. Despite its small size, TinyLlama performs remarkably well across a range of tasks, making it a good fit for devices with limited compute and memory. You can access TinyLlama through its GitHub repository, via Hugging Face, or directly from the Ollama library.

Let’s run the tinyllama model and perform tasks like generating Python code:

ollama run tinyllama
>>> Can you write a Python script to calculate the factorial of a number?
Sure! Here’s the code:

def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n - 1)

num = int(input("Enter a number: "))
print(f"The factorial of {num} is {factorial(num)}")

Step 9. Install models (e.g. llama3.2) from the Ollama library

ollama pull llama3.2
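
You can confirm the model was downloaded and start an interactive session with it (note that llama3.2 defaults to a 3B-parameter variant, so it needs noticeably more memory than TinyLlama):

ollama list
ollama run llama3.2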

Step 10. Install and run Open WebUI through Docker

docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda

Step 11. Access Open WebUI

Once the installation is finished, you can access the GUI by visiting YOUR_SERVER_IP:3000 in your browser.

Access the API endpoints by navigating to YOUR_SERVER_IP/ollama/docs#/. For comprehensive documentation, please refer to the official resources: the Ollama API Documentation (recommended) and Open WebUI API Endpoints.
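
Because the :cuda image relies on the Ollama instance installed natively in Step 7, it is worth confirming that Ollama is reachable on its default port before opening the UI (this assumes you have not changed the default port 11434):

curl http://localhost:11434/api/version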

Using GPU

This installation method uses a single container image that bundles Open WebUI with Ollama, allowing for a streamlined setup via a single command. Choose the appropriate command based on your hardware setup:

sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

Using CPU only

If you're not using a GPU, use this command instead:

sudo docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

Both commands facilitate a built-in, hassle-free installation of both Open WebUI and Ollama, ensuring that you can get everything up and running swiftly.
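
In either case, you can confirm the container started cleanly before opening the browser; these are generic Docker checks rather than anything specific to Open WebUI:

docker ps --filter name=open-webui
docker logs -f open-webui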

Using Docker Compose

If you prefer a declarative setup, the same containers can be described in a docker-compose.yml file. The file below defines both the bundled :ollama image and the :cuda image on separate host ports; in practice you would typically run only one of the two services, depending on whether you want Ollama bundled inside the container or want Open WebUI to talk to the Ollama instance already running on the host.



services:
  open-webui-ollama:
    image: ghcr.io/open-webui/open-webui:ollama
    container_name: open-webui-ollama
    restart: always
    ports:
      - "3001:8080" # Ollama service runs on a different port
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
    volumes:
      - ollama:/root/.ollama
      - open-webui:/app/backend/data

  open-webui-cuda:
    image: ghcr.io/open-webui/open-webui:cuda
    container_name: open-webui-cuda
    restart: always
    ports:
      - "3002:8080" # CUDA service runs on another port
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  ollama:
  open-webui:

Bringing up the Stack

docker compose up -d 
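
Once the stack is up, you can check both services and confirm Open WebUI answers on the ports defined in the Compose file (3001 for the bundled service, 3002 for the CUDA variant); the curl calls below are simple reachability checks, not full health probes:

docker compose ps
docker compose logs -f open-webui-ollama
curl -I http://localhost:3001
curl -I http://localhost:3002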

Conclusion

Once configured, Open WebUI can be accessed at http://localhost:3000 (or at ports 3001/3002 if you used the Compose file above), while Ollama itself listens on http://localhost:11434. This setup provides a seamless, GPU-accelerated environment for running and managing LLMs locally on NVIDIA Jetson devices.

This guide showcases the power and versatility of NVIDIA Jetson devices when paired with Ollama and Open WebUI, enabling advanced AI workloads at the edge with ease and efficiency.

Have Queries? Join https://launchpass.com/collabnix

Ajeet Singh Raina is a former Docker Captain, Community Leader, and Distinguished Arm Ambassador. He is the founder of the Collabnix blogging site and has authored 700+ blog posts on Docker, Kubernetes, and cloud-native technology. He runs a community Slack with 9,800+ members and a Discord server with 2,600+ members. You can follow him on Twitter (@ajeetsraina).