NVIDIA Jetson devices are powerful platforms designed for edge AI applications, offering excellent GPU acceleration capabilities to run compute-intensive tasks like language model inference.
With official support for NVIDIA Jetson devices, Ollama brings the ability to manage and serve Large Language Models (LLMs) locally, ensuring privacy, performance, and offline operation. By integrating Open WebUI, you can enhance your workflow with an intuitive web interface for managing these models.
Note that the original NVIDIA Jetson Nano, with only 4GB of memory, is limited to smaller LLaMA-family models; even a 7B model is a tight fit and typically requires optimizations such as quantization to reduce memory usage.
For instance, 4-bit quantization can make it feasible to run a 7B model on the Jetson Nano, although performance will still be constrained compared to more powerful hardware. Additionally, some users have reported trouble getting GPU acceleration with pre-built binaries on the Jetson Nano, so building from source may be necessary to achieve optimal performance.
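As a rough illustration of why quantization matters here: a 7B model stored in 16-bit weights needs around 14 GB just for the weights, while a 4-bit quantized copy needs roughly 3.5–4 GB plus runtime overhead. Once Ollama is installed (Step 7 below), choosing a quantized build is simply a matter of pulling the right tag. The tag below is only an example; check the model's page in the Ollama library for the quantization tags actually published:
# Example only: pull a 4-bit quantized 7B variant instead of the default tag.
# Verify the exact tag name on the Ollama library page for the model you want.
ollama pull llama2:7b-chat-q4_0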
This guide will walk you through setting up Ollama on your Jetson device, integrating it with Open WebUI, and configuring the system for optimal GPU utilization. Whether you’re a developer or an AI enthusiast, this setup allows you to harness the full potential of LLMs right on your Jetson device.
Prerequisites
- Jetson Orin Nano
- A 5V 4A power supply
- 64GB SD card
- WiFi adapter
- Wireless keyboard
- Wireless mouse
Software
- Download the Jetson SD card image from this link
- Raspberry Pi Imager installed on your local system
Preparing Your Jetson Orin Nano
- Unzip the SD card image
- Insert the SD card into your system
- Open the Raspberry Pi Imager tool and flash the image onto the SD card
Prerequisite
- Ensure that JetPack 6.0 is installed on your Jetson Orin Nano device. If it is not, download SDK Manager on a separate Windows or Linux machine and follow the tutorial on the official NVIDIA Developer site.
Step 1. Verify L4T Version
To check the L4T (Linux for Tegra) version on your NVIDIA Jetson device (e.g., Jetson Nano, Jetson Xavier), follow these steps:
Run the following command to retrieve your current L4T version.
head -n 1 /etc/nv_tegra_release
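The command prints a single release line. The values below only illustrate the format you should expect (here corresponding to L4T 36.3, i.e. JetPack 6.0); your revision, build ID, board, and date will differ:
# R36 (release), REVISION: 3.0, GCID: <build-id>, BOARD: <board>, EABI: aarch64, DATE: <date>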
Here is the list of supported L4T versions:
- 35.3.1
- 35.4.1
- 35.5.0
- 36.3.0
If your L4T version does not match one of the supported versions listed above, you may need to re-flash your NVIDIA Jetson device using SDK Manager on another computer. You can download SDK Manager and follow the tutorial on the official NVIDIA Developer site.
Step 2. Keep apt up to date:
sudo apt update && sudo apt upgrade
Step 3. Install the JetPack components:
sudo apt install nvidia-jetpack
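To confirm the JetPack components were installed via apt (assuming the metapackage name is nvidia-jetpack, as shipped on JetPack 5.x/6.x), you can query dpkg:
# Shows the installed nvidia-jetpack metapackage and its version (e.g. 6.x for JetPack 6)
dpkg -l nvidia-jetpack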
Step 4. Add user to the docker group
Add your user to the docker group and restart the Docker service to apply the change:
sudo usermod -aG docker $USER && \
newgrp docker && \
sudo systemctl daemon-reload && \
sudo systemctl restart docker
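Before moving on, it is worth confirming that the group change took effect and that Docker can see the NVIDIA container runtime installed with JetPack. A quick sanity check using standard Docker commands:
# Should run without sudo once the docker group membership is active
docker run --rm hello-world

# "nvidia" should appear among the listed runtimes
docker info | grep -i runtime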
Step 5. Install jetson-examples:
pip3 install jetson-examples
Step 6. Reboot system
sudo reboot
Step 7. Install Ollama
reComputer run ollama
Optional: If you run the above command via ssh and encounter the error command not found: reComputer, you can resolve this by executing the following command:
source ~/.profile
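After installation, Ollama runs as a local server listening on port 11434. A minimal way to verify it is up, using the standard Ollama REST endpoint that lists locally installed models:
# Returns a JSON object with a "models" array (empty until you pull a model)
curl http://localhost:11434/api/tags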
Step 8. Run a model
One of the smallest LLaMA-architecture models available through Ollama is TinyLlama, a compact 1.1-billion-parameter model. Despite its small size, TinyLlama performs well across a variety of tasks, making it a good fit for devices with limited computational resources. You can also find TinyLlama on its GitHub repository or via Hugging Face.
Let’s run the tinyllama model and perform tasks like generating Python code:
ollama run tinyllama
>>> Can you write a Python script to calculate the factorial of a number?
Sure! Here’s the code:

def factorial(n):
    if n == 0 or n == 1:
        return 1
    else:
        return n * factorial(n - 1)

num = int(input("Enter a number: "))
print(f"The factorial of {num} is {factorial(num)}")
Step 9. Install models (e.g. llama3.2) from Ollama Library
ollama pull llama3.2
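To confirm the download and try the model without opening an interactive session, you can list the installed models and pass a one-off prompt directly on the command line (the prompt text is just an example):
# Show locally installed models and their sizes
ollama list

# Run a single prompt non-interactively (example prompt)
ollama run llama3.2 "Explain in one sentence what the NVIDIA Jetson Orin Nano is."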
Step 10. Install and run Open WebUI through Docker
docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda
Once the installation is finished, you can access the GUI by visiting YOUR_SERVER_IP:3000 in your browser.
Access the API endpoints by navigating to YOUR_SERVER_IP/ollama/docs#/. For comprehensive documentation, please refer to the official resources: the Ollama API Documentation (recommended) and Open WebUI API Endpoints.
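If you prefer to script against Ollama directly rather than going through the Open WebUI interface, the documented /api/generate endpoint can be called with curl. The model name below assumes you pulled llama3.2 in Step 9:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'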
Using GPU
This installation method uses a single container image that bundles Open WebUI with Ollama, allowing for a streamlined setup via a single command. Choose the appropriate command based on your hardware setup:
sudo docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
Using CPU only
If you’re not using a GPU, use this command instead:
sudo docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
Both commands facilitate a built-in, hassle-free installation of both Open WebUI and Ollama, ensuring that you can get everything up and running swiftly.
Using Docker Compose
services:
  open-webui-ollama:
    image: ghcr.io/open-webui/open-webui:ollama
    container_name: open-webui-ollama
    restart: always
    ports:
      - "3001:8080" # Ollama-bundled service runs on a different port
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
    volumes:
      - ollama:/root/.ollama
      - open-webui:/app/backend/data

  open-webui-cuda:
    image: ghcr.io/open-webui/open-webui:cuda
    container_name: open-webui-cuda
    restart: always
    ports:
      - "3002:8080" # CUDA service runs on another port
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
    volumes:
      - open-webui:/app/backend/data
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  ollama:
  open-webui:
Bringing up the Stack
docker compose up -d
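To verify that the stack came up cleanly and to watch the logs while the containers initialize (standard Docker Compose commands, run from the directory containing the compose file):
# Show the status of the services defined in the compose file
docker compose ps

# Follow the logs of the Ollama-bundled service defined above
docker compose logs -f open-webui-ollama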
Conclusion
Once configured, Open WebUI can be accessed at http://localhost:3000 (or whichever host port you mapped, such as 3001 or 3002 in the Compose file above), while Ollama listens at http://localhost:11434. This setup provides a seamless, GPU-accelerated environment for running and managing LLMs locally on NVIDIA Jetson devices.
This guide showcases the power and versatility of NVIDIA Jetson devices when paired with Ollama and Open WebUI, enabling advanced AI workloads at the edge with ease and efficiency.