Using Docker GenAI Stack with GPU for Generative AI Models

The Docker GenAI Stack is a set of open-source tools and technologies that simplifies the development and deployment of Generative AI (GenAI) applications. It aims to make building and running AI models like large language models (LLMs) and other complex AI systems easier and more accessible for developers, especially those not deeply familiar with AI infrastructure.

The GenAI Stack combines four key components:

LangChain: A framework for building LLM-powered applications by composing prompts, models, and external data sources.
Docker: Containerization platform for packaging and running applications in a consistent and isolated environment.
Neo4j: Graph database for storing and managing relationships between data points. In this stack it also holds the vector embeddings used to ground LLM answers.
Ollama: Tool for downloading and running open-source LLMs locally on various hardware platforms, including CPUs and GPUs.

By integrating these components, the GenAI Stack provides several benefits:

Faster Development: Streamlines the development process by providing pre-built modules and tools for common GenAI tasks.
Simplified Deployment: Makes it easier to deploy GenAI applications across different environments and platforms.
Improved Efficiency: Optimizes AI models for better performance and resource utilization.
Accessibility: Lowers the barrier to entry for developers new to GenAI by providing a user-friendly platform.

In essence, the Docker GenAI Stack tackles the complexity of building and deploying GenAI applications, democratizing access to this powerful technology for a wider range of developers and enabling faster innovation in the field.

Using GenAI Stack with GPU

Using a GPU with the Docker GenAI Stack offers several key advantages for working with Generative AI (GenAI) models, particularly large language models (LLMs):

1. Significantly faster training and inference

GPUs are specially designed for parallel processing, making them vastly faster than CPUs for tasks involving large amounts of data, like training and running LLMs. This translates to:

Reduced training time: Your models will be trained in a fraction of the time compared to a CPU-only setup, speeding up development and experimentation.
Real-time responsiveness: Inference, the process of generating output from a trained model, becomes much faster, allowing for real-time interactions with LLMs, for example in chatbots or voice assistants.

2. Increased model capacity and complexity

GPUs enable training and running larger and more complex LLMs that wouldn’t be feasible on CPUs. This allows you to:

Handle larger datasets: Train on more data to create more accurate and comprehensive models.
Build models with higher parameter counts: This leads to LLMs with greater capabilities and more nuanced understanding of language.
Explore more advanced architectures: Leverage the power of GPUs to experiment with cutting-edge LLM architectures.

3. Improved resource efficiency

Running GenAI workloads on GPUs can provide better resource utilization, leading to:

Reduced power consumption: For highly parallel workloads, a GPU can often complete the same work with less energy than a CPU, even though its peak power draw is higher, which saves energy and operating costs.
Lower overall hardware cost: In some cases, using a single GPU-powered system can be more cost-effective than a larger number of CPU-based machines for GenAI tasks.

4. Simplified development and deployment

The GenAI Stack is designed to leverage the capabilities of GPUs seamlessly, making it:

Easier to build and deploy GPU-accelerated GenAI applications: The stack handles resource allocation and configuration for you, streamlining the process.
More portable: You can easily move your GenAI workflow across different environments with GPU support without major adjustments.

Getting Started

Prerequisites

This tutorial was tested on an Ubuntu cloud instance with an NVIDIA GPU. If you want to try it on Docker Desktop, I recommend Docker Desktop for Windows with the WSL2 backend, which is currently the only Docker Desktop configuration that supports GPUs.

1. Download the required NVIDIA driver

Visit the official NVIDIA drivers page to download and install the proper drivers. Reboot your system once you have done so.

2. Install nvidia-container-runtime

Follow the instructions at https://nvidia.github.io/nvidia-container-runtime/ and then run this command:

 sudo apt-get install nvidia-container-runtime

Ensure the nvidia-container-runtime-hook is accessible from $PATH.

 which nvidia-container-runtime-hook

3. Restart the Docker daemon
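
The step above does not name a command. A minimal sketch, assuming the host manages Docker with systemd (the default on recent Ubuntu releases); restart_docker is only an illustrative helper name, not something Docker provides:

```shell
# Assumption: Docker is managed by systemd, as on recent Ubuntu releases.
# restart_docker is a hypothetical helper name, not part of Docker.
restart_docker() {
  # Restart the daemon, then print its state ("active" when it is running).
  sudo systemctl restart docker && systemctl is-active docker
}
```

Call restart_docker once after installing the runtime so that the daemon picks up the new hook.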

4. Expose GPUs for use

Include the --gpus flag when you start a container to give it access to GPU resources. The flag's value specifies how many GPUs to use, or all to use every GPU on the host. For example:

docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
29202e855b20: Pull complete
Digest: sha256:e6173d4dc55e76b87c4af8db8821b1feae4146dd47341e4d431118c7dd060a74
Status: Downloaded newer image for ubuntu:latest
Mon Jan 29 02:36:51 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-4C       On   | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

To expose only a specific GPU, pass its device index:

docker run -it --rm --gpus '"device=0"' ubuntu nvidia-smi
Mon Jan 29 02:42:33 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-4C       On   | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note that NVIDIA GPUs can only be accessed by systems running a single engine.

Set NVIDIA capabilities

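You can set capabilities manually. The utility capability adds the nvidia-smi tool to the container. Because the compute capability is not requested here, the output reports CUDA Version: N/A. For example, on Ubuntu you can run the following: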

 docker run --gpus 'all,capabilities=utility' --rm ubuntu nvidia-smi
Mon Jan 29 02:43:47 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-4C       On   | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Clone the repo

git clone https://github.com/docker/genai-stack
cd genai-stack

Setting up the environment variables

cat .env
OPENAI_API_KEY=sk-EsNJzI5uMBCXXXXXXXX0Htnig8KIil4x
OLLAMA_BASE_URL=http://llm-gpu:11434
NEO4J_URI=neo4j://database:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=XXXXX
LLM=llama2 #or any Ollama model tag, or gpt-4 or gpt-3.5
EMBEDDING_MODEL=sentence_transformer #or openai or ollama

LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_TRACING_V2=true # false
LANGCHAIN_PROJECT=default
LANGCHAIN_API_KEY=ls__cbXXXXXXX1d6106dd
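
Before starting the stack, it can help to confirm that the file defines the variables the services rely on. The following is a minimal sketch, not part of the repository: check_env_file is a hypothetical helper, and the list of keys it checks is an assumption taken from the sample file above rather than an official list of required settings.

```shell
# check_env_file is a hypothetical helper. The key list is an assumption
# based on the sample .env shown above, not an official list.
check_env_file() {
  env_file="${1:-.env}"
  missing=0
  for key in OLLAMA_BASE_URL NEO4J_URI NEO4J_USERNAME NEO4J_PASSWORD LLM EMBEDDING_MODEL; do
    # A key counts as set when a line starts with KEY= followed by at least one character.
    if ! grep -Eq "^${key}=.+" "$env_file" 2>/dev/null; then
      echo "missing or empty: ${key}"
      missing=1
    fi
  done
  # Exit status 0 means every key was found, 1 means at least one was not.
  return "$missing"
}
```

Running check_env_file with no argument inspects .env in the current directory; pass a path to check a different file.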

Bringing up Docker GenAI Stack

docker compose --profile linux-gpu up

Verifying that all the required services are up and running

In a second terminal, list the services with docker compose --profile linux-gpu ps:

NAME                      IMAGE                   COMMAND                                                                  SERVICE     CREATED         STATUS                     PORTS
genai-stack-api-1         genai-stack-api         "uvicorn api:app --host 0.0.0.0 --port 8504"                             api         8 minutes ago   Up 6 minutes (healthy)     0.0.0.0:8504->8504/tcp, :::8504->8504/tcp
genai-stack-bot-1         genai-stack-bot         "streamlit run bot.py --server.port=8501 --server.address=0.0.0.0"       bot         8 minutes ago   Up 6 minutes (healthy)     0.0.0.0:8501->8501/tcp, :::8501->8501/tcp
genai-stack-database-1    neo4j:5.11              "tini -g -- /startup/docker-entrypoint.sh neo4j"                         database    8 minutes ago   Up 8 minutes (healthy)     0.0.0.0:7474->7474/tcp, :::7474->7474/tcp, 7473/tcp, 0.0.0.0:7687->7687/tcp, :::7687->7687/tcp
genai-stack-front-end-1   genai-stack-front-end   "npm run dev"                                                            front-end   8 minutes ago   Up 5 minutes               0.0.0.0:8505->8505/tcp, :::8505->8505/tcp
genai-stack-llm-gpu-1     ollama/ollama:latest    "/bin/ollama serve"                                                      llm-gpu     4 weeks ago     Up 20 minutes              11434/tcp
genai-stack-loader-1      genai-stack-loader      "streamlit run loader.py --server.port=8502 --server.address=0.0.0.0"    loader      8 minutes ago   Up 6 minutes (unhealthy)   0.0.0.0:8502->8502/tcp, :::8502->8502/tcp, 0.0.0.0:8081->8080/tcp, :::8081->8080/tcp
genai-stack-pdf_bot-1     genai-stack-pdf_bot     "streamlit run pdf_bot.py --server.port=8503 --server.address=0.0.0.0"   pdf_bot     8 minutes ago   Up 6 minutes (healthy)     0.0.0.0:8503->8503/tcp, :::8503->8503/tcp

The log output of docker compose up also shows the pull-model service downloading the model:

genai-stack-pull-model-1  | pulling 8934d96d3f08... 100% ▕▏ 3.8 GB
genai-stack-pull-model-1  | pulling 8c17c2ebb0ea... 100% ▕▏ 7.0 KB
genai-stack-pull-model-1  | pulling 7c23fb36d801... 100% ▕▏ 4.8 KB
genai-stack-pull-model-1  | pulling 2e0493f67d0c... 100% ▕▏   59 B
genai-stack-pull-model-1  | pulling fa304d675061... 100% ▕▏   91 B
genai-stack-pull-model-1  | pulling 42ba7f8a01dd... 100% ▕▏  557 B
...

Here’s what’s in this repo:

Name	Main files	Compose name	URLs	Description
Support Bot	`bot.py`	`bot`	http://localhost:8501	Main usecase. Fullstack Python application.
Stack Overflow Loader	`loader.py`	`loader`	http://localhost:8502	Load SO data into the database (create vector embeddings etc). Fullstack Python application.
PDF Reader	`pdf_bot.py`	`pdf_bot`	http://localhost:8503	Read local PDF and ask it questions. Fullstack Python application.
Standalone Bot API	`api.py`	`api`	http://localhost:8504	Standalone HTTP API streaming (SSE) + non-streaming endpoints Python.
Standalone Bot UI	`front-end/`	`front-end`	http://localhost:8505	Standalone client that uses the Standalone Bot API to interact with the model. JavaScript (Svelte) front-end.

The database can be explored at http://localhost:7474.
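
To check quickly that each of these endpoints answers, something like the sketch below can be used. It assumes curl is installed; service_urls and check_services are hypothetical helper names, the ports are the ones listed above, and the host defaults to localhost.

```shell
# service_urls and check_services are hypothetical helpers. The ports are the
# ones listed above; the host defaults to localhost.
service_urls() {
  host="${1:-localhost}"
  for port in 8501 8502 8503 8504 8505 7474; do
    echo "http://${host}:${port}"
  done
}

check_services() {
  for url in $(service_urls "$1"); do
    # -s hides progress, -o discards the body, -w prints only the HTTP status code.
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
    echo "${url} -> ${code}"
  done
}
```

Run check_services on the host itself, or pass the host's IP address as the first argument when checking from another machine.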

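Verifying that the GPU is being used

Let’s try the PDF bot sample app and check whether the GPU is being used.

Open http://HostIP:8503/ to access the PDF bot, replacing HostIP with the IP address of your host.

After you upload a PDF and ask a question, run nvidia-smi on the host. A new process appears in the Processes section of the output, which indicates that the GPU is being used.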
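
The same check can be scripted. The sketch below assumes nvidia-smi is available on the host; gpu_processes is a hypothetical helper that formats the CSV rows produced by the nvidia-smi compute-apps query. On some virtualized GPUs the driver may not report per-process details, in which case the list stays empty.

```shell
# gpu_processes is a hypothetical helper. It expects CSV rows in the form
# "pid, process_name, used_memory" with no header line.
gpu_processes() {
  awk -F', *' '
    NF >= 3 { printf "PID %s (%s) uses %s\n", $1, $2, $3; found = 1 }
    END     { if (!found) print "no GPU compute processes found" }
  '
}

# On a host with the NVIDIA driver installed, feed it live data like this:
# nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#   --format=csv,noheader | gpu_processes
```

Run the commented pipeline once before and once after sending a question to the PDF bot and compare the two results.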

Using nvitop

You can also use a tool such as nvitop to monitor GPU usage in real time.