The Docker GenAI Stack is a set of open-source tools and technologies that simplifies the development and deployment of Generative AI (GenAI) applications. It aims to make building and running AI models like large language models (LLMs) and other complex AI systems easier and more accessible for developers, especially those not deeply familiar with AI infrastructure.
The GenAI Stack combines four key components:
- LangChain: A framework, used here through its Python library, for building LLM-powered applications by composing prompts, models, and data retrieval into chains.
- Docker: Containerization platform for packaging and running applications in a consistent and isolated environment.
- Neo4j: Graph database for storing and querying relationships between data points. In this stack it also holds the vector embeddings used to ground LLM answers.
- Ollama: Tool for downloading and running open-source LLMs locally on CPUs or GPUs.
By integrating these components, the GenAI Stack provides several benefits:
- Faster Development: Streamlines the development process by providing pre-built modules and tools for common GenAI tasks.
- Simplified Deployment: Makes it easier to deploy GenAI applications across different environments and platforms.
- Improved Efficiency: Optimizes AI models for better performance and resource utilization.
- Accessibility: Lowers the barrier to entry for developers new to GenAI by providing a user-friendly platform.
In essence, the Docker GenAI Stack tackles the complexity of building and deploying GenAI applications, democratizing access to this powerful technology for a wider range of developers and enabling faster innovation in the field.
Using GenAI Stack with GPU
Using a GPU with the Docker GenAI Stack offers several key advantages for working with Generative AI (GenAI) models, particularly large language models (LLMs):
1. Significantly faster training and inference
GPUs are specially designed for parallel processing, making them vastly faster than CPUs for tasks involving large amounts of data, like training and running LLMs. This translates to:
- Reduced training time: Your models will be trained in a fraction of the time compared to a CPU-only setup, speeding up development and experimentation.
- Real-time responsiveness: Inference, or generating output from a trained model, becomes much faster, allowing for real-time interactions with LLMs, for example in chatbots or voice assistants.
2. Increased model capacity and complexity
GPUs enable training and running larger and more complex LLMs that wouldn’t be feasible on CPUs. This allows you to:
- Handle larger datasets: Train on more data to create more accurate and comprehensive models.
- Build models with higher parameter counts: This leads to LLMs with greater capabilities and more nuanced understanding of language.
- Explore more advanced architectures: Leverage the power of GPUs to experiment with cutting-edge LLM architectures.
3. Improved resource efficiency
Running GenAI workloads on GPUs can provide better resource utilization, leading to:
- Reduced power consumption: For highly parallel workloads, a GPU can often finish the same job using less total energy than a CPU, which lowers operating costs.
- Lower overall hardware cost: In some cases, using a single GPU-powered system can be more cost-effective than a larger number of CPU-based machines for GenAI tasks.
4. Simplified development and deployment
The GenAI Stack is designed to leverage the capabilities of GPUs seamlessly, making it:
- Easier to build and deploy GPU-accelerated GenAI applications: The linux-gpu Compose profile used later in this tutorial starts Ollama with GPU access, so you do not have to configure GPU resource allocation by hand.
- More portable: You can easily move your GenAI workflow across different environments with GPU support without major adjustments.
Getting Started
Prerequisites
This tutorial was tested on an Ubuntu cloud instance with an NVIDIA GPU. If you want to try it on Docker Desktop, use Docker Desktop for Windows with the WSL2 backend, since GPU support in Docker Desktop is currently available only in that configuration.
1. Download the required NVIDIA driver
Visit the official NVIDIA drivers page to download and install the proper drivers. Reboot your system once you have done so.
2. Install nvidia-container-runtime
Follow the instructions at https://nvidia.github.io/nvidia-container-runtime/ and then run this command:
sudo apt-get install nvidia-container-runtime
Ensure the nvidia-container-runtime-hook is accessible from $PATH.
which nvidia-container-runtime-hook
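The same PATH check can be scripted. The following is a minimal Python sketch and is not part of the GenAI Stack; the function name find_missing_tools and the list of tool names are illustrative assumptions.

```python
import shutil


def find_missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    touching the real environment.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    # Tool names taken from the steps above; adjust them for your setup.
    missing = find_missing_tools(["docker", "nvidia-container-runtime-hook"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools are on PATH.")
```

An empty result means every listed executable was found, which is the same condition the which command checks for a single tool.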
3. Restart the Docker daemon
On a systemd-based system such as Ubuntu:
sudo systemctl restart docker
4. Expose GPUs for use
Include the --gpus flag when you start a container to access GPU resources, and use its value to specify which GPUs to expose. For example, to expose all GPUs:
docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
29202e855b20: Pull complete
Digest: sha256:e6173d4dc55e76b87c4af8db8821b1feae4146dd47341e4d431118c7dd060a74
Status: Downloaded newer image for ubuntu:latest
Mon Jan 29 02:36:51 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100D-4C On | 00000000:06:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
To expose only a specific GPU, pass its device index:
docker run -it --rm --gpus '"device=0"' ubuntu nvidia-smi
Mon Jan 29 02:42:33 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100D-4C On | 00000000:06:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Please note that NVIDIA GPUs can only be accessed by systems running a single engine.
Set NVIDIA capabilities
You can set capabilities manually. For example, on Ubuntu you can run the following:
docker run --gpus 'all,capabilities=utility' --rm ubuntu nvidia-smi
Mon Jan 29 02:43:47 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100D-4C On | 00000000:06:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
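The three --gpus values used above (all GPUs, a single device, and all GPUs with a restricted capability) can be generated programmatically when the commands are launched from a script. This is a small Python sketch under that assumption; the helper names gpus_value and docker_run_argv are hypothetical, and the sketch only covers the three forms shown in this section.

```python
import shlex


def gpus_value(device=None, capability=None):
    """Return a --gpus value for one of the three cases shown above.

    Only one of `device` or `capability` is honored in this sketch;
    `device` takes precedence if both are given.
    """
    if device is not None:
        # The device form carries literal double quotes in its value.
        return f'"device={device}"'
    if capability is not None:
        return f"all,capabilities={capability}"
    return "all"


def docker_run_argv(gpus, image="ubuntu", command=("nvidia-smi",)):
    """Build the argument list for the docker run calls used above."""
    return ["docker", "run", "-it", "--rm", "--gpus", gpus, image, *command]


if __name__ == "__main__":
    for value in (gpus_value(), gpus_value(device=0), gpus_value(capability="utility")):
        # shlex.join adds the shell quoting needed to paste the command.
        print(shlex.join(docker_run_argv(value)))
```

Passing the argument list to subprocess.run avoids shell quoting entirely; shlex.join is only used here to print a copy-pasteable command line.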
Clone the repo
git clone https://github.com/docker/genai-stack
cd genai-stack
Setting up the environment variables
cat .env
OPENAI_API_KEY=sk-EsNJzI5uMBCXXXXXXXX0Htnig8KIil4x
OLLAMA_BASE_URL=http://llm-gpu:11434
NEO4J_URI=neo4j://database:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=XXXXX
LLM=llama2 #or any Ollama model tag, or gpt-4 or gpt-3.5
EMBEDDING_MODEL=sentence_transformer #or openai or ollama
LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_TRACING_V2=true # false
LANGCHAIN_PROJECT=default
LANGCHAIN_API_KEY=ls__cbXXXXXXX1d6106dd
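Before starting the stack, it can help to confirm that the .env file points Ollama at the GPU service. The sketch below is a simplified Python parser written for this tutorial, not the loader the repository itself uses. It assumes, based on the .env above, that OLLAMA_BASE_URL should name the llm-gpu service when the linux-gpu profile is used. The function names parse_env and check_gpu_env are hypothetical.

```python
from urllib.parse import urlparse


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Simplification: everything after a '#' in a value is treated as an
    inline comment, so values that contain '#' are not supported.
    """
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment such as "llama2 #or any Ollama model tag".
        value = value.split("#", 1)[0].strip().strip('"')
        values[key.strip()] = value
    return values


def check_gpu_env(values, service="llm-gpu"):
    """Return a list of problems found for a GPU-profile setup."""
    problems = []
    host = urlparse(values.get("OLLAMA_BASE_URL", "")).hostname
    if host != service:
        problems.append(f"OLLAMA_BASE_URL should point at '{service}', found host {host!r}")
    if not values.get("LLM"):
        problems.append("LLM is not set")
    return problems
```

For example, check_gpu_env(parse_env(open(".env").read())) returns an empty list when both settings look consistent with the values shown above.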
Bringing up Docker GenAI Stack
docker compose --profile linux-gpu up
Verifying that all the required services are up and running
docker compose --profile linux-gpu ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
genai-stack-api-1 genai-stack-api "uvicorn api:app --host 0.0.0.0 --port 8504" api 8 minutes ago Up 6 minutes (healthy) 0.0.0.0:8504->8504/tcp, :::8504->8504/tcp
genai-stack-bot-1 genai-stack-bot "streamlit run bot.py --server.port=8501 --server.address=0.0.0.0" bot 8 minutes ago Up 6 minutes (healthy) 0.0.0.0:8501->8501/tcp, :::8501->8501/tcp
genai-stack-database-1 neo4j:5.11 "tini -g -- /startup/docker-entrypoint.sh neo4j" database 8 minutes ago Up 8 minutes (healthy) 0.0.0.0:7474->7474/tcp, :::7474->7474/tcp, 7473/tcp, 0.0.0.0:7687->7687/tcp, :::7687->7687/tcp
genai-stack-front-end-1 genai-stack-front-end "npm run dev" front-end 8 minutes ago Up 5 minutes 0.0.0.0:8505->8505/tcp, :::8505->8505/tcp
genai-stack-llm-gpu-1 ollama/ollama:latest "/bin/ollama serve" llm-gpu 4 weeks ago Up 20 minutes 11434/tcp
genai-stack-loader-1 genai-stack-loader "streamlit run loader.py --server.port=8502 --server.address=0.0.0.0" loader 8 minutes ago Up 6 minutes (unhealthy) 0.0.0.0:8502->8502/tcp, :::8502->8502/tcp, 0.0.0.0:8081->8080/tcp, :::8081->8080/tcp
genai-stack-pdf_bot-1 genai-stack-pdf_bot "streamlit run pdf_bot.py --server.port=8503 --server.address=0.0.0.0" pdf_bot 8 minutes ago Up 6 minutes (healthy) 0.0.0.0:8503->8503/tcp, :::8503->8503/tcp
While the stack starts, the pull-model service logs the progress of the model download:
genai-stack-pull-model-1 | pulling 8934d96d3f08... 100% ▕▏ 3.8 GB
genai-stack-pull-model-1 | pulling 8c17c2ebb0ea... 100% ▕▏ 7.0 KB
genai-stack-pull-model-1 | pulling 7c23fb36d801... 100% ▕▏ 4.8 KB
genai-stack-pull-model-1 | pulling 2e0493f67d0c... 100% ▕▏ 59 B
genai-stack-pull-model-1 | pulling fa304d675061... 100% ▕▏ 91 B
genai-stack-pull-model-1 | pulling 42ba7f8a01dd... 100% ▕▏ 557 B
...
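The status check can be automated by reading the JSON form of the service listing. This is a hedged Python sketch: it assumes docker compose ps --format json is available in your Compose version and that each entry carries Service, State, and Health fields. Depending on the version, the output is either a JSON array or one JSON object per line, so the parser accepts both. The function names are hypothetical.

```python
import json
import subprocess


def parse_compose_ps(output):
    """Parse `docker compose ps --format json` output.

    Accepts either a single JSON array or one JSON object per line.
    """
    output = output.strip()
    if not output:
        return []
    if output.startswith("["):
        return json.loads(output)
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def unhealthy_services(containers):
    """Return service names that are not running or report 'unhealthy'."""
    bad = []
    for container in containers:
        state = container.get("State", "")
        health = container.get("Health", "")
        if state != "running" or health == "unhealthy":
            bad.append(container.get("Service", container.get("Name", "?")))
    return bad


def compose_ps(profile="linux-gpu"):
    """Run the listing for the given profile and return parsed entries."""
    result = subprocess.run(
        ["docker", "compose", "--profile", profile, "ps", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return parse_compose_ps(result.stdout)
```

Calling unhealthy_services(compose_ps()) from the genai-stack directory would flag a service such as the loader in the listing above, which reports an unhealthy status.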
Here’s what’s in this repo:
| Name | Main files | Compose name | URLs | Description |
|---|---|---|---|---|
| Support Bot | bot.py | bot | http://localhost:8501 | Main use case. Fullstack Python application. |
| Stack Overflow Loader | loader.py | loader | http://localhost:8502 | Loads Stack Overflow data into the database (creates vector embeddings, etc.). Fullstack Python application. |
| PDF Reader | pdf_bot.py | pdf_bot | http://localhost:8503 | Reads a local PDF and lets you ask it questions. Fullstack Python application. |
| Standalone Bot API | api.py | api | http://localhost:8504 | Standalone HTTP API with streaming (SSE) and non-streaming endpoints. Python. |
| Standalone Bot UI | front-end/ | front-end | http://localhost:8505 | Standalone client that uses the Standalone Bot API to interact with the model. JavaScript (Svelte) front-end. |
The database can be explored at http://localhost:7474.
Verifying if GPU is being consumed or not
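Let’s try the PDF bot sample app and see whether the GPU is being used.
Open http://HostIP:8503/ to access the PDF bot, replacing HostIP with the IP address of your host.
Upload a PDF, ask a question, and then run nvidia-smi on the host. A new entry in the Processes section of the output indicates that the GPU is being used.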
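The process check can also be scripted. The sketch below is an illustrative Python helper, not part of the GenAI Stack. It relies on the nvidia-smi query options --query-compute-apps and --format=csv,noheader,nounits, and treats a non-numeric memory value (for example "[N/A]", which some setups report) as unknown. The function names are hypothetical.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' lines into dictionaries."""
    apps = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue
        # Split from both ends so a comma inside the process name is kept.
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        mem = mem.strip()
        apps.append({
            "pid": pid.strip(),
            "name": name.strip(),
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return apps


def gpu_processes():
    """Run nvidia-smi and return the compute processes it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

If gpu_processes() returns an empty list while the PDF bot is answering a question, the model is most likely running on the CPU.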
Using Nvitop
You can also use a tool such as nvitop to monitor GPU usage in real time.