
Running OpenLLM on GPUs using PyTorch and vLLM backend in a Docker Container


OpenLLM is a powerful platform that empowers developers to leverage the potential of open-source large language models (LLMs). Think of it as a Swiss Army knife for LLMs: a set of tools that helps developers overcome the hurdles of serving and deploying these models in production.

OpenLLM supports a vast array of open-source LLMs, including popular choices like Llama 2 and Mistral. This flexibility allows developers to pick the LLM that best aligns with their specific needs. The beauty of OpenLLM is that you can fine-tune any LLM with your own data to tailor its responses to your unique domain or application.

OpenLLM adopts an API structure that mirrors OpenAI’s, making it a breeze for developers familiar with OpenAI to transition their applications to leverage open-source LLMs.
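
For example, once a model server is running on port 3000 (see Getting Started below), you can send it the same request shape you would send to OpenAI. The sketch below is illustrative: the exact endpoint path and the model name depend on your OpenLLM version and the model you started, so check the server's web UI for the routes it actually exposes.

# Query the OpenAI-style completions endpoint (path and fields assumed; verify against your server's API docs)
$ curl -X POST http://localhost:3000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-1.3b", "prompt": "Explain what OpenLLM does in one sentence.", "max_tokens": 64}'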

Is OpenLLM a standalone product?

No, it's a building block designed to integrate easily with other powerful tools. It currently offers integrations with OpenAI-compatible endpoints, LlamaIndex, LangChain, and Transformers Agents.

OpenLLM goes beyond just running large language models. It’s designed to be a versatile tool that can be integrated with other powerful AI frameworks and services. This allows you to build more complex and efficient AI applications. Here’s a breakdown of the integrations OpenLLM currently offers:

  • OpenAI Compatible Endpoints: OpenLLM exposes an API that mirrors OpenAI's, so tools and client code written for OpenAI's cloud API can be pointed at your OpenLLM models with little or no change.
  • LlamaIndex: a data framework for connecting LLMs to your own data sources. With this integration, OpenLLM models can power retrieval-augmented applications over your documents and indexes.
  • LangChain: a framework for composing LLM calls with prompts, memory, and other tools into multi-step workflows. With LangChain integration, you can combine OpenLLM's models with other components for more advanced tasks.
  • Transformers Agents: the agent API from Hugging Face's Transformers library, in which an LLM decides which tools to call. This integration lets an OpenLLM-served model act as the reasoning engine behind such agents.

By taking advantage of these integrations, you can unlock the full potential of OpenLLM and create powerful AI solutions that combine the strengths of different tools and platforms.

What problems does OpenLLM solve?

  • OpenLLM works with a bunch of different LLMs, from Llama 2 to Flan-T5. This means developers can pick the best LLM for their specific needs.
  • Deploying LLMs can be a headache, but OpenLLM streamlines the process. It’s like having a clear instruction manual for setting things up.
  • Data security is a big concern with AI. OpenLLM helps ensure that LLMs are deployed in a way that follows data protection regulations.
  • As your LLM-powered service gets more popular, you need it to handle the extra traffic. OpenLLM helps build a flexible architecture that can grow with your needs.
  • The world of AI throws around a lot of jargon. OpenLLM integrates with various AI tools and frameworks, making it easier for developers to navigate this complex ecosystem.

Blazing-Fast Performance

  • OpenLLM is meticulously designed for high-throughput serving, ensuring efficient handling of a large number of simultaneous requests.
  • OpenLLM leverages cutting-edge serving and inference techniques to deliver the fastest possible response times.

Getting Started

Running OpenLLM using PyTorch

$ docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt

Running OpenLLM using vLLM

$ docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start meta-llama/Llama-2-7b-chat-hf --backend vllm

You might encounter this issue if you try to run it on a Mac with Apple Silicon:

docker: Error response from daemon: no match for platform in manifest: not found.

That error means the image wasn't published as a multi-platform image following Docker best practices, so there is no variant compiled for Arm chips.

Let’s make it work.

Try adding the --platform=linux/amd64 parameter to the docker run command. Docker will then run the amd64 image under emulation, which works but is noticeably slower than a native image.

$ docker run --rm -it --platform=linux/amd64 -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt

You will see the following result:

latest: Pulling from bentoml/openllm
e15cf30825b5: Download complete
24756bf79e78: Download complete
8a1e25ce7c4f: Download complete
e45919fa6a04: Download complete
aeea5c3a418f: Download complete
1ac41e12d207: Download complete
1103112ebfc4: Download complete
0b5b82abb9e8: Download complete
cc7f04ac52f8: Download complete
0d2012b79227: Download complete
101d4d666844: Download complete
2310831cf643: Download complete
87b8bf94a2ac: Download complete
b4b80ef7128d: Download complete
d30c94e4bd79: Download complete
8f05d7b02a83: Download complete
5ec312985191: Download complete
a6df4f5266e9: Download complete
Digest: sha256:efef229a1167e599955464bc6053326979ffc5f96ab77b2822a46a64fd8a247e
Status: Downloaded newer image for ghcr.io/bentoml/openllm:latest
PyTorch backend is deprecated and will be removed in future releases. Make sure to use vLLM instead.
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 653/653 [00:00<00:00, 2.80MB/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 504kB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.08MB/s]
merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 718kB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 1.49MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 439kB/s]
Fetching 7 files:  29%|███████████████████████████▋                                                                     | 2/7 [00:01<00:03,  1.31it/s]
pytorch_model.bin:  55%|█

You might see these messages:

vLLM is available, but using PyTorch backend instead. Note that vLLM is a lot more performant and should always be used in production (by explicitly set --backend vllm).
🚀Tip: run 'openllm build facebook/opt-1.3b --backend pt --serialization legacy' to create a BentoLLM for 'facebook/opt-1.3b'
2024-05-04T06:01:58+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service:svc" can be accessed at http://localhost:3000/metrics.
2024-05-04T06:01:58+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
/usr/local/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()

The message indicates that you’re running OpenLLM, a tool for deploying large language models (LLMs), but it’s currently using the PyTorch backend instead of the recommended vLLM backend.

Multiple Runtime Support

Different Large Language Models (LLMs) can be implemented using various runtime environments. OpenLLM offers support for these variations.

vLLM for Speed

vLLM is a high-performance runtime specifically designed for LLMs. If a model supports vLLM, OpenLLM will prioritize it by default for faster inference.

vLLM Hardware Requirements:

Using vLLM requires a GPU with at least Ampere architecture and CUDA version 11.8 or newer. This ensures compatibility with vLLM’s optimizations.
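
A quick way to check whether a machine meets these requirements is nvidia-smi, which reports the GPU model and the highest CUDA version supported by the installed driver:

# The header shows the GPU name (it should be Ampere or newer, e.g. A100, A10G, RTX 30xx/40xx)
# and a "CUDA Version" field, which should read 11.8 or higher for vLLM.
$ nvidia-smi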

PyTorch Fallback

If vLLM isn’t available for a particular model, OpenLLM seamlessly falls back to PyTorch, a popular deep learning framework.

Manual Backend Selection

You can leverage the --backend option when starting your LLM server to explicitly choose between vLLM and PyTorch. This is useful if you want to ensure vLLM is used even if it's not the default for that model.

Discovering Backend Options: To explore the supported backend options for each LLM, refer to the OpenLLM documentation’s “Supported models” section or simply run the openllm models command to get a list of available models and their compatible backends.
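
Because the Docker image's entrypoint is the openllm CLI, you can run the same command through the container. A minimal sketch (the output format varies between releases):

# List the supported model families and the backends each one can run with
$ docker run --rm -it ghcr.io/bentoml/openllm models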

While both vLLM and PyTorch backends are available, vLLM generally offers superior performance and is recommended for production deployments.

Viewing it on the Docker Dashboard

[Screenshot: the openllm container running in the Docker Dashboard]

Checking the container stats

[Screenshot: container CPU and memory stats in the Docker Dashboard]

By now, you should be able to access the frontend at http://localhost:3000:

[Screenshot: the OpenLLM web UI served at http://localhost:3000]
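
If you prefer the command line, you can also confirm the server is up before sending prompts. OpenLLM serves models through BentoML, whose HTTP servers expose standard health endpoints; if your version differs, the web UI at http://localhost:3000 lists the available routes.

# Liveness and readiness checks exposed by the BentoServer
$ curl -i http://localhost:3000/healthz
$ curl -i http://localhost:3000/readyz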

Using GPU

OpenLLM allows you to start your model server on multiple GPUs and to specify the number of workers per resource assigned using the --workers-per-resource option. For example, if you have 4 available GPUs, set the value to 0.25 (one divided by the number of GPUs) so that only one instance of the Runner server is spawned and it is assigned all four GPUs.
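
As a sketch, with four GPUs on the host that looks like the command below. The flag placement is illustrative; run the image with start --help to confirm the option name and accepted values in your version.

# One Runner instance spread across 4 GPUs (0.25 workers per GPU)
$ docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm \
    start HuggingFaceH4/zephyr-7b-beta --backend vllm --workers-per-resource 0.25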

The number of GPUs required depends on the size of the model. You can use the Model Memory Calculator from Hugging Face to estimate how much VRAM is needed to train or run inference on a model, and then plan your GPU strategy based on that.

Provided you have access to GPUs and have set up the NVIDIA Container Toolkit (nvidia-docker), you can additionally pass --gpus to use the GPU for faster inference:

docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm

Quantization

Quantization is a technique for making machine learning models smaller and faster, especially at inference time. It works by converting the numbers the model uses (typically 16- or 32-bit floating-point weights) into smaller representations, often 8-bit or 4-bit integers (quantized values).

Benefits of Quantization:

  • Faster computations: integer arithmetic is simpler than floating-point arithmetic, leading to faster model execution.
  • Reduced memory footprint: smaller numbers require less storage space, making the model lighter and easier to deploy on devices with limited memory.
  • Deployment on resource-constrained devices: by reducing size and computation needs, quantization allows large models to run on devices with less power or processing capability.
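
To put rough numbers on the memory savings: a 7B-parameter model stored as 16-bit floats needs about 7B × 2 bytes ≈ 14 GB for the weights alone, while the same model quantized to 8-bit integers needs about 7 GB, and 4-bit about 3.5 GB. These are back-of-the-envelope estimates that ignore activation and KV-cache memory, but they show why quantization matters on smaller GPUs.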

OpenLLM’s Supported Quantization Techniques:

OpenLLM offers several quantization techniques to optimize your LLM for performance and resource usage:

  • LLM.int8(): 8-bit integer matrix multiplication, implemented via the bitsandbytes library. It shrinks the weights to 8-bit integers while keeping the core matrix-multiply operations of the model largely intact.
  • SpQR: a Sparse-Quantized Representation for near-lossless compression of LLM weights, also via bitsandbytes. It reduces model size while maintaining accuracy.
  • AWQ: Activation-aware Weight Quantization. It takes activation statistics into account when deciding how to quantize the weights, which can preserve accuracy better than weight-only approaches.
  • GPTQ: Accurate Post-Training Quantization. It quantizes an already-trained model while minimizing the resulting loss in accuracy.
  • SqueezeLLM: combines dense and sparse quantization for potentially even greater reduction in model size.

Overall, by understanding quantization and the specific techniques offered by OpenLLM, you can optimize your large language models for deployment on various platforms and resource constraints.
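
In practice, you enable quantization when starting the model server. The flag shown below (--quantize) reflects recent OpenLLM releases and is an assumption here; check start --help in your image for the exact flag name and the values supported by your model and backend.

# Serve a model with 8-bit (LLM.int8) weights to cut GPU memory usage (--quantize is the assumed flag name)
$ docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm \
    start facebook/opt-1.3b --quantize int8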

Have Queries? Join https://launchpass.com/collabnix

Ajeet Singh Raina is a former Docker Captain, Community Leader and Arm Ambassador. He is the founder of the Collabnix blogging site and has authored more than 570 blog posts on Docker, Kubernetes and cloud-native technology. He runs a community Slack of 8,900+ members and a Discord server of close to 2,200 members. You can follow him on Twitter (@ajeetsraina).