
Deploying NVIDIA NIM for Generative AI Applications


NVIDIA NIM (NVIDIA Inference Microservices) provides developers with an efficient way to deploy optimized AI models from a variety of sources, including community partners and NVIDIA itself. As part of the NVIDIA AI Enterprise suite, NIM offers a streamlined path to quickly iterate on and build innovative generative AI solutions.

With NIM, you can easily deploy a microservice container in under 5 minutes on NVIDIA GPU systems in the cloud or data center, or on workstations and PCs. Alternatively, you can start prototyping applications using NIM APIs from the NVIDIA API Catalog without deploying containers.
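
If you start with the API Catalog route, the hosted endpoints can be called directly over HTTPS. The sketch below assumes you have already generated an API key from the catalog and exported it as NVIDIA_API_KEY (the variable name is just a convention for this example); the hosted endpoints follow the OpenAI chat completions schema:

# Assumes a key generated from the NVIDIA API Catalog is exported as NVIDIA_API_KEY
curl -X 'POST' \
  'https://integrate.api.nvidia.com/v1/chat/completions' \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "What is a GPU?"}],
    "max_tokens": 64
  }'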

Key Features

Key features of NIM include:

  • Pre-built containers that deploy with a single command on NVIDIA-accelerated infrastructure anywhere
  • Maintained security and control over your enterprise data
  • Support for fine-tuned models using techniques like LoRA
  • Integration with accelerated AI inference endpoints through consistent, industry-standard APIs
  • Compatibility with popular generative AI application frameworks like LangChain, LlamaIndex, and Haystack

How to Deploy NIM in 5 Minutes

To deploy NIM, you need either an NVIDIA AI Enterprise license or an NVIDIA Developer Program membership. The fastest way to obtain either is to visit the NVIDIA API Catalog and choose "Get API Key" on a model page (e.g., Llama 3.1 405B). Enter a business email address to access NIM with a 90-day NVIDIA AI Enterprise license, or a personal email address to access NIM through the NVIDIA Developer Program membership.
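
The deployment script below assumes an NGC API key is exported as NGC_API_KEY and that Docker is logged in to the nvcr.io registry where the NIM containers live. A typical setup looks like the following sketch; the key value is a placeholder, and the username $oauthtoken is a literal string expected by NGC:

# Export the key (placeholder value) so it can be passed into the container
export NGC_API_KEY=<PASTE_API_KEY_HERE>

# Log Docker in to the NVIDIA container registry
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin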


Deployment Script


# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct
 
# Define the vendor name for the LLM
export VENDOR_NAME=meta
 
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/${VENDOR_NAME}/${CONTAINER_NAME}:1.0.0"
 
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
 
# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
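
The first start pulls the model weights into the cache directory, so it can take a few minutes. Once the logs show the server is up, you can verify it from another shell; recent NIM LLM containers expose a readiness probe, and the OpenAI-compatible /v1/models route lists what is being served (assuming the default port mapping above):

# Readiness probe (available in recent NIM LLM containers)
curl http://0.0.0.0:8000/v1/health/ready

# List the models the microservice is serving
curl http://0.0.0.0:8000/v1/models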

Test an Inference Request


curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'
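
The same endpoint can also stream tokens back as they are generated; a minimal variation of the request above (assuming the container is still listening on port 8000) simply adds "stream": true to the body:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64,
    "stream": true
  }'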

Integrating NIM with Your Applications

Start with a completions curl request that follows the OpenAI specification, as shown above. To stream outputs, set "stream" to true in the request body. When using Python with the OpenAI library, you don't need a real API key when talking to a NIM microservice; the client only requires a placeholder value.


from openai import OpenAI

# Point the OpenAI client at the local NIM microservice;
# any placeholder string works as the API key
client = OpenAI(
  base_url="http://0.0.0.0:8000/v1",
  api_key="no-key-required"
)

completion = client.chat.completions.create(
  model="meta/llama3-8b-instruct",
  messages=[{"role": "user", "content": "What is a GPU?"}],
  temperature=0.5,
  top_p=1,
  max_tokens=1024,
  stream=True
)

# Print streamed tokens as they arrive
for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")

NIM is also integrated into application frameworks like Haystack, LangChain, and LlamaIndex, bringing secure, reliable, accelerated model inferencing to developers building generative AI applications with these popular tools.

Using NIM Microservices in Python with LangChain


from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Point ChatNVIDIA at the locally running NIM microservice
llm = ChatNVIDIA(
    base_url="http://0.0.0.0:8000/v1",
    model="meta/llama3-8b-instruct",
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
)

result = llm.invoke("What is a GPU?")
print(result.content)

For more information about using NIM, see the framework notebooks provided by NVIDIA.

Using NIM Hugging Face Endpoints

You can also integrate a dedicated NIM endpoint directly on Hugging Face. Hugging Face spins up instances on your preferred cloud, deploys the NVIDIA-optimized model, and enables you to start inference with just a few clicks. Navigate to the model page on Hugging Face and create a dedicated endpoint using your preferred CSP. For more information and a step-by-step guide, see the NVIDIA-Hugging Face collaboration for simplified generative AI model deployments.
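
As a rough sketch of what calling such an endpoint looks like, the request below assumes the dedicated endpoint exposes the same OpenAI-compatible routes as a locally deployed NIM and authenticates with your Hugging Face token; the endpoint URL and token are placeholders you substitute from the endpoint's page:

# Placeholder values from your Hugging Face dedicated endpoint
export HF_ENDPOINT_URL=<YOUR_DEDICATED_ENDPOINT_URL>
export HF_TOKEN=<YOUR_HF_TOKEN>

curl -X 'POST' \
  "$HF_ENDPOINT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "What is a GPU?"}],
    "max_tokens": 64
  }'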


Customizing NIM with LoRA

To get more from NIM, learn how to use the microservices with LLMs customized with LoRA adapters. NIM supports LoRA adapters trained with either Hugging Face PEFT or NVIDIA NeMo. Store the LoRA adapters in $LOCAL_PEFT_DIRECTORY and serve them using a script similar to the one used for the base container:


# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct

# Define the vendor name for the LLM
export VENDOR_NAME=meta

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/${VENDOR_NAME}/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to store the LoRA adapters
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir -p $LOCAL_PEFT_DIRECTORY

# Path inside the container where the adapters will be mounted
# (any path works as long as it matches the -v mount below)
export NIM_PEFT_SOURCE=/home/nvs/loras

# Download a NeMo-format LoRA into that directory. You can also use HuggingFace PEFT LoRAs
cd $LOCAL_PEFT_DIRECTORY
ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-math-v1"

# Start the LLM NIM microservice
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_PEFT_SOURCE \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
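
Once the microservice is up, you can check which adapters it picked up; the OpenAI-compatible /v1/models route lists the base model along with any LoRA adapters found in the mounted directory (again assuming the default port mapping):

# The response should include the base model and the loaded LoRA adapters
curl http://0.0.0.0:8000/v1/models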

You can then send inference requests using the name of one of the LoRA adapters stored in $LOCAL_PEFT_DIRECTORY as the model:


curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
"max_tokens": 128
}'

Have Queries? Join https://launchpass.com/collabnix

Adesoji Alu brings a proven ability to apply machine learning (ML) and data science techniques to solve real-world problems. He has experience working with a variety of cloud platforms, including AWS, Azure, and Google Cloud Platform, and has strong skills in software engineering, data science, and machine learning. He is passionate about using technology to make a positive impact on the world.