NVIDIA NIM (NVIDIA Inference Microservices) provides developers with an efficient way to deploy optimized AI models from various sources, including community partners and NVIDIA itself. As part of the NVIDIA AI Enterprise suite, NIM offers a streamlined path to quickly iterate on and build innovative generative AI solutions.
With NIM, you can easily deploy a microservice container in under 5 minutes on NVIDIA GPU systems in the cloud or data center, or on workstations and PCs. Alternatively, you can start prototyping applications using NIM APIs from the
NVIDIA API Catalog without deploying containers.
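The hosted endpoints in the API Catalog follow the OpenAI API specification, so you can prototype with the standard OpenAI Python client before standing up any infrastructure. The sketch below assumes your API Catalog key is stored in an NVIDIA_API_KEY environment variable, that the catalog's OpenAI-compatible endpoint is https://integrate.api.nvidia.com/v1, and that meta/llama3-8b-instruct (the model used throughout this post) is available there.
import os
from openai import OpenAI

# Point the OpenAI client at the NVIDIA API Catalog's hosted endpoint
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # key generated from the API Catalog
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)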
Key Features
Key features of NIM include:
- Pre-built containers that deploy with a single command on NVIDIA-accelerated infrastructure anywhere
- Maintained security and control over your enterprise data
- Support for fine-tuned models using techniques like LoRA
- Integration with accelerated AI inference endpoints through consistent, industry-standard APIs
- Compatibility with popular generative AI application frameworks like LangChain, LlamaIndex, and Haystack
How to Deploy NIM in 5 Minutes
To deploy NIM, you need either an NVIDIA AI Enterprise license or an NVIDIA Developer Program membership. The fastest way to obtain either is to visit the
NVIDIA API Catalog and choose “Get API key” from a model page (for example, Llama 3.1 405B). Then enter a business email address to access NIM with a 90-day NVIDIA AI Enterprise license, or a personal email address to access NIM through the NVIDIA Developer Program membership.
Deployment Script
# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct
# Define the vendor name for the LLM
export VENDOR_NAME=meta
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/${VENDOR_NAME}/${CONTAINER_NAME}:1.0.0"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
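# Log Docker in to the NGC registry before starting the container
# (assumes the API key from the previous step is exported as NGC_API_KEY)
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin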
# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
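On first launch, the container may take several minutes to download and cache the model weights. One way to confirm that the microservice is ready is to poll its OpenAI-compatible model listing endpoint; the following is a minimal sketch using the requests library and assumes the service is reachable on port 8000 as configured above.
import time
import requests

# Poll the OpenAI-compatible /v1/models endpoint until the NIM microservice responds
while True:
    try:
        response = requests.get("http://0.0.0.0:8000/v1/models", timeout=5)
        if response.ok:
            print("Available models:", [m["id"] for m in response.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass  # container is still starting up
    time.sleep(10)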
Test an Inference Request
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama3-8b-instruct",
"prompt": "Once upon a time",
"max_tokens": 64
}'
Integrating NIM with Your Applications
Start with a completions curl request that follows the OpenAI specification. To stream outputs, set "stream": true in the request body (or stream=True in Python). When using the OpenAI Python library against a self-hosted NIM microservice, no real API key is required; the client still expects a value, so pass a placeholder string.
from openai import OpenAI

# Point the client at the locally deployed NIM microservice
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="no-key-required"  # placeholder; the local microservice does not validate it
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True
)

# Print the streamed tokens as they arrive
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
NIM is also integrated into application frameworks like Haystack, LangChain, and LlamaIndex, bringing secure, reliable, accelerated model inferencing to developers building generative AI applications with these popular tools.
Using NIM Microservices in Python with LangChain
from langchain_nvidia_ai_endpoints import ChatNVIDIA
llm = ChatNVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct", temperature=0.5, max_tokens=1024, top_p=1)
result = llm.invoke("What is a GPU?")
print(result.content)
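A similar pattern works for LlamaIndex. The sketch below is a rough illustration, assuming the llama-index-llms-nvidia connector package and that its NVIDIA class accepts the same base_url and model parameters as the LangChain integration; check the LlamaIndex documentation for the exact interface.
# Hypothetical usage of the LlamaIndex NVIDIA connector against a local NIM
# (package and parameter names are assumptions; verify against the LlamaIndex docs)
from llama_index.llms.nvidia import NVIDIA

llm = NVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct")
response = llm.complete("What is a GPU?")
print(response.text)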
For more information about using NIM, see the framework notebooks provided by NVIDIA.
Using NIM Hugging Face Endpoints
You can also integrate a dedicated NIM endpoint directly on Hugging Face. Hugging Face spins up instances on your preferred cloud, deploys the NVIDIA-optimized model, and enables you to start inference with just a few clicks. Navigate to the model page on Hugging Face and create a dedicated endpoint using your preferred CSP. For more information and a step-by-step guide, see the NVIDIA-Hugging Face collaboration for simplified generative AI model deployments.
Customizing NIM with LoRA
To get more from NIM, you can serve LLMs that have been customized with LoRA adapters. NIM supports LoRA adapters trained with either Hugging Face PEFT or NVIDIA NeMo. Store the adapters in the directory referenced by $LOCAL_PEFT_DIRECTORY and start the microservice with a script similar to the one used for the base container:
# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct
# Define the vendor name for the LLM
export VENDOR_NAME=meta
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/${VENDOR_NAME}/${CONTAINER_NAME}:1.0.0"
# Choose a path on your system to store the LoRA adapters
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir -p "$LOCAL_PEFT_DIRECTORY"
# Path inside the container where the adapters will be mounted
export NIM_PEFT_SOURCE=/home/nvs/loras
# Download a NeMo-format LoRA into the adapter directory. You can also download Hugging Face PEFT LoRAs
(cd "$LOCAL_PEFT_DIRECTORY" && ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-math-v1")
# Start the LLM NIM microservice
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
-e NGC_API_KEY \
-e NIM_PEFT_SOURCE \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
You can then send inference requests using the name of one of the LoRA adapters stored in $LOCAL_PEFT_DIRECTORY:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
"max_tokens": 128
}'
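Deployed LoRA adapters are served alongside the base model, so if you are unsure which adapter names are available, you can list the served models. This is a minimal sketch using the requests library and assumes the loaded adapters appear in the OpenAI-compatible /v1/models listing.
import requests

# List the models the NIM microservice is currently serving,
# which should include the base model and any loaded LoRA adapters
models = requests.get("http://0.0.0.0:8000/v1/models").json()
for model in models["data"]:
    print(model["id"])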