NVIDIA NIM (NVIDIA Inference Microservices) provides developers with an efficient way to deploy optimized AI models from various sources, including community partners and NVIDIA itself. As part of the NVIDIA AI Enterprise suite, NIM offers a streamlined path to quickly iterate on and build innovative generative AI solutions.
With NIM, you can easily deploy a microservice container in under 5 minutes on NVIDIA GPU systems in the cloud or data center, or on workstations and PCs. Alternatively, you can start prototyping applications using NIM APIs from the
NVIDIA API Catalog without deploying containers.
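The hosted endpoints in the API Catalog follow the OpenAI API specification, so you can prototype with the standard OpenAI Python client before standing up any infrastructure. The sketch below assumes your API Catalog key is stored in an NVIDIA_API_KEY environment variable, that the catalog's OpenAI-compatible endpoint is https://integrate.api.nvidia.com/v1, and that meta/llama3-8b-instruct (the model used throughout this post) is available there.
import os
from openai import OpenAI

# Point the OpenAI client at the NVIDIA API Catalog's hosted endpoint
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # key generated from the API Catalog
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)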
Key Features
Key features of NIM include:
- Pre-built containers that deploy with a single command on NVIDIA-accelerated infrastructure anywhere
- Maintained security and control over your enterprise data
- Support for fine-tuned models using techniques like LoRA
- Integration with accelerated AI inference endpoints through consistent, industry-standard APIs
- Compatibility with popular generative AI application frameworks like LangChain, LlamaIndex, and Haystack
How to Deploy NIM in 5 Minutes
To deploy NIM, you need either an NVIDIA AI Enterprise license or an NVIDIA Developer Program membership. The fastest way to obtain either is to visit the
NVIDIA API Catalog and choose “Get API key” from a model page (for example, Llama 3.1 405B). Then enter a business email address to access NIM with a 90-day NVIDIA AI Enterprise license, or a personal email address to access NIM through the NVIDIA Developer Program membership.
Deployment Script
# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct
# Define the vendor name for the LLM
export VENDOR_NAME=meta
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/${VENDOR_NAME}/${CONTAINER_NAME}:1.0.0"
# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
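# Log Docker in to the NGC registry before starting the container
# (assumes the API key from the previous step is exported as NGC_API_KEY)
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin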
# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
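On first launch, the container may take several minutes to download and cache the model weights. One way to confirm that the microservice is ready is to poll its OpenAI-compatible model listing endpoint; the following is a minimal sketch using the requests library and assumes the service is reachable on port 8000 as configured above.
import time
import requests

# Poll the OpenAI-compatible /v1/models endpoint until the NIM microservice responds
while True:
    try:
        response = requests.get("http://0.0.0.0:8000/v1/models", timeout=5)
        if response.ok:
            print("Available models:", [m["id"] for m in response.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass  # container is still starting up
    time.sleep(10)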
Test an Inference Request
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama3-8b-instruct",
"prompt": "Once upon a time",
"max_tokens": 64
}'
Integrating NIM with Your Applications
Start with a completions curl request that follows the OpenAI specification. To stream outputs, set "stream": true in the request body (or stream=True in Python). When using the OpenAI Python library against a self-hosted NIM microservice, no real API key is required; the client still expects a value, so pass a placeholder string.
from openai import OpenAI

# Point the client at the locally deployed NIM microservice
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="no-key-required"  # placeholder; the local microservice does not validate it
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True
)

# Print the streamed tokens as they arrive
for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
NIM is also integrated into application frameworks like Haystack, LangChain, and LlamaIndex, bringing secure, reliable, accelerated model inferencing to developers building generative AI applications with these popular tools.
Using NIM Microservices in Python with LangChain
from langchain_nvidia_ai_endpoints import ChatNVIDIA
llm = ChatNVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct", temperature=0.5, max_tokens=1024, top_p=1)
result = llm.invoke("What is a GPU?")
print(result.content)
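A similar pattern works for LlamaIndex. The sketch below is a rough illustration, assuming the llama-index-llms-nvidia connector package and that its NVIDIA class accepts the same base_url and model parameters as the LangChain integration; check the LlamaIndex documentation for the exact interface.
# Hypothetical usage of the LlamaIndex NVIDIA connector against a local NIM
# (package and parameter names are assumptions; verify against the LlamaIndex docs)
from llama_index.llms.nvidia import NVIDIA

llm = NVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct")
response = llm.complete("What is a GPU?")
print(response.text)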
For more information about using NIM, see the framework notebooks provided by NVIDIA.
Using NIM Hugging Face Endpoints
You can also integrate a dedicated NIM endpoint directly on Hugging Face. Hugging Face spins up instances on your preferred cloud, deploys the NVIDIA-optimized model, and enables you to start inference with just a few clicks. Navigate to the model page on Hugging Face and create a dedicated endpoint using your preferred CSP. For more information and a step-by-step guide, see the NVIDIA-Hugging Face collaboration for simplified generative AI model deployments.
Customizing NIM with LoRA
To get more from NIM, you can serve LLMs that have been customized with LoRA adapters. NIM supports LoRA adapters trained with either Hugging Face PEFT or NVIDIA NeMo. Store the adapters in the directory referenced by $LOCAL_PEFT_DIRECTORY and start the microservice with a script similar to the one used for the base container:
# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct
# Define the vendor name for the LLM
export VENDOR_NAME=meta
# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/${VENDOR_NAME}/${CONTAINER_NAME}:1.0.0"
# Choose a path on your system to store the LoRA adapters
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir -p "$LOCAL_PEFT_DIRECTORY"
# Path inside the container where the adapters will be mounted
export NIM_PEFT_SOURCE=/home/nvs/loras
# Download a NeMo-format LoRA into the adapter directory. You can also download Hugging Face PEFT LoRAs
(cd "$LOCAL_PEFT_DIRECTORY" && ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-math-v1")
# Start the LLM NIM microservice
docker run -it --rm --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus all \
-e NGC_API_KEY \
-e NIM_PEFT_SOURCE \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
-u $(id -u) \
-p 8000:8000 \
$IMG_NAME
You can then send inference requests using the name of one of the LoRA adapters stored in $LOCAL_PEFT_DIRECTORY:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "llama3-8b-instruct-lora_vhf-math-v1",
"prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
"max_tokens": 128
}'
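Deployed LoRA adapters are served alongside the base model, so if you are unsure which adapter names are available, you can list the served models. This is a minimal sketch using the requests library and assumes the loaded adapters appear in the OpenAI-compatible /v1/models listing.
import requests

# List the models the NIM microservice is currently serving,
# which should include the base model and any loaded LoRA adapters
models = requests.get("http://0.0.0.0:8000/v1/models").json()
for model in models["data"]:
    print(model["id"])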