
Running LLMs with TensorRT-LLM on NVIDIA Jetson Orin Nano Super


TensorRT-LLM is essentially a specialized tool that makes large language models (like ChatGPT) run much faster on NVIDIA hardware. Think of it this way: If a regular language model is like a car engine that can get you from point A to point B, TensorRT-LLM is like a high-performance tuning kit that makes that same engine much more efficient and powerful.

TensorRT-LLM is an open-source library built on NVIDIA’s TensorRT framework specifically designed to optimize the inference performance of large language models. It provides a streamlined path to deploy state-of-the-art LLMs on NVIDIA GPUs with maximum efficiency.

How It Works

In simple terms:

  • It takes AI language models that would normally be slow to respond and optimizes them to run much faster.
  • It’s specifically designed for NVIDIA hardware like your Jetson Orin, helping it punch above its weight when running complex AI models.
  • It works by restructuring the model to eliminate inefficiencies, similar to how a mechanic might tune a car engine to get better performance without changing the basic design.
  • The end result is that AI models that might be too slow to be practical on your device can now run with reasonable response times.

Key Features and Benefits

1. Optimized Inference Performance

TensorRT-LLM dramatically reduces inference latency while increasing throughput, allowing you to serve more requests with the same hardware. This is achieved through:

  • Kernel fusion: Combining multiple operations into a single GPU kernel
  • Precision calibration: Supporting various precision modes (FP32, FP16, INT8, INT4)
  • Memory optimization: Reducing memory footprint for more efficient GPU utilization (see the memory sketch after this list)
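
To see why the lower-precision modes matter on a board like the Orin Nano with 8 GB of shared memory, here is a rough back-of-the-envelope calculation (a plain-Python sketch that counts only the weights and ignores activations, the KV cache, and runtime overhead):

# Rough weight-memory estimate for a 7B-parameter model at different precisions.
# Weights only -- activations, KV cache, and runtime overhead are ignored.
PARAMS = 7_000_000_000

bytes_per_weight = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_weight.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB of weights")

# FP32: ~26.1 GiB   FP16: ~13.0 GiB   INT8: ~6.5 GiB   INT4: ~3.3 GiB
# Only the INT4 variant leaves comfortable headroom on an 8 GB Orin Nano.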

2. Support for Popular LLM Architectures

The framework provides out-of-the-box support for many widely-used language model architectures:

  • GPT-style models (GPT-2, GPT-J, GPT-NeoX)
  • LLaMA and LLaMA 2
  • Mistral
  • Falcon
  • BERT and variants
  • And many others

3. Advanced Deployment Capabilities

TensorRT-LLM enables sophisticated deployment scenarios with features like:

  • Multi-GPU and multi-node inference: Scale your models across multiple GPUs and nodes
  • Dynamic batching: Optimize throughput by efficiently batching incoming requests
  • Continuous batching: Process requests without waiting for batch formation
  • Tensor parallelism: Split model layers across multiple GPUs for larger models (see the sketch after this list)
  • Pipeline parallelism: Process different layers of the model on different GPUs
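
A single Orin Nano won't use the multi-GPU features, but the idea behind tensor parallelism is easy to see in a few lines of PyTorch: one layer's weight matrix is split column-wise across devices, each shard computes its slice, and the partial outputs are concatenated. The sketch below simulates both "GPUs" on one device, and the shapes and names are only for illustration:

# Toy illustration of tensor (column) parallelism on a single linear layer.
import torch

torch.manual_seed(0)
hidden, out_features = 16, 8
x = torch.randn(2, hidden)            # a batch of 2 token embeddings
W = torch.randn(hidden, out_features)

# Full (single-device) result for reference
y_full = x @ W

# Split the weight columns across two shards
W0, W1 = W[:, : out_features // 2], W[:, out_features // 2 :]
y0 = x @ W0                            # would run on GPU 0
y1 = x @ W1                            # would run on GPU 1
y_parallel = torch.cat([y0, y1], dim=-1)  # gather the partial outputs

print(torch.allclose(y_full, y_parallel))  # True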

4. Production-Ready Optimizations

The library includes numerous production-oriented features:

  • In-flight batching: Process requests as they arrive without waiting for batch completion (illustrated in the sketch after this list)
  • Attention optimizations: Specialized kernels for different attention mechanisms
  • Custom CUDA kernels: Highly optimized operations specific to LLM inference
  • Streaming output generation: Generate tokens progressively rather than all at once
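
The in-flight/continuous batching idea above can be conveyed with a toy scheduler in plain Python (no TensorRT-LLM APIs; the request counts and step costs are made up): finished sequences leave the batch immediately, and queued requests take the freed slots instead of waiting for the whole batch to drain.

import random
from collections import deque

random.seed(0)

# Each request needs a random number of decode steps to finish.
queue = deque({"id": i, "remaining": random.randint(3, 10)} for i in range(8))
MAX_BATCH = 4
active, step = [], 0

while queue or active:
    # In-flight batching: admit new requests whenever a slot is free,
    # instead of waiting for the current batch to finish completely.
    while queue and len(active) < MAX_BATCH:
        active.append(queue.popleft())

    step += 1
    for req in active:
        req["remaining"] -= 1   # one decode iteration for every active sequence

    finished = [r["id"] for r in active if r["remaining"] == 0]
    active = [r for r in active if r["remaining"] > 0]
    if finished:
        print(f"step {step}: finished {finished}, active {[r['id'] for r in active]}")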

For your Jetson Orin setup, TensorRT-LLM lets you run sophisticated language models locally that would otherwise be too demanding for the hardware, enabling applications such as real-time AI assistants and on-device language processing without a cloud connection.

Setting up TensorRT-LLM on the Jetson Orin Nano Super

For your Jetson Orin, TensorRT-LLM is particularly valuable because it helps the relatively small but capable GPU in the Orin handle larger models that would otherwise be too demanding. It’s like having a custom-tuned engine that extracts maximum performance from the available hardware.

Hardware Setup

  • NVIDIA Jetson Orin Nano Super
  • DisplayPort
  • 1x DC Jack
  • Wireless Network Adapter
  • NVMe SSD

I recommend an NVMe SSD for storage speed and space:

  • 18.5 GB for the tensorrt_llm container image
  • Space for models (>10 GB)

Follow this guide to get started with the initial setup and flash the board using the SDK Manager. Once the initial setup is complete and the OS has booted, proceed with the steps below.

Cloning the repository

This repository contains Docker containers tailored for NVIDIA’s Jetson platform, which is a series of embedded computing devices designed for edge AI applications. The repository provides:

  • Deployment solutions that make it easier to move AI applications from development to production on edge devices
  • Pre-built Docker containers optimized for the Jetson hardware architecture (ARM64-based)
  • Ready-to-use environments for AI and machine learning development on edge devices, eliminating complex setup processes
  • Performance-optimized versions of popular frameworks like TensorFlow, PyTorch, and other ML libraries that take advantage of Jetson’s GPU capabilities
  • Application containers for computer vision, robotics, and IoT use cases commonly deployed on Jetson devices
Clone the repository and run the installer:

git clone https://github.com/dusty-nv/jetson-containers
jetson-containers/install.sh

The first part of the install.sh script prepares your system to work with Jetson containers. It does this by making sure you have the right tools installed. It first establishes some ground rules for how the script should run and figures out where it’s located on your system. Then it checks if you have Python’s package manager (pip) installed, and if not, it installs it for you. Finally, it installs all the necessary Python packages that the Jetson containers management tools need to function properly.

The second part makes the Jetson container tools easy to use from anywhere on your system. It creates shortcuts (technically called symbolic links) to the main scripts in a location that your system already knows to look for commands (/usr/local/bin). This includes the “autotag” script, which helps manage container tags, and the main “jetson-containers” script that you’ll use to work with the containers. After running this installation script, you’ll be able to type “jetson-containers” or “autotag” from any directory in your terminal, and your system will know exactly what you mean – no need to remember or type the full path to those scripts anymore.

Running convert_model.sh

This script is used to convert a Llama 2 model for optimized use with TensorRT-LLM. The script converts the Llama-2-7B-Chat model from its GPTQ-quantized format (a compressed version) into a format optimized for TensorRT-LLM, which will allow it to run efficiently on NVIDIA hardware.

./convert_model.sh --help
V4L2_DEVICES:
+ sudo docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/ajeetraina/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 --env HUGGINGFACE_TOKEN=hf_JFsoIvGXXXXXXgxpFsLtLz --name jetson_container_20250309_183404 -e HUGGINGFACE_TOKEN=hf_JFsoIvGXXXXXXgxpFsLtLz -e FORCE_BUILD=on dustynv/tensorrt_llm:0.12-r36.4.0 python3 /opt/TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir /data/models/huggingface/models--TheBloke--Llama-2-7B-Chat-GPTQ/snapshots/d5ad9310836dd91b6ac6133e2e47f47394386cea --output_dir /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq --dtype float16 --quant_ckpt_path /data/models/huggingface/models--TheBloke--Llama-2-7B-Chat-GPTQ/snapshots/d5ad9310836dd91b6ac6133e2e47f47394386cea/model.safetensors --use_weight_only --weight_only_precision int4_gptq --group_size 128 --per_group
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
0.12.0
550it [01:58,  4.62it/s]

How It Works

The script uses the “jetson-containers run” command to start a container with TensorRT-LLM installed, and then runs a Python conversion script inside it.

First, it sets up:

  • Sets a Hugging Face authentication token for accessing gated models
  • Forces a rebuild of the container if needed
  • Pins a specific version of the TensorRT-LLM container (0.12)

Then it runs a conversion script that:

  • Takes the Llama 2 model from its source location
  • Specifies an output directory for the converted model
  • Sets the model to use float16 precision for calculations
  • Configures special quantization settings:
    • Uses “weight-only” quantization (a technique to reduce model size)
    • Specifically uses int4 GPTQ precision (4-bit integers instead of larger formats)
    • Groups weights in clusters of 128
    • Applies the quantization per group

This conversion process significantly reduces the model’s size and optimizes it to run efficiently on NVIDIA hardware while maintaining most of its performance capabilities.
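
A minimal sketch of what per-group, weight-only int4 quantization does (plain PyTorch round-to-nearest; the real GPTQ algorithm additionally compensates for quantization error layer by layer): weights are split into groups of 128, each group gets its own scale, and the values are stored as 4-bit integers that are dequantized on the fly at inference time.

import torch

torch.manual_seed(0)
GROUP_SIZE = 128
w = torch.randn(4096)                      # one row of a weight matrix
groups = w.view(-1, GROUP_SIZE)            # 32 groups of 128 weights

# One scale per group: map the group's max magnitude onto the int4 range [-8, 7]
scales = groups.abs().amax(dim=1, keepdim=True) / 7.0
q = torch.clamp(torch.round(groups / scales), -8, 7)   # 4-bit integer codes

# Dequantize at inference time: int4 code * per-group scale
w_hat = (q * scales).view(-1)
print("mean abs error:", (w - w_hat).abs().mean().item())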

Note: The script convert_model.sh has hardcoded values for the Llama-2-7B model and doesn’t accept command line arguments. It’s just a wrapper for a specific conversion command.

Converting the Microsoft Phi-2 language model to TensorRT-LLM

The convert_phi.sh script is a custom shell script I created to help you convert the Microsoft Phi-2 language model to TensorRT-LLM format for running on your Jetson Orin Nano.

It’s essentially a wrapper that:

  1. Uses the jetson-containers tool to run the TensorRT-LLM container
  2. Calls the official Phi model conversion script that’s included in TensorRT-LLM
  3. Configures the conversion with appropriate parameters for your Jetson hardware (like INT4 quantization)

Let’s go ahead and create the script called convert_phi.sh. You will need a HUGGINGFACE_TOKEN for this script to work.

#!/bin/bash
# Convert microsoft/phi-2 into a TensorRT-LLM checkpoint using the official
# Phi example script shipped inside the tensorrt_llm container.
jetson-containers run \
  -e HUGGINGFACE_TOKEN=hf_JFsoIvGXXXXXXgxpFsLtLz \
  -e FORCE_BUILD=on \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  bash -c '
cd /opt/TensorRT-LLM/examples/phi && python3 convert_checkpoint.py \
  --model_dir microsoft/phi-2 \
  --output_dir /data/models/tensorrt_llm/phi-2 \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4
'

This script uses the jetson-containers environment to download Microsoft’s Phi-2 language model and convert it into a TensorRT-LLM checkpoint optimized for Jetson hardware.

It automates the process of:

  1. Downloading the Phi-2 model from Hugging Face
  2. Converting it to TensorRT-LLM’s checkpoint format with int4 weight-only quantization
  3. Saving the converted weights and config under /data/models/tensorrt_llm/phi-2

Setup and Environment

  • Uses a specific TensorRT-LLM container (version 0.12)
  • Authenticates with Hugging Face using a token
  • Forces a container build if needed

Model Processing

  • Downloads Microsoft’s Phi-2 model, a smaller but capable language model
  • Converts the model to half-precision (float16) to reduce memory requirements
  • Applies int4 weight-only quantization (4 bits per weight) for aggressive compression
  • Runs inside the Jetson-optimized tensorrt_llm container
  • Writes a TensorRT-LLM checkpoint that can later be built into an engine for this hardware

Testing

  • The conversion step produces only a checkpoint; no engine is built and no inference runs at this stage
  • Later in this walkthrough we verify the model with the prompt “What is a Jetson Orin?” using direct PyTorch inference

This script essentially takes a standard language model and optimizes it heavily for running efficiently on resource-constrained Jetson edge devices, allowing AI capabilities in scenarios where a full-sized model wouldn’t fit or would run too slowly.

Make the script executable:

chmod +x convert_phi.sh

Run it:

./convert_phi.sh

This script will:

  • Download the Phi-2 model (2.7B parameters)
  • Convert it to a TensorRT-LLM checkpoint with int4 weight-only quantization
  • Store the converted weights under /data/models/tensorrt_llm/phi-2
./convert_phi.sh
V4L2_DEVICES:
+ sudo docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/ajeetraina/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 --name jetson_container_20250310_102454 -e HUGGINGFACE_TOKEN=hf_JFsoIvGXXXXXXgxpFsLtLz -e FORCE_BUILD=on dustynv/tensorrt_llm:0.12-r36.4.0 bash -c '
cd /opt/TensorRT-LLM/examples/phi && python3 convert_checkpoint.py   --model_dir microsoft/phi-2   --output_dir /data/models/tensorrt_llm/phi-2   --dtype float16   --use_weight_only   --weight_only_precision int4
'
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
0.12.0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.40it/s]
Total time of converting checkpoints: 00:00:58
ajeetraina@ajeetraina-desktop:~/jetson-containers$

The script completed without any errors, which means your Phi-2 model has now been converted to TensorRT-LLM checkpoint format with int4 quantization.

The Phi-2 model is specifically known to work well on resource-constrained devices like the Jetson Orin. After running this script, the converted model will be stored in /data/models/tensorrt_llm/phi-2 on your system where you can use it for inference.

ls -ls data/models/tensorrt_llm/phi-2/
total 1749412
      4 -rw-r--r-- 1 root root       1080 Mar  9 21:38 added_tokens.json
      4 -rw-r--r-- 1 root root       1153 Mar 10 10:26 config.json
    448 -rw-r--r-- 1 root root     456318 Mar  9 21:38 merges.txt
1744680 -rw-r--r-- 1 root root 1786545232 Mar 10 10:26 rank0.safetensors
      4 -rw-r--r-- 1 root root        441 Mar  9 21:38 special_tokens_map.json
   3484 -rw-r--r-- 1 root root    3564952 Mar  9 21:38 tokenizer.json
      8 -rw-r--r-- 1 root root       7373 Mar  9 21:38 tokenizer_config.json
    780 -rw-r--r-- 1 root root     798156 Mar  9 21:38 vocab.json
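
To sanity-check the converted checkpoint, you can open config.json and rank0.safetensors from inside the container (where /data is mounted). A small sketch, assuming the safetensors package is available there (it ships as a dependency of transformers):

import json
from safetensors import safe_open

model_dir = "/data/models/tensorrt_llm/phi-2"

# The conversion config records dtype, quantization mode, architecture, etc.
with open(f"{model_dir}/config.json") as f:
    config = json.load(f)
print(json.dumps(config.get("quantization", config), indent=2))

# List a few of the converted weight tensors with their shapes and dtypes
with safe_open(f"{model_dir}/rank0.safetensors", framework="pt") as f:
    for name in list(f.keys())[:5]:
        t = f.get_tensor(name)
        print(name, tuple(t.shape), t.dtype)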

Finding where TensorRT-LLM is installed

jetson-containers run dustynv/tensorrt_llm:0.12-r36.4.0 python3 -c "import tensorrt_llm; print(f'TensorRT-LLM is installed at: {tensorrt_llm.__path__}')"
V4L2_DEVICES:
+ sudo docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/ajeetraina/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 --name jetson_container_20250310_103154 dustynv/tensorrt_llm:0.12-r36.4.0 python3 -c 'import tensorrt_llm; print(f'\''TensorRT-LLM is installed at: {tensorrt_llm.__path__}'\'')'
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
TensorRT-LLM is installed at: ['/usr/local/lib/python3.10/dist-packages/tensorrt_llm']

According to the output, TensorRT-LLM is installed at /usr/local/lib/python3.10/dist-packages/tensorrt_llm in the container, which is the standard Python package location. 

The model directory has the required files, including rank0.safetensors (the converted weights), but no TensorRT engine file yet; building an engine is a separate step. For now, let’s continue with a direct PyTorch inference approach instead.

Let’s create a simple inference script based on what we’ve learned:

#!/bin/bash

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 -c "
import os
import torch
from transformers import AutoTokenizer
import tensorrt_llm

print(f'TensorRT-LLM version: {tensorrt_llm.__version__}')
print('Modules available in tensorrt_llm.runtime:')
import tensorrt_llm.runtime
for name in dir(tensorrt_llm.runtime):
    if not name.startswith('_'):
        print(f'  - {name}')

# List contents of your model directory
model_dir = '/data/models/tensorrt_llm/phi-2'
print(f'\nContents of {model_dir}:')
for file in os.listdir(model_dir):
    print(f'  - {file}')

# Let's try a direct approach with PyTorch for inference
print('\nFalling back to direct PyTorch inference...')
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('microsoft/phi-2')
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2', 
    torch_dtype=torch.float16, 
    device_map='auto'
)

input_text = 'What is a Jetson Orin?'
print(f'\nInput: {input_text}')

input_ids = tokenizer.encode(input_text, return_tensors='pt').to('cuda')
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=100)
output_text = tokenizer.decode(output_ids[0].tolist())

print('\nOutput:')
print(output_text)
"

This script:

  1. Shows the available modules in TensorRT-LLM
  2. Lists the contents of your model directory
  3. Falls back to direct PyTorch inference with the original Phi-2 model

This should work regardless of TensorRT-LLM issues. Save it as fallback_phi.sh, make it executable, and run it.

Running inference with the Phi-2 model directly using PyTorch

#!/bin/bash

jetson-containers run \
  dustynv/pytorch:r36.4.0 \
  python3 -c "
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print('Loading Phi-2 model...')
tokenizer = AutoTokenizer.from_pretrained('microsoft/phi-2')
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2', 
    torch_dtype=torch.float16, 
    device_map='auto'
)

def generate_response(prompt):
    print(f'Input: {prompt}')
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=150,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

# Test with Jetson Orin question
prompt = 'What is a Jetson Orin?'
response = generate_response(prompt)

print('\\nOutput:')
print(response)
"

The output below shows a successful inference run on the Jetson Orin. Despite the warnings about attention masks and padding tokens, the model responded to the question, although, as we’ll see, the answer isn’t accurate.

Falling back to direct PyTorch inference...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.59s/it]
[03/10/2025-10:36:04] Some parameters are on the meta device because they were offloaded to the cpu.

Input: What is a Jetson Orin?
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Output:
What is a Jetson Orin?
A Jetson Orin is a type of robot that can help people with their daily tasks. It was created by a company called Jetson, Inc. and is based on a popular TV show called The Jetsons. The Jetson Orin is a small robot that can move around and do things like clean, cook, and even talk to people. It is designed to make life easier for people by taking care of some of the things they have to do every day.

The Phi-2 model provided an incorrect response about what a Jetson Orin is. The model confused it with a fictional robot from “The Jetsons” TV show, when in reality, Jetson Orin is NVIDIA’s edge AI computing platform.

To fix this issue, there are several approaches we can take:

  1. Use prompt engineering to guide the model toward a more accurate response by providing context in the prompt itself.
  2. Implement a domain-specific knowledge injection by adding factual information to the prompt.
  3. Add a RAG (Retrieval-Augmented Generation) component to provide the model with accurate information.

Here’s a modified script that implements prompt engineering to get a more accurate response:

#!/bin/bash

jetson-containers run \
  dustynv/pytorch:r36.4.0 \
  python3 -c "
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print('Loading Phi-2 model...')
tokenizer = AutoTokenizer.from_pretrained('microsoft/phi-2')
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2', 
    torch_dtype=torch.float16, 
    device_map='auto'
)

def generate_response(prompt):
    print(f'Input: {prompt}')
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=True,
            temperature=0.3,  # Lower temperature for more factual responses
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

# Improved prompt with factual context
prompt = '''You are a technical assistant that provides accurate information about NVIDIA products.
Fact: Jetson Orin is NVIDIA's edge AI computing platform designed for running AI workloads on edge devices.
Fact: Jetson Orin features Ampere architecture GPU cores and Arm CPU cores.
Fact: Jetson Orin is used in robotics, autonomous machines, and edge computing scenarios.

Question: What is a Jetson Orin?
Answer: '''

response = generate_response(prompt)

print('\\nOutput:')
# Only print the part after 'Answer: ' to get the clean response
answer_start = response.find('Answer: ')
if answer_start != -1:
    print(response[answer_start + 8:])
else:
    print(response)
"

This improved script:

  1. Adds factual context about Jetson Orin before the question
  2. Lowers the temperature parameter to 0.3 for less creative and more factual responses
  3. Extracts only the relevant part of the response (after “Answer:”)
  4. Provides a clearer instruction to act as a technical assistant

Here’s a revised version of the script that worked for me:

#!/bin/bash

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 -c "
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print('Loading Phi-2 model...')
tokenizer = AutoTokenizer.from_pretrained('microsoft/phi-2')
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/phi-2', 
    torch_dtype=torch.float16, 
    device_map='auto'
)

def generate_response(prompt):
    print(f'Input: {prompt}')
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=True,
            temperature=0.3,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

# Improved prompt with factual context
prompt = '''You are a technical assistant that provides accurate information about NVIDIA products.
Fact: Jetson Orin is NVIDIA\\'s edge AI computing platform designed for running AI workloads on edge devices.
Fact: Jetson Orin features Ampere architecture GPU cores and Arm CPU cores.
Fact: Jetson Orin is used in robotics, autonomous machines, and edge computing scenarios.

Question: What is a Jetson Orin?
Answer: '''

response = generate_response(prompt)

print('\\nOutput:')
# Only print the part after 'Answer: ' to get the clean response
answer_start = response.find('Answer: ')
if answer_start != -1:
    print(response[answer_start + 8:])
else:
    print(response)
"
./convert_phi.sh
V4L2_DEVICES:
+ sudo docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/ajeetraina/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-7 --device /dev/i2c-9 --name jetson_container_20250311_080243 dustynv/tensorrt_llm:0.12-r36.4.0 python3 -c '
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

print('\''Loading Phi-2 model...'\'')
tokenizer = AutoTokenizer.from_pretrained('\''microsoft/phi-2'\'')
model = AutoModelForCausalLM.from_pretrained(
    '\''microsoft/phi-2'\'',
    torch_dtype=torch.float16,
    device_map='\''auto'\''
)

def generate_response(prompt):
    print(f'\''Input: {prompt}'\'')
    inputs = tokenizer(prompt, return_tensors='\''pt'\'').to('\''cuda'\'')

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=True,
            temperature=0.3,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    response = tokenizer.decode(output[0], skip_special_tokens=True)
    return response

# Improved prompt with factual context
prompt = '\'''\'''\''You are a technical assistant that provides accurate information about NVIDIA products.
Fact: Jetson Orin is NVIDIA\'\''s edge AI computing platform designed for running AI workloads on edge devices.
Fact: Jetson Orin features Ampere architecture GPU cores and Arm CPU cores.
Fact: Jetson Orin is used in robotics, autonomous machines, and edge computing scenarios.

Question: What is a Jetson Orin?
Answer: '\'''\'''\''

response = generate_response(prompt)

print('\''\nOutput:'\'')
# Only print the part after '\''Answer: '\'' to get the clean response
answer_start = response.find('\''Answer: '\'')
if answer_start != -1:
    print(response[answer_start + 8:])
else:
    print(response)
'
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Loading Phi-2 model...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.67s/it]
Some parameters are on the meta device because they were offloaded to the cpu.
Input: You are a technical assistant that provides accurate information about NVIDIA products.
Fact: Jetson Orin is NVIDIA's edge AI computing platform designed for running AI workloads on edge devices.
Fact: Jetson Orin features Ampere architecture GPU cores and Arm CPU cores.
Fact: Jetson Orin is used in robotics, autonomous machines, and edge computing scenarios.

Question: What is a Jetson Orin?
Answer:


Jetson Orin is NVIDIA's edge AI computing platform designed for running AI workloads on edge devices. It features Ampere architecture GPU cores and Arm CPU cores. Jetson Orin is used in robotics, autonomous machines, and edge computing scenarios.

This approach uses prompt engineering to guide the model toward providing an accurate technical description of the Jetson Orin platform, rather than confusing it with a fictional robot. The facts provided in the prompt will help steer the model’s response in the right direction.
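
To push the third approach a little further, a toy retrieval step can select only the facts relevant to a question before the prompt is built. The sketch below uses simple keyword overlap as a stand-in for a real embedding search; the FACTS list and helper names are just for illustration:

# Minimal retrieval-augmented prompting sketch: score a small fact store by
# keyword overlap with the question and inject only the best matches.
FACTS = [
    "Jetson Orin is NVIDIA's edge AI computing platform for running AI workloads on edge devices.",
    "Jetson Orin features Ampere architecture GPU cores and Arm CPU cores.",
    "Jetson Orin is used in robotics, autonomous machines, and edge computing scenarios.",
    "TensorRT-LLM is an open-source library that optimizes LLM inference on NVIDIA GPUs.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    q_words = set(question.lower().split())
    scored = sorted(FACTS, key=lambda f: len(q_words & set(f.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(f"Fact: {fact}" for fact in retrieve(question))
    return (
        "You are a technical assistant that provides accurate information about NVIDIA products.\n"
        f"{context}\n\nQuestion: {question}\nAnswer: "
    )

print(build_prompt("What is a Jetson Orin?"))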

Conclusion

As LLMs continue to grow in size and complexity, tools like TensorRT-LLM become increasingly essential for practical deployments. By significantly reducing inference costs and latency, this technology is helping democratize access to state-of-the-art AI capabilities.

Whether you’re deploying models for internal use or building customer-facing applications, TensorRT-LLM offers a powerful path to maximize the performance of your large language models on NVIDIA hardware.
