Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Complete GPT OSS Tutorial: How to Setup, Deploy & Optimize OpenAI’s Open Source Models


What is GPT OSS? Understanding OpenAI’s Open Source Revolution

OpenAI’s GPT OSS represents a groundbreaking shift toward open-source AI, offering developers two powerful mixture-of-experts (MoE) models:

  • gpt-oss-120b: 117B total parameters (5.1B active)
  • gpt-oss-20b: 21B total parameters (3.6B active)

Both models use innovative 4-bit quantization (MXFP4) for efficient inference while maintaining high performance. The large model runs on a single H100 GPU, while the smaller version fits in just 16GB of memory.
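A rough back-of-envelope check makes these numbers concrete (weights only, ignoring activations and the KV cache; MXFP4 packs most weights into about 4 bits each, plus a small overhead for block scales that we ignore here):

```python
def approx_weights_gb(total_params_billion, bits_per_param):
    """Rough weight-memory estimate: parameter count x bits per parameter."""
    return total_params_billion * 1e9 * bits_per_param / 8 / 1e9

# MXFP4 stores most weights in ~4 bits each
print(approx_weights_gb(117, 4))  # 58.5 -> fits on a single 80GB H100
print(approx_weights_gb(21, 4))   # 10.5 -> fits in 16GB of memory
```

Compare with bf16 at 16 bits per parameter: the 20B model alone would need roughly 42GB, which is why quantization is what makes consumer hardware viable.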

Key Benefits for Developers

  • Apache 2.0 License – Complete commercial freedom
  • Consumer Hardware Compatible – 20B model runs on 16GB GPUs
  • Advanced Reasoning – Chain-of-thought capabilities
  • Tool Integration – Native function calling support
  • Multiple Deployment Options – Local, cloud, and API access


Prerequisites: What You Need Before Starting

Hardware Requirements

For GPT OSS 20B:

  • GPU: 16GB+ VRAM (RTX 4090, RTX 5090, or data center GPUs)
  • RAM: 32GB system memory recommended
  • Storage: 50GB+ free space

For GPT OSS 120B:

  • GPU: H100 (80GB) or H200 recommended
  • RAM: 64GB+ system memory
  • Storage: 200GB+ free space
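Before installing anything, it is worth confirming your machine actually meets these numbers. A minimal disk-space check using only the standard library (the VRAM check is commented out because it requires a CUDA build of PyTorch, which is installed in the next step):

```python
import shutil

def enough_disk(path=".", required_gb=50):
    """Return (free_gb, ok) for the free space at `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb, free_gb >= required_gb

free_gb, ok = enough_disk(".", required_gb=50)
print(f"Free disk: {free_gb:.1f} GB -> {'OK for 20B' if ok else 'need more space'}")

# VRAM check, once PyTorch is installed:
# import torch
# print(torch.cuda.get_device_properties(0).total_memory / 1e9)
```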

Software Prerequisites

# Python 3.9 or higher
python --version

# CUDA-compatible PyTorch installation
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Method 1: Quick Start with Hugging Face Inference API

Step 1: Get Your Hugging Face Token

  1. Visit huggingface.co
  2. Create account or login
  3. Go to Settings → Access Tokens
  4. Create new token with “Read” permissions
  5. Copy and save your token securely

Step 2: Install Required Packages

pip install openai python-dotenv

Step 3: Set Up Environment Variables

Create a .env file in your project directory:

# .env
HF_TOKEN=your_hugging_face_token_here

Step 4: Basic API Usage

import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize client
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

# Make your first request
completion = client.chat.completions.create(
    model="openai/gpt-oss-20b:fireworks-ai",  # or cerebras, together-ai
    messages=[
        {
            "role": "user", 
            "content": "Explain quantum computing in simple terms"
        }
    ],
    max_tokens=500,
    temperature=0.7
)

print(completion.choices[0].message.content)

Step 5: Advanced API Usage with Responses API

# Using the Responses API for finer control over reasoning
response = client.responses.create(
    model="openai/gpt-oss-120b:cerebras",
    input="Write a Python function to calculate fibonacci numbers",
    reasoning={"effort": "medium"},  # low, medium, high
)

print(f"Response: {response.output_text}")
print(f"Reasoning tokens: {response.usage.output_tokens_details.reasoning_tokens}")

Method 2: Local Installation and Inference

Step 1: Install Core Dependencies

# Install latest transformers with all optimizations
pip install --upgrade "transformers>=4.55" accelerate kernels

# For optimal performance with Hopper/Blackwell GPUs
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/test/cu128
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Step 2: Basic Local Inference Setup

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Choose your model size
MODEL_ID = "openai/gpt-oss-20b"  # or "openai/gpt-oss-120b"

# Load tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load model with automatic optimization
print("Loading model (this may take several minutes)...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",           # Automatic GPU placement
    torch_dtype="auto",          # Automatic precision selection
)

print(f"Model loaded on: {model.device}")
print(f"Model dtype: {model.dtype}")

Step 3: Create a Chat Interface

def chat_with_gpt_oss(user_message, conversation_history=None):
    """
    Simple chat interface for GPT OSS models
    """
    if conversation_history is None:
        conversation_history = []
    
    # Add user message to history
    conversation_history.append({
        "role": "user", 
        "content": user_message
    })
    
    # Prepare input
    inputs = tokenizer.apply_chat_template(
        conversation_history,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    
    # Generate response
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1,
        )
    
    # Decode response
    response = tokenizer.decode(
        generated[0][inputs["input_ids"].shape[-1]:], 
        skip_special_tokens=True
    )
    
    # Add to conversation history
    conversation_history.append({
        "role": "assistant", 
        "content": response
    })
    
    return response, conversation_history

# Example usage
response, history = chat_with_gpt_oss(
    "Write a Python script to scrape a website"
)
print(response)

Step 4: Enable Flash Attention 3 (H100/H200 Only)

# Load the model with Flash Attention 3 enabled for maximum speed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    # Enable Flash Attention 3 with attention sinks
    attn_implementation="flash_attention_3",
)

Method 3: Production Deployment with vLLM

Step 1: Install vLLM

pip install "vllm>=0.6.5"

Step 2: Create Production Server

# server.py
from vllm import LLM, SamplingParams
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

# Initialize FastAPI
app = FastAPI(title="GPT OSS API Server")

# Initialize vLLM engine
llm = LLM(
    model="openai/gpt-oss-20b",
    tensor_parallel_size=1,  # Adjust based on your GPU count
    max_model_len=32768,     # Context length
    dtype="auto",
    quantization="mxfp4",    # Use quantization for efficiency
)

class ChatRequest(BaseModel):
    message: str
    max_tokens: int = 512
    temperature: float = 0.7

@app.post("/chat")
async def chat_endpoint(request: ChatRequest):
    try:
        sampling_params = SamplingParams(
            temperature=request.temperature,
            max_tokens=request.max_tokens,
            stop_token_ids=[llm.get_tokenizer().eos_token_id],
        )
        
        # Format with the model's chat template (gpt-oss expects its
        # harmony chat format, not a bare "User:/Assistant:" prompt)
        prompt = llm.get_tokenizer().apply_chat_template(
            [{"role": "user", "content": request.message}],
            tokenize=False,
            add_generation_prompt=True,
        )
        
        outputs = llm.generate([prompt], sampling_params)
        response = outputs[0].outputs[0].text
        
        return {"response": response}
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Step 3: Launch Production Server

# Start the server
python server.py

# Test your API
curl -X POST "http://localhost:8000/chat" \
     -H "Content-Type: application/json" \
     -d '{"message": "Hello, how are you?", "max_tokens": 256}'
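The same endpoint can also be called from Python. Here is a minimal stdlib-only client sketch, assuming the server above is running on localhost:8000 (`build_payload` is just a helper mirroring the `ChatRequest` schema):

```python
import json
import urllib.request

def build_payload(message, max_tokens=256, temperature=0.7):
    """Mirror the ChatRequest schema of the /chat endpoint."""
    return {"message": message, "max_tokens": max_tokens, "temperature": temperature}

def chat(message, url="http://localhost:8000/chat", **kwargs):
    """POST a chat request and return the model's reply text."""
    data = json.dumps(build_payload(message, **kwargs)).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires the server to be running:
# print(chat("Hello, how are you?", max_tokens=128))
```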

Advanced Optimization Techniques

Memory Optimization

# For memory-constrained environments
import torch

# Enable gradient checkpointing (saves memory during training only;
# it has no effect on pure inference)
model.gradient_checkpointing_enable()

# Use CPU offloading for large models
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    offload_folder="./offload",  # Offload to disk
    max_memory={0: "15GB", "cpu": "30GB"},
)

# Clear cache regularly
torch.cuda.empty_cache()

Batch Processing for Efficiency

def batch_generate(prompts, batch_size=4):
    """Process multiple prompts efficiently"""
    # Decoder-only models need left padding so batched generations align
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    tokenizer.padding_side = "left"
    results = []
    
    for i in range(0, len(prompts), batch_size):
        batch_prompts = prompts[i:i + batch_size]
        
        # Tokenize batch
        inputs = tokenizer(
            batch_prompts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=2048
        ).to(model.device)
        
        # Generate batch
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                pad_token_id=tokenizer.eos_token_id,
                do_sample=True,
                temperature=0.7,
            )
        
        # Decode results
        batch_results = []
        for j, output in enumerate(outputs):
            start_idx = inputs.input_ids[j].shape[0]
            decoded = tokenizer.decode(
                output[start_idx:], 
                skip_special_tokens=True
            )
            batch_results.append(decoded)
        
        results.extend(batch_results)
    
    return results

# Example usage
prompts = [
    "Explain machine learning",
    "Write a haiku about coding",
    "Describe the solar system",
    "What is blockchain?"
]

responses = batch_generate(prompts)

Fine-tuning GPT OSS Models

Step 1: Prepare Your Dataset

# Create training dataset
import json

def create_training_data(conversations):
    """Convert conversations to training format"""
    training_data = []
    
    for conv in conversations:
        formatted = {
            "messages": [
                {"role": "user", "content": conv["input"]},
                {"role": "assistant", "content": conv["output"]}
            ]
        }
        training_data.append(formatted)
    
    # Save as JSONL
    with open("training_data.jsonl", "w") as f:
        for item in training_data:
            f.write(json.dumps(item) + "\n")

# Example dataset
sample_data = [
    {
        "input": "How do I center a div in CSS?",
        "output": "You can center a div using flexbox:\n```css\n.parent {\n  display: flex;\n  justify-content: center;\n  align-items: center;\n}\n```"
    },
    # Add more examples...
]

create_training_data(sample_data)

Step 2: Fine-tuning Script

from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    device_map="auto",
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
tokenizer.pad_token = tokenizer.eos_token

# Configure LoRA for efficient fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Load and preprocess data
dataset = load_dataset("json", data_files="training_data.jsonl")

def preprocess_function(examples):
    # Format as chat templates
    texts = []
    for messages in examples["messages"]:
        text = tokenizer.apply_chat_template(
            messages, 
            tokenize=False, 
            add_generation_prompt=False
        )
        texts.append(text)
    
    return tokenizer(texts, truncation=True, max_length=2048)

tokenized_dataset = dataset.map(
    preprocess_function, 
    batched=True, 
    remove_columns=dataset["train"].column_names
)

# Training configuration
training_args = TrainingArguments(
    output_dir="./gpt-oss-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="no",
)

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=False
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
)

# Start fine-tuning
print("Starting fine-tuning...")
trainer.train()

# Save the model
model.save_pretrained("./gpt-oss-finetuned")
tokenizer.save_pretrained("./gpt-oss-finetuned")

Deployment on Cloud Platforms

AWS SageMaker Deployment

# deploy_sagemaker.py
import boto3
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Create Hugging Face Model
huggingface_model = HuggingFaceModel(
    transformers_version="4.55",
    pytorch_version="2.7",
    py_version="py310",
    model_data="s3://your-bucket/gpt-oss-model/",
    role=role,
    env={
        "HF_MODEL_ID": "openai/gpt-oss-20b",
        "HF_TASK": "text-generation",
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
    }
)

# Deploy to endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # Adjust based on model size
    endpoint_name="gpt-oss-endpoint"
)

# Test the endpoint
response = predictor.predict({
    "inputs": "Write a Python function to calculate prime numbers",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "do_sample": True
    }
})

print(response)

Docker Deployment

# Dockerfile
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install --upgrade pip

# Install model dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy application code
COPY app/ /app/
WORKDIR /app

# Expose port
EXPOSE 8000

# Run the application
CMD ["python3", "server.py"]

# requirements.txt
torch>=2.7.0
transformers>=4.55.0
vllm>=0.6.5
fastapi
uvicorn
accelerate
kernels
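With the Dockerfile and requirements.txt in place, the image can be built and run along these lines (the image name `gpt-oss-server` is arbitrary, and `--gpus all` assumes the NVIDIA Container Toolkit is installed on the host):

```shell
# Build the image from the directory containing the Dockerfile
docker build -t gpt-oss-server .

# Run with GPU access, exposing the FastAPI port from server.py
docker run --gpus all -p 8000:8000 gpt-oss-server
```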

Troubleshooting Common Issues

Issue 1: Out of Memory Errors

Solution: Offload weights to CPU/disk; enable gradient checkpointing when fine-tuning

# Gradient checkpointing reduces memory during training
# (it does not help at inference time)
model.gradient_checkpointing_enable()

# Use CPU offloading
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    max_memory={0: "14GB", "cpu": "32GB"},
    offload_folder="./offload_weights"
)

Issue 2: Slow Inference Speed

Solution: Enable optimizations and use appropriate hardware

# Use Flash Attention on compatible hardware
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    attn_implementation="flash_attention_3",
    torch_dtype=torch.float16,
)

# Enable compilation for repeated inference
model = torch.compile(model, mode="reduce-overhead")

Issue 3: Installation Problems

Common fixes:

# Update all packages
pip install --upgrade pip setuptools wheel

# Install with specific CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Clear pip cache if needed
pip cache purge

Performance Benchmarks and Best Practices

Model Performance Comparison

| Model | VRAM Usage | Inference Speed | Quality Score |
|---|---|---|---|
| gpt-oss-20b (mxfp4) | 16GB | 45 tokens/sec | 8.5/10 |
| gpt-oss-20b (bf16) | 48GB | 35 tokens/sec | 9.0/10 |
| gpt-oss-120b (mxfp4) | 80GB | 25 tokens/sec | 9.5/10 |
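To translate these decode speeds into serving capacity, a quick conversion helps (assuming ~256-token average responses and ignoring prompt-processing time, so treat the results as upper bounds for a single unbatched stream):

```python
def responses_per_minute(tokens_per_sec, avg_response_tokens=256):
    """Convert decode throughput into a rough per-minute response rate."""
    return tokens_per_sec * 60 / avg_response_tokens

print(responses_per_minute(45))  # ~10.5 responses/min for gpt-oss-20b (mxfp4)
print(responses_per_minute(25))  # ~5.9 responses/min for gpt-oss-120b (mxfp4)
```

Continuous batching (as in vLLM below) multiplies these single-stream numbers considerably.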

Best Practices for Production

  1. Use vLLM for serving – 2-3x faster than native transformers
  2. Enable quantization – Reduces memory usage by 75%
  3. Batch requests – Process multiple requests simultaneously
  4. Monitor GPU temperature – Ensure adequate cooling
  5. Cache tokenizer – Avoid reloading on each request
  6. Use async processing – Better handling of concurrent requests

# Production-ready server configuration
from vllm import AsyncLLMEngine, AsyncEngineArgs
import asyncio

# Configure async engine
engine_args = AsyncEngineArgs(
    model="openai/gpt-oss-20b",
    tensor_parallel_size=1,
    max_model_len=32768,
    dtype="auto",
    quantization="mxfp4",
    max_num_seqs=32,  # Process up to 32 requests simultaneously
    enable_prefix_caching=True,  # Cache common prefixes
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

Conclusion

GPT OSS models represent a significant breakthrough in open-source AI, offering enterprise-grade capabilities with complete commercial freedom. This tutorial covered:

  • Quick API access via Hugging Face Inference
  • Local installation and optimization
  • Production deployment with vLLM
  • Fine-tuning techniques using LoRA
  • Cloud deployment strategies
  • Performance optimization tips

Next Steps

  1. Experiment with different model sizes to find the best fit for your use case
  2. Try fine-tuning on domain-specific data
  3. Scale deployment based on traffic requirements
  4. Monitor performance and optimize for your specific workload

Have Queries? Join https://launchpass.com/collabnix
