NVIDIA Accelerates OpenAI gpt&#x2d;oss Models Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72

Anu Srivastava

NVIDIA and OpenAI began pushing the boundaries of AI with the launch of NVIDIA DGX back in 2016. The collaborative AI innovation continues with the OpenAI gpt…

NVIDIA

•

Anu Srivastava

•6 min read•intermediate•

--

•View Original

DockerHugging FaceOllamaPythonTransformerTransformers

Overview

NVIDIA has optimized OpenAI's gpt-oss models for accelerated inference performance on the NVIDIA GB200 NVL72 system, achieving up to 1.5 million tokens per second (TPS). The article discusses the architecture, training, and deployment strategies for these models, highlighting their capabilities and integration with various frameworks.

What You'll Learn

1

How to deploy OpenAI gpt-oss models using vLLM

2

How to optimize inference performance with TensorRT-LLM

3

Why using NVIDIA Dynamo improves performance for long input sequences

4

How to run AI models locally on NVIDIA GeForce RTX AI PCs

Prerequisites & Requirements

Understanding of AI model deployment and inference optimization
Familiarity with NVIDIA software frameworks like TensorRT-LLM and vLLM(optional)

Key Questions Answered

What is the performance capability of the NVIDIA GB200 NVL72 system with gpt-oss models?

The NVIDIA GB200 NVL72 system can deliver up to 1.5 million tokens per second (TPS) for the gpt-oss-120b model, enabling approximately 50,000 concurrent users. This high performance is achieved through advanced architectural features of the Blackwell platform.

How does NVIDIA optimize the gpt-oss models for inference?

NVIDIA optimizes the gpt-oss models by leveraging the Blackwell architecture, utilizing FP4 precision, and integrating with frameworks like TensorRT-LLM and vLLM. These optimizations enhance performance and reduce latency during inference.

What are the main features of the gpt-oss models?

The gpt-oss models feature text-reasoning capabilities, chain-of-thought processing, and tool-calling abilities. They utilize a mixture of experts (MoE) architecture and support a context length of 128k tokens.

What deployment options are available for the gpt-oss models?

Developers can deploy the gpt-oss models using various options, including vLLM, TensorRT-LLM, NVIDIA Dynamo, and locally on NVIDIA GeForce RTX AI PCs. Each option provides unique benefits for performance and ease of use.

Key Statistics & Figures

Inference performance

1.5 million tokens per second

TPS

Training hours for gpt-oss-120b

over 2.1 million hours

This extensive training time reflects the complexity and scale of the model.

Training hours for gpt-oss-20b

about 10x less than gpt-oss-120b

Indicating a significant difference in resource requirements between the two models.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Hardware

Nvidia Blackwell

Used for optimized inference performance of gpt-oss models.

Software

Tensorrt-llm

Framework for deploying and optimizing large language models.

Software

Vllm

Framework for serving large language models with optimized performance.

Software

Nvidia Dynamo

Open-source inference serving platform for large-scale applications.

Key Actionable Insights

1
Leverage NVIDIA's optimized kernels for deploying gpt-oss models to achieve high performance.
Using optimized kernels can significantly enhance inference speed and efficiency, especially in data center environments where performance is critical.

2
Utilize NVIDIA Dynamo for large-scale applications to improve interactivity and throughput.
Dynamo's disaggregated architecture allows for better resource utilization, making it ideal for applications requiring long input sequences without compromising performance.

3
Experiment with local deployments on NVIDIA GeForce RTX AI PCs for faster iteration cycles.
Running models locally can reduce latency and enhance data privacy, allowing developers to iterate more quickly during the development process.

Common Pitfalls

1

Failing to optimize model deployment can lead to suboptimal performance.

Without leveraging NVIDIA's optimized kernels and frameworks, developers may experience increased latency and reduced throughput, impacting user experience.

2

Neglecting to test models locally can hinder development speed.

Running models locally allows for faster iterations and debugging, which is crucial for agile development processes.

Related Concepts

Nvidia GPU Architectures

Large Language Model Optimization Techniques

AI Model Deployment Strategies

The Gemma 3n model has been fully released, building on the success of previous Gemma models and bringing advanced on-device multimodal capabilities to edge devices with unprecedented performance. Explore Gemma 3n's innovations, including its mobile-first architecture, MatFormer technology, Per-Layer Embeddings, KV Cache Sharing, and new audio and MobileNet-V5 vision encoders, and how developers can start building with it today.

DockerHugging FaceTransformers

9 min read

Includes Code

Has Summary

--

NVIDIA

Advanced

Real-Time Natural Language Understanding with BERT Using TensorRT

Large scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought about exciting leaps in state-of-the-art accuracy for many natural language…

DockerGoogle CloudTransformers

19 min read

Includes Code

Has Summary

--

These articles from NVIDIA and other leading engineering teams share similar topics with "NVIDIA Accelerates OpenAI gpt-oss Models Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72". Explore more engineering insights on Docker, Hugging Face, Google Cloud.