NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut

Zhihan Jiang

As large language models (LLMs) grow larger, they get smarter, with open models from leading developers now featuring hundreds of billions of parameters.

NVIDIA

10 min read • intermediate

GPT, Stable Diffusion, Whisper

Overview

The article discusses NVIDIA's Blackwell Ultra architecture, which sets new inference records in the MLPerf Inference v5.1 benchmark. It highlights the performance improvements achieved with the new architecture, particularly in handling large language models and reasoning tasks.

What You'll Learn

1. How to leverage the Blackwell Ultra architecture for improved AI inference performance

2. Why disaggregated serving enhances throughput for large language models

3. How to implement NVFP4 acceleration for low precision inference

Prerequisites & Requirements

Understanding of AI inference benchmarks and architectures
Familiarity with NVIDIA TensorRT and CUDA (optional)

Key Questions Answered

What performance improvements does Blackwell Ultra offer over previous architectures?

Compared with its predecessor, Blackwell, Blackwell Ultra provides 1.5x higher peak NVFP4 AI compute, 2x higher attention-layer compute, and 1.5x higher HBM3e capacity. This results in a 45% performance increase per GPU on the DeepSeek-R1 benchmark in the offline scenario.

How does disaggregated serving improve inference throughput?

Disaggregated serving decouples the context and generation phases of inference across different GPUs, allowing for independent optimization. This approach led to a nearly 1.5x increase in throughput per GPU compared to traditional aggregated serving methods.
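
To make the split concrete, here is a minimal, hypothetical sketch in Python. It is not NVIDIA's implementation: the names `Request`, `context_worker`, and `generation_worker` are invented for illustration, plain lists stand in for the KV cache, and a queue stands in for the transfer between the two GPU pools.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class Request:
    prompt: list[str]                                   # input tokens
    max_new_tokens: int
    kv_cache: list[str] = field(default_factory=list)   # stand-in for a real KV cache
    output: list[str] = field(default_factory=list)


def context_worker(inbox: Queue, handoff: Queue) -> None:
    """Context (prefill) pool: process the whole prompt in one pass."""
    while not inbox.empty():
        req = inbox.get()
        req.kv_cache = list(req.prompt)   # pretend: build cache entries for every prompt token
        handoff.put(req)                  # pretend: ship the cache to the generation pool


def generation_worker(handoff: Queue) -> list[Request]:
    """Generation (decode) pool: emit one token per step for each request."""
    done = []
    while not handoff.empty():
        req = handoff.get()
        for step in range(req.max_new_tokens):
            token = f"tok{step}"          # placeholder for a sampled token
            req.output.append(token)
            req.kv_cache.append(token)    # cache grows by one entry per generated token
        done.append(req)
    return done


inbox, handoff = Queue(), Queue()
inbox.put(Request(prompt=["What", "is", "MLPerf", "?"], max_new_tokens=3))
inbox.put(Request(prompt=["Hello"], max_new_tokens=2))

context_worker(inbox, handoff)            # would run on the context GPUs
finished = generation_worker(handoff)     # would run on the generation GPUs
for req in finished:
    print(len(req.prompt), "prompt tokens ->", req.output)
```

Because the two workers only share the handoff queue, each pool could in principle be sized and tuned separately, which is the property the answer above attributes to disaggregated serving.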

What is the significance of NVFP4 in the context of AI inference?

NVFP4 is a four-bit floating point format developed by NVIDIA that enables higher throughput for AI inference tasks. It allows for reduced model size while maintaining target accuracy, significantly enhancing performance when used with the Blackwell and Blackwell Ultra architectures.
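
As a rough illustration of how a four-bit floating point format with block scaling can work, the Python sketch below quantizes one block of 16 values to the E2M1 grid (1 sign bit, 2 exponent bits, 1 mantissa bit) with a shared scale. It is a simplification under stated assumptions, not NVIDIA's implementation: the scale is kept as an ordinary Python float, whereas NVFP4 stores each block scale in FP8 and adds a per-tensor FP32 scale, and the helper names and toy data are invented.

```python
# Magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # number of values sharing one scale factor


def quantize_block(block):
    """Return (scale, codes), where each code is a signed E2M1 grid value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values by multiplying each code by the block scale."""
    return [scale * c for c in codes]


weights = [0.02 * i - 0.15 for i in range(BLOCK_SIZE)]   # toy data, not real weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f} max abs error={max_err:.4f}")
```

With one 8-bit scale per 16 four-bit values, storage comes to (16 x 4 + 8) / 16 = 4.5 bits per value before the per-tensor scale, which is where the reduction in model size comes from.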

What benchmarks were included in MLPerf Inference v5.1?

MLPerf Inference v5.1 includes benchmarks such as DeepSeek-R1, Llama 3.1 405B, Llama 3.1 8B, and Whisper. Each benchmark has specific performance targets for tokens per second and time-to-first-token metrics.

Key Statistics & Figures

DeepSeek-R1 Offline Performance

5,842 tokens/second/GPU

Achieved using the GB300 NVL72 rack-scale system with Blackwell Ultra architecture.

DeepSeek-R1 Server Performance

2,907 tokens/second/GPU

This performance was recorded in the server scenario using the same architecture.

Performance Improvement Compared to Hopper

5x higher throughput per GPU

Blackwell Ultra delivered this improvement compared to unverified results on a Hopper-based system.
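
The per-GPU DeepSeek-R1 figures above can be scaled to a full rack with simple arithmetic. The short Python sketch below assumes the per-GPU numbers are rack throughput divided evenly across the 72 GPUs of a GB300 NVL72; the rack totals it prints are derived values, not separately reported results.

```python
GPUS_PER_RACK = 72  # GB300 NVL72 connects 72 GPUs in one rack-scale system

# Per-GPU DeepSeek-R1 throughput quoted in this article (tokens/second/GPU).
per_gpu_tokens_per_s = {
    "offline": 5842,
    "server": 2907,
}

for scenario, per_gpu in per_gpu_tokens_per_s.items():
    rack_total = per_gpu * GPUS_PER_RACK
    print(f"{scenario}: {per_gpu:,} tok/s/GPU x {GPUS_PER_RACK} = {rack_total:,} tok/s")

# How much throughput the latency-constrained server scenario retains.
server_share = per_gpu_tokens_per_s["server"] / per_gpu_tokens_per_s["offline"]
print(f"server throughput is {server_share:.0%} of offline")
```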

Technologies & Tools

Hardware

Blackwell Ultra

New GPU architecture enhancing AI inference performance.

Technology

NVFP4

Four-bit floating point format for efficient low precision inference.

Software

TensorRT

NVIDIA's library for optimizing and deploying AI models.

Software

CUDA

NVIDIA's parallel computing platform and programming model.

Key Actionable Insights

1. Adopting the Blackwell Ultra architecture can significantly enhance AI inference performance across various models.
   This is particularly beneficial for applications requiring high throughput and low latency, such as real-time language processing and reasoning tasks.

2. Implementing disaggregated serving can optimize resource utilization in large language model deployments.
   By separating context and generation phases, organizations can achieve better performance and responsiveness, especially for complex inference tasks.

3. Utilizing NVFP4 acceleration can lead to substantial improvements in low precision inference tasks.
   This technique is crucial for developers looking to maximize throughput while maintaining accuracy in AI applications.

Common Pitfalls

1. Co-locating context and generation phases of inference can lead to inefficient resource use.
   This happens because the two phases have different resource requirements and service level agreements, which can cause bottlenecks if not managed properly.
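
One commonly cited reason the two phases differ, which the article itself does not spell out, is arithmetic intensity: the context phase pushes every prompt token through the weights in a single pass, while each generation step handles only one new token per sequence. The back-of-envelope Python sketch below counts weight traffic only (it ignores the KV cache and activations), and the workload sizes are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weights read for one pass through a dense layer.

    Each weight contributes one multiply and one add (2 FLOPs) per token,
    and is read from memory once per pass regardless of the token count.
    """
    flops_per_weight = 2 * tokens_per_pass
    return flops_per_weight / bytes_per_weight


BYTES_PER_WEIGHT = 0.5  # 4-bit weights, ignoring scale-factor overhead

# Hypothetical workload: one 2,048-token prompt vs. a decode step over 32 sequences.
context_ai = arithmetic_intensity(tokens_per_pass=2048, bytes_per_weight=BYTES_PER_WEIGHT)
generation_ai = arithmetic_intensity(tokens_per_pass=32, bytes_per_weight=BYTES_PER_WEIGHT)

print(f"context:    {context_ai:,.0f} FLOPs per weight byte")
print(f"generation: {generation_ai:,.0f} FLOPs per weight byte")
print(f"ratio:      {context_ai / generation_ai:.0f}x")
```

Under these assumptions the context pass does far more compute per byte of weights than a generation step, so hardware and batching choices that suit one phase can leave the other underused when both share the same GPUs.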

Related Concepts

AI Inference Benchmarks

Large Language Models

Performance Optimization Techniques