NVIDIA Blackwell Delivers up to 2.6x Higher Performance in MLPerf Training v5.0

Sukru Burc Eryilmaz

The journey to create a state-of-the-art large language model (LLM) begins with a process called pretraining. Pretraining a state-of-the-art model is…

NVIDIA


BERT, Natural Language Processing, Stable Diffusion, Transformer

Overview

The article discusses the performance improvements delivered by NVIDIA's Blackwell architecture in MLPerf Training v5.0, showcasing up to 2.6x higher performance in various benchmarks, including large language model (LLM) pretraining and fine-tuning. It highlights architectural innovations, optimizations, and the capabilities of the new GB200 NVL72 system.

What You'll Learn

1

How to leverage the NVIDIA Blackwell architecture for LLM pretraining

2

Why optimizing GPU utilization is crucial for model training efficiency

3

When to apply model parallelism for large-scale training

4

How to implement CUDA Graphs for improved training performance

Prerequisites & Requirements

Understanding of large language models and their training processes
Familiarity with the NVIDIA software stack, including cuDNN and cuBLAS (optional)

Key Questions Answered

What performance improvements does NVIDIA Blackwell provide in MLPerf Training v5.0?

NVIDIA Blackwell architecture delivers up to 2.6x higher performance in MLPerf Training v5.0 across various benchmarks, including LLM pretraining and fine-tuning. The GB200 NVL72 system achieved significant speedups, such as 2.2x faster training for Llama 3.1 405B compared to the previous Hopper architecture.

How does the GB200 NVL72 system optimize training throughput?

The GB200 NVL72 system optimizes training throughput through architectural enhancements like increased NVLink domain size, optimized model parallelism, and the use of CUDA Graphs. These innovations allow for better GPU utilization and reduced memory overhead during training.

What benchmarks were included in MLPerf Training v5.0?

MLPerf Training v5.0 includes benchmarks for LLM pretraining, LLM fine-tuning, text-to-image generation, recommender systems, graph neural networks, natural language processing, and object detection. NVIDIA achieved the fastest training times across all seven benchmarks.

What optimizations were made for LLM fine-tuning with Blackwell?

For LLM fine-tuning, optimizations included using larger memory capacity to fit models on a single GPU, reducing communication overhead, and employing enhanced RMSNorm kernels. These changes resulted in a 2.5x faster training time for Llama 2 70B LoRA compared to the previous generation.
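LoRA fine-tuning trains only small low-rank adapter matrices while the base weights stay frozen, which is part of why memory capacity and communication dominate the tuning picture. The sketch below counts the trainable parameters LoRA adds to one weight matrix, using the standard formula r * (d_in + d_out). The dimension 8192 and rank 16 are illustrative values chosen here, not the benchmark's configuration.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two factors, A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen base weight matrix."""
    return d_in * d_out


# Illustrative values only (assumed, not taken from the benchmark).
d, r = 8192, 16
added = lora_params(d, d, r)
frozen = full_params(d, d)
print(f"adapter params: {added}, frozen params: {frozen}, ratio: {added / frozen:.4%}")
```

For a square projection the ratio simplifies to 2r / d, so the adapter stays a small fraction of the frozen matrix as long as the rank is much smaller than the hidden size.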

Key Statistics & Figures

Performance increase in LLM pretraining

2.2x

Speedup of the Blackwell architecture over Hopper when training Llama 3.1 405B on 512 GPUs.

Performance increase in LLM fine-tuning

2.5x

Blackwell GPUs achieved this speedup for Llama 2 70B LoRA compared to an NVIDIA DGX H100 system.

Performance increase in text-to-image pretraining

2.6x

Blackwell GPUs delivered this improvement over H100 GPUs in the Stable Diffusion v2 benchmark.
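To read these figures as time-to-train: a speedup of s means the same run finishes in 1/s of the baseline time, so the share of time saved is 1 - 1/s. The short sketch below applies that arithmetic to the three factors listed above. The 100-minute baseline is a hypothetical number used only to make the conversion concrete.

```python
def speedup(baseline_minutes, new_minutes):
    """Speedup factor: how many times faster the new run is."""
    return baseline_minutes / new_minutes


def time_after_speedup(baseline_minutes, factor):
    """Time-to-train after applying a speedup factor."""
    return baseline_minutes / factor


def time_saved_fraction(factor):
    """Fraction of the baseline time that the speedup removes."""
    return 1 - 1 / factor


# Factors quoted in the statistics above; the baseline is hypothetical.
for name, factor in [("LLM pretraining", 2.2),
                     ("LLM fine-tuning", 2.5),
                     ("text-to-image", 2.6)]:
    minutes = time_after_speedup(100, factor)
    saved = time_saved_fraction(factor)
    print(f"{name}: 100 min -> {minutes:.1f} min ({saved:.0%} less time)")
```

Note that a 2.6x speedup removes roughly 62% of the training time, not 160%: the factor and the percentage saved are different quantities.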

Technologies & Tools

Hardware

NVIDIA Blackwell

New architecture designed to enhance performance in AI/ML workloads.

Hardware

NVIDIA GB200 NVL72

Rack-scale system utilizing Blackwell GPUs for optimized training.

Software

cuDNN

Library optimized for deep learning operations, including enhancements for LLMs.

Software

cuBLAS

Library for linear algebra operations, optimized for the Blackwell architecture.

Software

CUDA Graphs

CUDA feature that captures a sequence of GPU operations as a graph and launches it as a single unit, reducing per-kernel launch overhead.
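As a rough illustration of why CUDA Graphs can help, the sketch below compares a step made of many short kernels launched one by one against the same kernels replayed as one graph. It is a deliberately crude cost model, not a measurement and not NVIDIA's method: it assumes the GPU cannot start a kernel before the CPU has issued it, so in eager mode each kernel costs at least one launch interval. All timings are hypothetical.

```python
def eager_step_us(kernel_us, launch_us):
    """Eager launches: each kernel costs at least one CPU launch interval."""
    return sum(max(k, launch_us) for k in kernel_us)


def graph_step_us(kernel_us, graph_launch_us):
    """Graph replay: one launch for the whole captured sequence."""
    return graph_launch_us + sum(kernel_us)


# Hypothetical step: 1000 short kernels of 5 us each, 10 us per launch.
kernels = [5.0] * 1000
eager = eager_step_us(kernels, launch_us=10.0)
graph = graph_step_us(kernels, graph_launch_us=10.0)
print(f"eager: {eager:.0f} us, graph: {graph:.0f} us")
```

Under this model the benefit comes entirely from kernels shorter than the launch interval. When every kernel runs much longer than a launch, the two estimates are nearly identical, which matches the usual guidance that graphs matter most for launch-bound workloads.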

Key Actionable Insights

1
Utilizing the new second-generation Transformer Engine can significantly enhance training performance for large models.
This engine optimizes the processing of transformer layers, allowing for faster training times and better resource utilization, especially in large-scale deployments.

2
Implementing CUDA Graphs can reduce kernel launch overhead and improve execution efficiency during model training.
By capturing a sequence of GPU operations once and replaying it with a single launch, CUDA Graphs cut per-kernel CPU overhead, which is particularly beneficial when scaling across multiple GPUs.

3
Adopting model parallelism strategies can lead to better performance in distributed training scenarios.
By optimizing the mapping of parallel tasks, organizations can mitigate communication overhead and enhance the overall training throughput.
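The mapping point in the third insight can be made concrete by checking which parallel groups stay inside one NVLink domain. The sketch below assumes a common rank layout (tensor-parallel ranks innermost, then pipeline, then data parallel) and hypothetical sizes (144 GPUs, tensor parallel 8, pipeline parallel 9). None of these values come from the MLPerf submission. The domain sizes 8 and 72 stand for an 8-GPU node and a GB200 NVL72 rack.

```python
from itertools import product


def rank_of(dp, pp, tp, pp_size, tp_size):
    """Global rank of a (dp, pp, tp) coordinate, with tp varying fastest."""
    return (dp * pp_size + pp) * tp_size + tp


def within_domain(group, domain_size):
    """True if every rank in the group sits in the same NVLink domain."""
    return len({rank // domain_size for rank in group}) == 1


def fraction_local(world, tp_size, pp_size, domain_size, axis):
    """Share of groups along `axis` ("tp" or "pp") that never leave a domain."""
    dp_size = world // (tp_size * pp_size)
    groups = []
    if axis == "tp":
        for dp, pp in product(range(dp_size), range(pp_size)):
            groups.append([rank_of(dp, pp, tp, pp_size, tp_size)
                           for tp in range(tp_size)])
    else:
        for dp, tp in product(range(dp_size), range(tp_size)):
            groups.append([rank_of(dp, pp, tp, pp_size, tp_size)
                           for pp in range(pp_size)])
    return sum(within_domain(g, domain_size) for g in groups) / len(groups)


# Hypothetical job: 144 GPUs, tensor parallel 8, pipeline parallel 9.
for domain in (8, 72):
    tp_local = fraction_local(144, 8, 9, domain, "tp")
    pp_local = fraction_local(144, 8, 9, domain, "pp")
    print(f"domain={domain}: tp local {tp_local:.0%}, pp local {pp_local:.0%}")
```

With an 8-GPU domain only the tensor-parallel groups stay on NVLink in this layout, while a 72-GPU domain also keeps every pipeline group local. This is the kind of freedom a larger NVLink domain gives when choosing a parallelism mapping.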

Common Pitfalls

1

Failing to optimize GPU utilization can lead to suboptimal training performance.

Without proper management of GPU resources and communication overhead, training times can significantly increase, hindering deployment timelines.

2

Neglecting to implement model parallelism can restrict scalability.

As model sizes grow, a model may no longer fit in a single GPU's memory. Without model parallelism, or with a poorly mapped parallel layout, the result is inefficient resource use and excessive communication overhead, which limits the effectiveness of distributed training.

Related Concepts

Large Language Models (LLMs)

Machine Learning Performance Benchmarks

NVIDIA GPU Architectures

Model Parallelism Techniques