Breaking MLPerf Training Records with NVIDIA H100 GPUs

Ashraf Eassa

In MLPerf Training v3.0, the NVIDIA AI platform powered by the NVIDIA H100 Tensor Core GPU set new performance records.

NVIDIA


Tags: BERT, Embedding, GPT, JSON, Multi-Head Attention, PyTorch, ResNet, Transformer, U-Net

Overview

The article discusses how NVIDIA's H100 Tensor Core GPUs achieved record-breaking performance in the MLPerf Training v3.0 benchmarks, showcasing advancements in AI model training across various workloads. It highlights the improvements in speed and efficiency for training large language models and other AI applications, emphasizing the significance of the NVIDIA AI platform.

What You'll Learn

1. How to optimize AI training workloads using NVIDIA H100 GPUs
2. Why the NVIDIA AI platform is crucial for achieving high performance in MLPerf benchmarks
3. When to apply specific software optimizations for large language models

Prerequisites & Requirements

Understanding of AI model training and performance benchmarks
Familiarity with NVIDIA software libraries like cuDNN and TensorRT (optional)

Key Questions Answered

What records did NVIDIA H100 GPUs achieve in MLPerf Training v3.0?

NVIDIA H100 GPUs set new performance records in MLPerf Training v3.0, delivering the highest per-accelerator performance and the fastest time to train on every benchmark. The results include a 3.1x performance increase over the previous-generation A100 GPUs and shorter training times across the workloads.

How did NVIDIA improve performance for the BERT NLP workload?

NVIDIA improved the per-accelerator performance on the BERT NLP workload by 17% compared to the previous submission. Key optimizations included introducing FP8 I/O support in the cuDNN library and overlapping data preprocessing with computations to reduce iteration time.
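The idea of overlapping data preprocessing with computation can be illustrated with a small, framework-free sketch. This is not the code used in the MLPerf submission: `preprocess`, `train_step`, and `prefetching_batches` are hypothetical stand-ins, and the `time.sleep` calls only simulate work. A background thread prepares upcoming batches while the main loop consumes the current one, so preprocessing time is hidden behind compute rather than added to each iteration.

```python
import queue
import threading
import time


def preprocess(batch_id):
    """Stand-in for CPU-side data preparation (hypothetical workload)."""
    time.sleep(0.01)
    return [batch_id] * 4


def train_step(batch):
    """Stand-in for the accelerator-side compute of one iteration."""
    time.sleep(0.01)
    return sum(batch)


def prefetching_batches(num_batches, depth=2):
    """Yield batches while a worker thread prepares the next ones."""
    buffer = queue.Queue(maxsize=depth)  # bounded, so the worker cannot run far ahead
    sentinel = object()                  # marks the end of the stream

    def worker():
        for batch_id in range(num_batches):
            buffer.put(preprocess(batch_id))
        buffer.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def run_serial(num_batches):
    """Preprocess, then compute, one batch at a time (no overlap)."""
    return [train_step(preprocess(i)) for i in range(num_batches)]


def run_overlapped(num_batches):
    """Compute on batch i while the worker preprocesses later batches."""
    return [train_step(batch) for batch in prefetching_batches(num_batches)]
```

Both functions produce the same results; the overlapped version simply spends less wall-clock time per iteration when preprocessing and compute take comparable time. Real training pipelines typically get the same effect from their data loader rather than from hand-written threads.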

What are the key software optimizations used in the MLPerf submissions?

The MLPerf submissions utilized several optimizations, including faster GroupBatchNorm kernels, improved convolution kernels in cuDNN, and enhancements in random number generation. These optimizations contributed to significant performance gains across various AI workloads.

What is the significance of the new DLRM_DCNv2 benchmark?

The DLRM_DCNv2 benchmark replaces the previous DLRM benchmark. It trains on a multi-hot dataset, adds a cross layer to the model, and uses the Adagrad optimizer. These changes are intended to better reflect how recommenders are used in real-world applications.
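To make the two new ingredients concrete, here is a minimal pure-Python sketch. It assumes sum pooling for multi-hot features and the commonly cited DCN v2 cross-layer form, x_next = x0 * (W x + b) + x, with element-wise multiplication. The helper names `pool_multi_hot` and `cross_layer` are illustrative, and details of the actual MLPerf reference model may differ.

```python
def pool_multi_hot(table, ids):
    """Sum the embedding rows selected by one multi-hot categorical feature.

    A one-hot feature looks up a single row; a multi-hot feature carries
    several ids per sample, so the rows are pooled (summed here, as an assumption).
    """
    dim = len(table[0])
    return [sum(table[i][d] for i in ids) for d in range(dim)]


def cross_layer(x0, x, weight, bias):
    """One cross layer: x_next = x0 * (W @ x + b) + x, element-wise product."""
    wx_plus_b = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]
    return [a * c + v for a, c, v in zip(x0, wx_plus_b, x)]


# Tiny worked example with made-up numbers.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = pool_multi_hot(embedding_table, [0, 2])   # rows 0 and 2 summed

identity = [[1.0, 0.0], [0.0, 1.0]]
crossed = cross_layer([1.0, 2.0], [1.0, 2.0], identity, [0.0, 0.0])
```

The cross layer multiplies the original input `x0` back into each layer's output, which lets the model represent explicit feature interactions without a deep stack of dense layers.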

Key Statistics & Figures

Performance increase of H100 over A100

3.1x

This performance increase was observed in the MLPerf Training v3.0 submissions.

Time to train BERT

0.134 minutes

Approximately 8 seconds.

Time to train large language model (GPT-3)

10.9 minutes

This was achieved using 3,584 H100 GPUs in a joint submission with CoreWeave.
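The figures above can be sanity-checked with a few lines of arithmetic. The sketch below converts the BERT time to seconds and estimates the aggregate accelerator time for the GPT-3 run, assuming all 3,584 GPUs were busy for the full 10.9 minutes. The helper names are illustrative, and the GPU-hour figure is derived here rather than reported in the article.

```python
def minutes_to_seconds(minutes):
    """Convert a time-to-train figure from minutes to seconds."""
    return minutes * 60.0


def gpu_hours(minutes, num_gpus):
    """Aggregate accelerator time, assuming every GPU runs for the whole job."""
    return minutes * num_gpus / 60.0


bert_seconds = minutes_to_seconds(0.134)    # about 8 seconds
gpt3_gpu_hours = gpu_hours(10.9, 3584)      # roughly 651 GPU-hours

print(f"BERT: {bert_seconds:.2f} s")
print(f"GPT-3: {gpt3_gpu_hours:.1f} GPU-hours")
```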

Technologies & Tools

Hardware

NVIDIA H100 Tensor Core GPU

Used for high-performance AI model training in MLPerf benchmarks.

Software

NVIDIA NeMo Framework

Utilized for training large language models.

Software

NVIDIA cuDNN

Provides optimized deep learning routines for training efficiency.

Software

NVIDIA Data Loading Library (DALI)

Accelerates data loading and preprocessing for deep learning applications.

Key Actionable Insights

1
Leverage the NVIDIA H100 GPUs for training large-scale AI models to achieve faster time-to-train results.
Utilizing the H100 GPUs can significantly reduce training times for complex models like GPT-3, allowing for quicker deployment of AI applications and improved time to value.

2
Implement software optimizations such as FP8 precision and overlapping data preprocessing to enhance performance.
These optimizations can lead to substantial improvements in training efficiency, particularly for NLP models like BERT, where every millisecond counts in large-scale training scenarios.

3
Consider using the NVIDIA Data Loading Library (DALI) for efficient data preprocessing in deep learning workflows.
DALI can help minimize overhead during training by streamlining data loading and preprocessing, which is crucial for maintaining high throughput in large-scale AI applications.

Common Pitfalls

1

Neglecting to optimize data preprocessing can lead to significant overhead during training.

Many practitioners overlook the importance of efficient data handling, which can bottleneck the training process. Implementing libraries like DALI can mitigate these issues.

2

Failing to leverage FP8 precision in model training may result in suboptimal performance.

Without using FP8, models may not fully utilize the capabilities of the H100 GPUs, leading to longer training times and reduced efficiency.
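To give a feel for what FP8 trades away, the sketch below simulates rounding to the E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448) with a single per-tensor scale factor. It is a pure-Python illustration, not cuDNN or any NVIDIA library code: `quantize_e4m3` and `fp8_roundtrip` are hypothetical helpers, and the scaling rule (map the largest magnitude to 448) is an assumption chosen for simplicity.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value
E4M3_MIN_EXP = -6    # exponent of the smallest normal value
MANTISSA_BITS = 3


def quantize_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    # Clamping the exponent makes tiny inputs fall onto the subnormal grid.
    exp = max(math.frexp(a)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    # Python's round() breaks ties to even, matching round-to-nearest-even.
    return sign * min(round(a / step) * step, E4M3_MAX)


def fp8_roundtrip(values):
    """Scale values into the E4M3 range, quantize, then scale back."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = E4M3_MAX / amax   # assumed rule: largest magnitude maps to 448
    return [quantize_e4m3(v * scale) / scale for v in values]


weights = [0.01, -0.5, 3.0]
restored = fp8_roundtrip(weights)
```

With only three mantissa bits, each value in the normal range keeps a relative error of at most 2^-4 (about 6%). That is why FP8 training relies on careful scaling and on keeping sensitive operations in higher precision, in exchange for smaller tensors and higher math throughput.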

Related Concepts

AI Model Training Techniques

Performance Benchmarking in AI

Large Language Models and Their Training

Optimizations in Deep Learning Frameworks