MLPerf v1.0 Training Benchmarks: Insights into a Record-Setting NVIDIA Performance

Learn about some of the major optimizations made to the NVIDIA platform that contributed to the nearly 7x increase in performance since the first MLPerf…

Vinh Nguyen

Overview

The article discusses the MLPerf v1.0 training benchmarks, highlighting NVIDIA's record-setting performance across various AI workloads. It details the improvements made over previous submissions, the innovative techniques employed, and the specific benchmarks where NVIDIA excelled.

What You'll Learn

1. How to utilize CUDA Graphs for optimizing neural network training
2. Why SHARP improves interconnect bandwidth in distributed training
3. How to implement hybrid embedding for efficient DLRM training
4. When to apply spatial parallelism in image segmentation tasks
5. How to leverage asynchronous evaluation to speed up model training

Key Questions Answered

What performance records did NVIDIA set in MLPerf v1.0?
NVIDIA set 16 performance records in MLPerf v1.0 in the commercially available solutions category: eight on a per-chip basis and eight for at-scale training. Compared with its previous submissions, performance improved by up to 2.1x chip-to-chip and up to 3.5x at scale.
How does NVIDIA's DLRM submission improve recommendation systems?
NVIDIA's DLRM submission utilizes HugeCTR, a GPU-accelerated recommendation framework, incorporating optimizations like hybrid embedding, optimized collectives, and a whole-iteration CUDA graph. These enhancements allow for efficient scaling and improved performance in recommendation tasks.
What innovations did NVIDIA implement for BERT training?
For BERT training, NVIDIA implemented several innovations including fused multihead attention, distributed LAMB optimization, and synchronization-free training. These optimizations significantly reduced overhead and improved training efficiency, resulting in a 3.3x speedup compared to the previous submission.
What techniques were used to optimize the 3D U-Net workload?
The 3D U-Net workload was optimized using spatial parallelism to split images across multiple GPUs, asynchronous evaluation to hide evaluation time, and caching datasets in GPU memory to reduce I/O bottlenecks. These techniques collectively improved training efficiency and scalability.
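The spatial parallelism mentioned in the 3D U-Net answer above can be illustrated with a minimal, CPU-only sketch. It makes several assumptions that are not in the source: a 1D signal stands in for a 3D volume, workers are simulated with a loop, and shard boundaries are handled by giving each shard a small overlap (a "halo") of neighbouring values. The function names are hypothetical, and this is not NVIDIA's implementation.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 'same' 1D correlation computed on a single device.

    Assumes an odd-length kernel.
    """
    r = len(kernel) // 2
    padded = [0.0] * r + list(signal) + [0.0] * r
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def conv1d_spatially_parallel(signal, kernel, num_workers):
    """Split the signal into contiguous shards, one per simulated worker.

    Each shard is extended with a halo of r neighbouring values on both
    sides, so outputs near a shard boundary match the single-device result.
    """
    r = len(kernel) // 2
    n = len(signal)
    padded = [0.0] * r + list(signal) + [0.0] * r
    shard = -(-n // num_workers)  # ceiling division
    outputs = []
    for w in range(num_workers):
        start, stop = w * shard, min((w + 1) * shard, n)
        if start >= stop:
            continue  # more workers than elements: this worker gets nothing
        # Signal index i sits at padded index i + r, so the shard plus its
        # halo spans padded[start : stop + 2 * r].
        local = padded[start : stop + 2 * r]
        outputs.extend(
            sum(kernel[j] * local[i + j] for j in range(len(kernel)))
            for i in range(stop - start)
        )
    return outputs


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 0.0, -1.0]
# The sharded result should be identical to the single-device result.
print(conv1d_spatially_parallel(signal, kernel, 3) == conv1d_same(signal, kernel))
```

On real hardware the halo would be exchanged between neighbouring GPUs rather than sliced from a shared list, so the cost of that exchange has to be weighed against the compute saved per GPU.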

Key Statistics & Figures

Performance improvement at scale: 3.5x
This improvement was achieved compared to NVIDIA's previous MLPerf v0.7 submissions.
Number of performance records set: 16
NVIDIA set 8 records on a per-chip basis and 8 at-scale training records.
Speedup achieved in DLRM training: 3.3x
This speedup was realized on 14 DGX-A100 nodes compared to the previous submission.
Memory bandwidth increase for A100 GPUs: 30%
This increase was due to the new HBM2e GPU memory.

Technologies & Tools

Software
CUDA Graphs
Used to optimize kernel launches and reduce CPU overhead in training.
Networking
SHARP
Enhances interconnect bandwidth and offloads collective operations.
Software
HugeCTR
A GPU-accelerated recommendation framework used in DLRM submissions.
Hardware
NVIDIA A100 Tensor Core GPU
The main hardware used for achieving record performance in MLPerf benchmarks.
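For the SHARP entry above, one common way to reason about why offloading collectives to the network can raise effective bandwidth is to count the bytes each node must send during an all-reduce. The sketch below compares a ring all-reduce with an idealized in-network reduction. It is back-of-the-envelope arithmetic under stated assumptions (every node sends its buffer once and receives the result once; latency, topology, and message size effects are ignored), not a description of how SHARP is implemented. The function names and the 100 MB buffer size are hypothetical.

```python
def ring_allreduce_bytes_sent(message_bytes, num_nodes):
    """Bytes each node sends in a ring all-reduce.

    A ring all-reduce is a reduce-scatter followed by an all-gather; each
    phase moves (num_nodes - 1) / num_nodes of the buffer per node.
    """
    return 2 * (num_nodes - 1) / num_nodes * message_bytes


def in_network_allreduce_bytes_sent(message_bytes):
    """Bytes each node sends when the network performs the reduction.

    Assumption: each node sends its buffer up once and receives the
    reduced result back once.
    """
    return message_bytes


def traffic_ratio(message_bytes, num_nodes):
    """How many times more data a node sends with the ring algorithm."""
    return (ring_allreduce_bytes_sent(message_bytes, num_nodes)
            / in_network_allreduce_bytes_sent(message_bytes))


gradient_bytes = 100_000_000  # hypothetical 100 MB gradient buffer
for nodes in (2, 8, 64):
    print(nodes, traffic_ratio(gradient_bytes, nodes))
```

Under these assumptions the ratio approaches 2 as the node count grows, which is consistent with the "double the effective interconnect bandwidth" figure quoted in this summary.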

Key Actionable Insights

1. Implementing CUDA Graphs can significantly reduce kernel launch overhead in training deep learning models.
This technique is particularly beneficial for workloads with small batch sizes, where CPU overhead can become a bottleneck. By capturing the entire training iteration in a single graph, you can streamline execution and enhance performance.
2. Utilizing SHARP can double the effective interconnect bandwidth between nodes in distributed training environments.
This is crucial for large-scale AI workloads where communication overhead can hinder performance. Offloading collective operations to the network fabric can lead to more efficient data handling and faster training times.
3. Adopting hybrid embedding techniques in recommendation systems can drastically reduce communication overhead.
By deduplicating categories and optimizing gradient exchanges, you can improve the efficiency of distributed training, especially in scenarios with high category variance.
4. Asynchronous evaluation can shorten model training cycles by hiding evaluation time behind training.
By running evaluations concurrently with training and caching datasets in GPU memory, you can minimize idle time and maintain high throughput during training iterations.
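The asynchronous evaluation pattern in the last insight can be sketched with a background worker. This is a minimal illustration, assuming a CPU thread stands in for whatever resources run evaluation in a real system; `train_step`, `evaluate`, and `train_with_async_eval` are hypothetical placeholders, not functions from any NVIDIA library.

```python
from concurrent.futures import ThreadPoolExecutor
import copy


def train_step(weights):
    """Hypothetical stand-in for one training iteration."""
    return [w + 1.0 for w in weights]


def evaluate(snapshot):
    """Hypothetical stand-in for a full evaluation pass; returns a score."""
    return sum(snapshot) / len(snapshot)


def train_with_async_eval(num_steps, eval_every, target):
    """Train while evaluations run in the background.

    Returns (step whose snapshot first reached the target, step at which
    training stopped). The second value depends on timing, because training
    keeps running until a finished evaluation reports success.
    """
    weights = [0.0, 0.0]
    pending = []  # (step, future) pairs for evaluations still in flight
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(1, num_steps + 1):
            weights = train_step(weights)
            if step % eval_every == 0:
                # Evaluate a frozen copy so training can keep updating weights.
                snapshot = copy.copy(weights)
                pending.append((step, pool.submit(evaluate, snapshot)))
            # Collect finished evaluations, oldest first, without blocking.
            while pending and pending[0][1].done():
                eval_step, future = pending.pop(0)
                if future.result() >= target:
                    return eval_step, step
        # Training budget exhausted: wait for the remaining evaluations.
        for eval_step, future in pending:
            if future.result() >= target:
                return eval_step, num_steps
    return None, num_steps


print(train_with_async_eval(num_steps=20, eval_every=2, target=6.0))
```

The trade-off shown here is that training may run a few extra steps past the point where the target was reached, in exchange for never stalling while an evaluation is in progress.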

Common Pitfalls

1. Failing to optimize data loading can lead to significant bottlenecks in training performance.
As training gets faster, I/O operations become the limiting factor. Implementing a hybrid dataloader that uses the GPU for augmentations can alleviate this issue.
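A rough way to see why data loading turns into the bottleneck is to treat the input pipeline and the training step as overlapped stages whose steady-state throughput is set by the slowest one. The sketch below uses made-up rates purely for illustration. The function name and all numbers are hypothetical, and it assumes that moving augmentation to the GPU does not slow the training step itself.

```python
def pipeline_throughput(load_rate, augment_rate, train_rate):
    """Steady-state samples/s of fully overlapped stages: the slowest stage."""
    return min(load_rate, augment_rate, train_rate)


# Hypothetical rates in samples per second.
LOAD = 10_000
CPU_AUGMENT = 3_000
GPU_AUGMENT = 12_000

# Before training is optimized, compute is the slowest stage.
print(pipeline_throughput(LOAD, CPU_AUGMENT, train_rate=2_500))
# After training gets faster, CPU augmentation caps the pipeline.
print(pipeline_throughput(LOAD, CPU_AUGMENT, train_rate=6_000))
# Moving augmentation to the GPU lets the faster training rate show through.
print(pipeline_throughput(LOAD, GPU_AUGMENT, train_rate=6_000))
```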