The Full Stack Optimization Powering NVIDIA MLPerf Training v2.0 Performance

Ashraf Eassa

Learn about the full-stack optimizations enabling NVIDIA platforms to deliver even more performance in MLPerf Training v2.0.

NVIDIA

BERT, Natural Language Processing, Python, PyTorch, Reinforcement Learning, ResNet, Transformers, U-Net

Overview

This article covers NVIDIA's MLPerf Training v2.0 submissions and the full-stack optimizations behind their performance across a range of AI workloads. It details specific training-time improvements on popular neural network tasks, which help organizations get more out of their infrastructure investments.

What You'll Learn

1. How to optimize BERT training using sequence packing and layer fusion techniques
2. Why overlapping computation and communication enhances GPU utilization in deep learning models
3. How to implement asynchronous scoring to reduce evaluation overhead in large-scale training
4. When to apply CUDA Graphs for performance improvements in neural network training

Prerequisites & Requirements

Understanding of deep learning concepts and frameworks
Familiarity with NVIDIA cuDNN and CUDA Graphs (optional)

Key Questions Answered

What performance improvements were achieved in MLPerf v2.0 compared to previous versions?

NVIDIA's MLPerf Training v2.0 submissions delivered gains of up to 2.1x on a per-chip basis and up to 5.7x for max-scale training compared to MLPerf v0.7. The gains span a variety of neural network tasks.

How does sequence packing improve BERT training efficiency?

Sequence packing combines multiple short sequences into a single training sample, reducing padding overhead and improving GPU utilization. The method requires knowing the length distribution of the training set in advance, and it has been shown to improve performance by 10% in large-scale runs.
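The packing idea can be illustrated with a minimal sketch. It assumes a simple greedy first-fit-decreasing strategy and a maximum sample length of 512 tokens; both are illustrative choices, not the method used in the NVIDIA submission, and the function names `pack_sequences` and `padding_fraction` are hypothetical. A real implementation would also have to keep packed sequences from attending to one another, which this sketch leaves out.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int = 512) -> List[List[int]]:
    """Greedy first-fit-decreasing packing (an illustrative strategy).

    Returns a list of packs, each holding the indices of the sequences
    that share one training sample of at most max_len tokens.
    """
    # Place the longest sequences first; short ones then fill the gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs: List[List[int]] = []
    free: List[int] = []  # remaining token budget of each pack
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has {n} tokens, more than {max_len}")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(i)
                free[p] -= n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([i])
            free.append(max_len - n)
    return packs


def padding_fraction(lengths: List[int], num_samples: int, max_len: int = 512) -> float:
    """Fraction of token slots that hold padding rather than real tokens."""
    return 1.0 - sum(lengths) / (num_samples * max_len)


# Made-up sequence lengths standing in for a known length distribution.
lengths = [500, 120, 380, 60, 12, 256, 250, 40]
packs = pack_sequences(lengths)

print("packs:", packs)
print("padding without packing:", round(padding_fraction(lengths, len(lengths)), 3))
print("padding with packing:   ", round(padding_fraction(lengths, len(packs)), 3))
```

With these made-up lengths, eight padded samples shrink to four packed ones, so far fewer token slots are spent on padding. The lengths are needed up front, which is why the technique depends on knowing the training set's length distribution.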

What optimizations were made to the ResNet-50 training configuration?

For ResNet-50, NVIDIA used a global batch size of 67,456 across 527 nodes. This batch size was chosen to align with the dataset size, which minimizes the computation wasted on extra samples added to fill out the last batch of an epoch. The adjustment led to a performance boost of 3.5% compared to previous submissions.
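The arithmetic behind that alignment can be checked with a short sketch. It rests on assumptions that this summary does not state: that the dataset is the ImageNet-1k training set with 1,281,167 images, that each node is a DGX A100 with 8 GPUs, and that the last partial batch of an epoch is padded up to a full batch. The comparison batch size of 65,536 is purely hypothetical.

```python
import math

# Assumptions (not stated in this summary): ImageNet-1k training set size,
# and 8 GPUs per node as in a DGX A100 system.
IMAGENET_TRAIN_IMAGES = 1_281_167
GPUS_PER_NODE = 8

NODES = 527            # from the article
GLOBAL_BATCH = 67_456  # from the article


def epoch_padding(dataset_size: int, global_batch: int) -> tuple:
    """Return (steps per epoch, samples added to fill the final batch)."""
    steps = math.ceil(dataset_size / global_batch)
    padded = steps * global_batch - dataset_size
    return steps, padded


per_gpu_batch = GLOBAL_BATCH // (NODES * GPUS_PER_NODE)
steps, padded = epoch_padding(IMAGENET_TRAIN_IMAGES, GLOBAL_BATCH)

# Hypothetical alternative batch size, for comparison only.
alt_steps, alt_padded = epoch_padding(IMAGENET_TRAIN_IMAGES, 65_536)

print(f"per-GPU batch: {per_gpu_batch}")
print(f"batch 67,456: {steps} steps, {padded} padded samples per epoch")
print(f"batch 65,536: {alt_steps} steps, {alt_padded} padded samples per epoch")
```

Under these assumptions, the chosen batch size pads each epoch with only a few hundred samples, while the hypothetical power-of-two batch size needs an extra step and tens of thousands of padded samples.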

What role does asynchronous scoring play in the RetinaNet submission?

Asynchronous scoring in the RetinaNet submission allows the next training epoch to proceed while the previous epoch's scoring is still being processed. This approach mitigates the overhead associated with scoring large datasets, improving overall training efficiency.
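The overlap can be sketched with a background worker. This is a minimal illustration and not the submission's actual mechanism: `train_one_epoch` and `score` are hypothetical stand-ins that only sleep, and a single-thread executor takes the place of whatever process or device did the real scoring.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def train_one_epoch(epoch: int) -> dict:
    """Hypothetical stand-in for one training epoch; returns a checkpoint."""
    time.sleep(0.01)
    return {"epoch": epoch}


def score(checkpoint: dict) -> int:
    """Hypothetical stand-in for a slow evaluation pass over a checkpoint."""
    time.sleep(0.01)
    return checkpoint["epoch"] + 1  # made-up metric that improves each epoch


def train_with_async_scoring(num_epochs: int, target: int):
    """Train while the previous epoch's scoring runs in the background.

    Returns the first epoch whose score reached the target, or None.
    """
    pending = None  # (epoch, future) for the score still in flight
    with ThreadPoolExecutor(max_workers=1) as pool:
        for epoch in range(num_epochs):
            # This epoch trains while the previous epoch is being scored.
            checkpoint = train_one_epoch(epoch)
            if pending is not None:
                scored_epoch, future = pending
                if future.result() >= target:
                    return scored_epoch
            pending = (epoch, pool.submit(score, checkpoint))
        if pending is None:
            return None
        scored_epoch, future = pending
        return scored_epoch if future.result() >= target else None


converged_at = train_with_async_scoring(num_epochs=10, target=3)
print("target reached at epoch", converged_at)
```

The design trade-off is that the training loop learns about convergence one epoch late: in this sketch, epoch 3 has already been trained by the time the score for epoch 2 arrives. When scoring is expensive, that extra epoch can cost less than stalling training after every epoch.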

Key Statistics & Figures

Max-Scale Time to Train for Recommendation (DLRM)

0.59 minutes

This represents a 5.66x improvement over previous submissions.

Per-Accelerator Time to Train for BERT

126.95 minutes

This shows a 2.69x improvement compared to MLPerf v0.7.

Performance improvement for ResNet-50

3.5%

This improvement was achieved by optimizing the max-scale training configuration.

Technologies & Tools

Hardware

NVIDIA A100 Tensor Core GPU

Used for MLPerf v2.0 submissions to achieve performance improvements.

Hardware

NVIDIA DGX A100 System

Reference architecture for MLPerf submissions.

Software

NVIDIA cuDNN

Used for optimizing deep learning model training performance.

Software

CUDA Graphs

Utilized to enhance performance by reducing CPU overhead.

Software

NVIDIA Merlin HugeCTR

An optimized framework for training deep learning recommendation models.

Key Actionable Insights

1. Implement sequence packing in your BERT training workflows to reduce padding overhead.
   Merging sequences of varying lengths into single training samples improves GPU utilization and shortens training times, especially in large-scale scenarios.

2. Use asynchronous scoring to minimize evaluation overhead during training.
   Training continues while evaluations are processed, which matters most in large-scale runs where scoring is time-consuming.

3. Adopt CUDA Graphs to streamline the execution of neural network training.
   Capturing the model's forward and backward passes in graphs reduces CPU overhead and improves GPU utilization, leading to faster training times.

Common Pitfalls

1. Failing to optimize batch sizes can lead to wasted computation during training.
   If the global batch size does not align with the dataset size, additional data has to be added to keep batches consistent, and the computation spent on that extra data is wasted. Calculate a batch size that fits the dataset size to avoid this.

Related Concepts

Deep Learning Optimization Techniques

CUDA Programming

Performance Benchmarking in AI