Boosting NVIDIA MLPerf Training v1.1 Performance with Full Stack Optimization

In MLPerf Training v1.1, we optimized across the entire stack, including hardware, system software, libraries, and algorithms.

Vinh Nguyen
21 min read · advanced

Overview

This article covers the performance improvements NVIDIA achieved in the MLPerf Training v1.1 benchmark through full-stack optimization. It highlights advances in hardware and software integration that produced gains of up to 2.1x chip-to-chip and up to 5.3x at max scale over MLPerf v0.7 across a range of AI workloads.

What You'll Learn

1. How to leverage CUDA Graphs for improved performance in AI workloads
2. Why fine-grained overlapping of computations enhances GPU utilization
3. How to optimize NCCL for better inter-GPU communication
4. When to apply kernel fusion techniques in deep learning models

Prerequisites & Requirements

  • Understanding of GPU architecture and deep learning frameworks
  • Familiarity with NVIDIA software tools like NCCL and CUDA (optional)

Key Questions Answered

What performance improvements were achieved in MLPerf v1.1 compared to previous versions?
In MLPerf v1.1, NVIDIA reported improvements of up to 2.1x on a chip-to-chip basis and up to 5.3x for max-scale training compared to MLPerf v0.7, across various AI workloads.
How do CUDA Graphs enhance performance in AI training?
CUDA Graphs allow an entire training iteration to be captured as a single graph, which minimizes communication with the CPU during training. This optimization led to performance gains of up to 6% in workloads like ResNet-50 and BERT.
What optimizations were made to NCCL in this round?
NCCL introduced user buffer registration, which avoids unnecessary data copying during communication. In addition, fusing scaling operations into the communication kernels yielded a further ~3% end-to-end savings in communication-heavy networks like BERT.
When should kernel fusion be applied in deep learning models?
Kernel fusion should be applied when multiple operations can be combined into a single kernel, so that intermediate data makes fewer round trips to memory. In MLPerf v1.1, fusing bias gradient reduction into matrix multiplication kernels resulted in performance improvements of up to 3%.
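
As a rough illustration of the last answer, the pure-Python sketch below contrasts an unfused backward pass for a linear layer (one pass for the weight gradient, a second pass for the bias gradient) with a fused one that accumulates both while reading the output gradient only once. The function names and the read counter are hypothetical and exist only for illustration; the actual MLPerf v1.1 kernels are GPU matrix multiplication kernels, not Python loops.

```python
def backward_unfused(x, dy):
    """Weight and bias gradients computed in two separate passes over dy."""
    batch, n_in, n_out = len(x), len(x[0]), len(dy[0])
    reads = 0  # how many dy elements were loaded from "memory"

    # Pass 1: weight gradient dW = x^T @ dy
    dw = [[0.0] * n_out for _ in range(n_in)]
    for b in range(batch):
        for j in range(n_out):
            g = dy[b][j]
            reads += 1
            for i in range(n_in):
                dw[i][j] += x[b][i] * g

    # Pass 2: bias gradient db = column sums of dy (a second trip over dy)
    db = [0.0] * n_out
    for b in range(batch):
        for j in range(n_out):
            db[j] += dy[b][j]
            reads += 1
    return dw, db, reads


def backward_fused(x, dy):
    """Same gradients, but the bias reduction rides along with the matmul loop."""
    batch, n_in, n_out = len(x), len(x[0]), len(dy[0])
    reads = 0
    dw = [[0.0] * n_out for _ in range(n_in)]
    db = [0.0] * n_out
    for b in range(batch):
        for j in range(n_out):
            g = dy[b][j]  # loaded once, used for both gradients
            reads += 1
            db[j] += g
            for i in range(n_in):
                dw[i][j] += x[b][i] * g
    return dw, db, reads


# Hypothetical toy data: batch of 3, 2 inputs, 2 outputs.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
dy = [[0.5, -1.0], [1.5, 2.0], [-0.5, 1.0]]

dw_u, db_u, reads_u = backward_unfused(x, dy)
dw_f, db_f, reads_f = backward_fused(x, dy)
print("same gradients:", dw_u == dw_f and db_u == db_f)
print("dy reads, unfused vs fused:", reads_u, reads_f)
```

Both versions produce the same gradients; the fused version simply halves the number of times the output gradient is read, which is the kind of saving fusion targets.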

Key Statistics & Figures

Max-Scale Training Improvement: up to 5.3x (compared to MLPerf v0.7 submissions)
Chip-to-Chip Improvement: up to 2.1x (compared to MLPerf v0.7 submissions)
Performance Gain from CUDA Graphs: up to 6% (observed in ResNet-50 and BERT workloads)
End-to-End Savings from NCCL Optimizations: ~3% (in communication-heavy networks like BERT)
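
To put these figures in terms of time-to-train, the short calculation below converts each speedup into the fraction of the original training time that remains. It assumes the "6%" CUDA Graphs figure is a throughput gain and the "~3%" NCCL figure is a direct time saving; the summary does not state this explicitly, so treat the last two lines as an interpretation rather than reported results.

```python
def time_fraction(speedup):
    """Fraction of the original time-to-train that remains after a speedup."""
    return 1.0 / speedup


def speedup_from_gain(gain_pct):
    """Speedup factor implied by a percentage throughput gain (assumed meaning)."""
    return 1.0 + gain_pct / 100.0


chip_to_chip = time_fraction(2.1)                     # up to 2.1x vs v0.7
max_scale = time_fraction(5.3)                        # up to 5.3x vs v0.7
cuda_graphs = time_fraction(speedup_from_gain(6))     # assumed throughput gain
nccl_fusion = 1.0 - 0.03                              # assumed direct time saving

for label, frac in [
    ("chip-to-chip", chip_to_chip),
    ("max-scale", max_scale),
    ("CUDA Graphs", cuda_graphs),
    ("NCCL scaling fusion", nccl_fusion),
]:
    print(f"{label}: {frac:.1%} of the original time")
```

Under these assumptions, a 2.1x speedup leaves about 47.6% of the original time and a 5.3x speedup about 18.9%.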

Technologies & Tools

Hardware: NVIDIA A100. Used as the primary GPU for MLPerf submissions.
Library: NCCL. Optimizes inter-GPU communication.
Library: CUDA Graphs. Enhances performance by capturing entire iterations.
Framework: PyTorch. Used for implementing benchmarks.
Framework: MXNet. Also used for implementing benchmarks.

Key Actionable Insights

1. Leverage CUDA Graphs to optimize training iterations in deep learning models.
   Capturing a full iteration as a single graph reduces CPU overhead and improves throughput, especially in large-scale training scenarios.
2. Implement fine-grained overlapping of computations to maximize GPU utilization.
   This technique lets independent computation tasks run concurrently, which can yield substantial performance improvements, particularly in complex models like Mask R-CNN.
3. Utilize NCCL’s user buffer registration for efficient inter-GPU communication.
   This optimization reduces data copying overhead in multi-GPU setups, which is crucial for scaling deep learning workloads.
4. Apply kernel fusion techniques to reduce memory traffic.
   Fusing operations minimizes the number of round trips to memory, which is particularly beneficial for memory-bound operations.
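
The second insight, overlapping independent work, can be illustrated with a toy scheduling model. The task names, durations, and dependencies below are hypothetical and are not measurements from Mask R-CNN. Serial execution costs the sum of all durations, whereas with unlimited concurrency the run time is bounded below by the longest dependency chain.

```python
from functools import lru_cache

# Hypothetical tasks: name -> (duration in ms, names of tasks it depends on)
TASKS = {
    "backbone": (4.0, ()),
    "box_head": (3.0, ("backbone",)),
    "mask_head": (2.5, ("backbone",)),  # independent of box_head
    "loss": (1.0, ("box_head", "mask_head")),
}


def serial_time(tasks):
    """Total time when every task runs one after another."""
    return sum(duration for duration, _ in tasks.values())


def overlapped_time(tasks):
    """Finish time when each task starts as soon as its dependencies are done."""

    @lru_cache(maxsize=None)
    def finish(name):
        duration, deps = tasks[name]
        start = max((finish(dep) for dep in deps), default=0.0)
        return start + duration

    return max(finish(name) for name in tasks)


print("serial:", serial_time(TASKS), "ms")
print("overlapped:", overlapped_time(TASKS), "ms")
```

In this model the two heads run side by side, so the overlapped schedule is shorter than the serial one. On a real GPU the benefit also depends on whether the concurrent kernels leave enough free resources for each other, which this sketch does not model.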

Common Pitfalls

1. Failing to optimize data communication between GPUs can lead to performance bottlenecks.
   Without efficient communication strategies like those provided by NCCL, overall training time can increase significantly, especially in multi-GPU setups.
2. Not leveraging CUDA Graphs can result in suboptimal GPU utilization.
   If full iterations are not captured in a single graph, CPU-side overhead remains on every step and can degrade performance in large-scale training.
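
For the second pitfall, a minimal sketch of whole-iteration capture using PyTorch's `torch.cuda.CUDAGraph` and `torch.cuda.graph` is shown below. It assumes PyTorch 1.10 or later with a CUDA-capable GPU; the tiny linear model, tensor shapes, and the function name `graphed_training_demo` are placeholders chosen for illustration and are not the MLPerf submission code. The function returns `None` when PyTorch or a GPU is unavailable.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def graphed_training_demo(steps=3):
    """Capture one training iteration in a CUDA graph and replay it."""
    if torch is None or not torch.cuda.is_available():
        return None  # CUDA Graphs need a CUDA build of PyTorch and a GPU

    device = torch.device("cuda")
    model = torch.nn.Linear(64, 8).to(device)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Replays reuse fixed memory addresses, so inputs live in static tensors.
    static_input = torch.randn(32, 64, device=device)
    static_target = torch.randn(32, 8, device=device)

    # Warm up on a side stream before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_input), static_target).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(side)

    # Capture forward, backward, and optimizer step as one graph.
    graph = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(graph):
        static_loss = loss_fn(model(static_input), static_target)
        static_loss.backward()
        optimizer.step()

    for _ in range(steps):
        # Copy fresh data into the captured tensors, then launch the
        # whole iteration with a single replay call.
        static_input.copy_(torch.randn(32, 64, device=device))
        static_target.copy_(torch.randn(32, 8, device=device))
        graph.replay()

    return static_loss.item()


result = graphed_training_demo()
print("last loss:", result)
```

Each `graph.replay()` launches the full iteration from one CPU call instead of one call per kernel, which is where the reduction in CPU overhead comes from.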

Related Concepts

CUDA Optimization Techniques
NVIDIA Hardware Capabilities
Deep Learning Model Performance Tuning