Setting New Records at Data Center Scale Using NVIDIA H100 GPUs and NVIDIA Quantum-2 InfiniBand

Generative AI is rapidly transforming computing, unlocking new use cases and turbocharging existing ones. Large language models (LLMs), such as OpenAI’s GPT…

Overview

The article discusses how NVIDIA's H100 GPUs and Quantum-2 InfiniBand have set new performance records in data center-scale AI training, particularly for Large Language Models (LLMs) and Stable Diffusion workloads. It highlights the significant improvements in training efficiency and scalability achieved through advanced software optimizations and hardware capabilities.

What You'll Learn

1. How to achieve record performance in LLM training using NVIDIA H100 GPUs
2. Why optimizing GPU memory layout can significantly improve training speed
3. How to implement advanced techniques like FlashAttention-2 for better model performance
4. When to apply CUDA graphs to reduce runtime overhead in deep learning models

Prerequisites & Requirements

  • Understanding of Large Language Models and their training requirements
  • Familiarity with the NVIDIA software stack, including NeMo and cuBLAS (optional)

Key Questions Answered

What performance improvements were achieved in LLM training with NVIDIA H100 GPUs?
NVIDIA achieved a time-to-train score of 3.92 minutes using 10,752 H100 GPUs, representing a 2.8x performance boost compared to previous submissions. This was accomplished through software enhancements and increased submission scale, showcasing the GPUs' capability to handle extensive model training efficiently.
How does NVIDIA's Quantum-2 InfiniBand contribute to performance at scale?
NVIDIA Quantum-2 InfiniBand switches, combined with in-network computing through NVIDIA SHARP, accelerated collective operations and helped achieve record performance for various workloads, including DLRM-dcnv2 and BERT-large. This high-bandwidth networking fabric is crucial for efficient data transfer between GPUs in large-scale training scenarios.
What optimizations were made for Stable Diffusion training?
NVIDIA's submission for Stable Diffusion using 1,024 H100 GPUs reduced training time to 2.47 minutes. Key optimizations included using GroupNorm with Channels Last support and FlashAttention-2, which improved performance by 14% and 21%, respectively, demonstrating significant advancements in training efficiency.
What are the benefits of using CUDA graphs in deep learning?
CUDA graphs capture a sequence of GPU operations once and replay it as a single launch, which cuts per-kernel CPU launch overhead and raises GPU utilization. NVIDIA applied this technique in its submissions, resulting in a 4% performance increase for the U-Net model in Stable Diffusion and helping performance at multi-GPU scale.
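To make the launch-overhead argument concrete, here is a rough cost model in plain Python. It is a sketch, not a measurement and not how NVIDIA's submission was implemented. The function name `step_time_us` and every number in it (kernel count, kernel duration, launch cost) are hypothetical values chosen for illustration. The model also assumes launch overhead is fully exposed, which overstates the benefit whenever asynchronous launches already overlap with GPU work.

```python
def step_time_us(num_kernels, kernel_us, launch_us, use_graph):
    """Estimate one training-step time in microseconds under a toy model.

    Without a graph, every kernel pays its own CPU launch overhead.
    With a graph, the captured sequence is launched once as a unit.
    """
    launches = 1 if use_graph else num_kernels
    return num_kernels * kernel_us + launches * launch_us


# Hypothetical workload: 1000 short kernels per step (illustrative only).
eager = step_time_us(num_kernels=1000, kernel_us=20.0, launch_us=5.0, use_graph=False)
graphed = step_time_us(num_kernels=1000, kernel_us=20.0, launch_us=5.0, use_graph=True)
speedup = eager / graphed

print(f"eager:   {eager:.0f} us")
print(f"graphed: {graphed:.0f} us")
print(f"speedup: {speedup:.3f}x")
```

Under this model, the benefit grows as kernels get shorter relative to the launch cost, which is one reason the technique tends to matter most when per-GPU work shrinks at large scale.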

Key Statistics & Figures

Time-to-train for LLM using 10,752 H100 GPUs
3.92 minutes
This score represents a 2.8x performance improvement over previous submissions.
Training time for Stable Diffusion using 1,024 H100 GPUs
2.47 minutes
This is a significant reduction from the previous benchmark of 10.02 minutes using 64 GPUs.
Per-GPU training throughput on H100
797 TFLOPS
This was achieved through software improvements in the latest submissions.
Maximum scale of GPUs used in MLPerf submission
10,752 H100 GPUs
This is the largest number of accelerators ever used in an MLPerf submission.
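The figures above can be combined into a quick scaling check. The sketch below takes the quoted Stable Diffusion times at face value (10.02 minutes on 64 GPUs and 2.47 minutes on 1,024 GPUs) and derives the speedup, the scaling efficiency, and the total GPU-minutes at each scale. It also back-computes the earlier LLM time implied by the 2.8x figure. The helper name `scaling_summary` is an assumption made for illustration. Time-to-train includes fixed costs that do not shrink with GPU count, so treat the efficiency number as a rough indicator only.

```python
def scaling_summary(base_gpus, base_minutes, scaled_gpus, scaled_minutes):
    """Compare two time-to-train results obtained at different GPU counts."""
    speedup = base_minutes / scaled_minutes
    gpu_ratio = scaled_gpus / base_gpus
    return {
        "speedup": speedup,
        "gpu_ratio": gpu_ratio,
        # 1.0 would mean perfectly linear scaling.
        "scaling_efficiency": speedup / gpu_ratio,
        "base_gpu_minutes": base_gpus * base_minutes,
        "scaled_gpu_minutes": scaled_gpus * scaled_minutes,
    }


# Stable Diffusion figures quoted in this summary.
sd = scaling_summary(base_gpus=64, base_minutes=10.02,
                     scaled_gpus=1024, scaled_minutes=2.47)

# Earlier LLM time implied by "3.92 minutes" and "2.8x" (derived, not quoted).
implied_previous_llm_minutes = 3.92 * 2.8

print(f"speedup:            {sd['speedup']:.2f}x on {sd['gpu_ratio']:.0f}x the GPUs")
print(f"scaling efficiency: {sd['scaling_efficiency']:.1%}")
print(f"GPU-minutes:        {sd['base_gpu_minutes']:.2f} -> {sd['scaled_gpu_minutes']:.2f}")
print(f"implied earlier LLM time: {implied_previous_llm_minutes:.2f} minutes")
```

The comparison shows the usual trade-off at extreme scale: wall-clock time drops about fourfold while total GPU-minutes rise, so the larger run buys speed rather than lower total compute.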

Technologies & Tools

Hardware
NVIDIA H100 Tensor Core GPUs
Used for high-performance AI training and benchmarking.
Networking
NVIDIA Quantum-2 InfiniBand
Provides high-bandwidth networking for efficient GPU interconnectivity.
Software
NVIDIA NeMo Framework
Facilitates the development and training of AI models.
Software
NVIDIA cuBLAS
Optimizes linear algebra operations crucial for deep learning.
Software
NVIDIA Transformer Engine
Enhances performance for transformer-based models.

Key Actionable Insights

1. Leverage the latest NVIDIA H100 GPUs for training large models to maximize performance and reduce costs.
   With submissions scaling up to 10,752 GPUs, organizations can achieve unprecedented training speeds, making it feasible to train complex models that were previously impractical.
2. Implement memory layout optimizations such as the channels-last format to enhance training efficiency.
   Channels-last (NHWC) stores all channels of a pixel contiguously. This reduces memory-access overhead and avoids layout conversions around kernels that prefer that format, which matters most in convolutional networks and is critical for high throughput in AI workloads.
3. Utilize advanced techniques like FlashAttention-2 to improve model training speed and efficiency.
   These optimizations can significantly reduce the time spent in self-attention, which is essential for large-scale training scenarios.
4. Adopt CUDA graphs to streamline GPU operations and reduce overhead in multi-GPU setups.
   This technique can lead to better resource utilization and lower training times, making it a valuable strategy for deep learning practitioners.
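For the memory-layout insight above, the difference between the two common 4D tensor layouts comes down to how a logical index (n, c, h, w) maps to a flat memory offset. The sketch below uses plain Python and hypothetical helper names (`offset_nchw`, `offset_nhwc`) to show that in channels-last (NHWC) the channels of one pixel sit next to each other, while in NCHW they are a full image plane apart. It illustrates the indexing only and says nothing about how any particular kernel is implemented.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in a contiguous NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat offset of the same logical element in channels-last (NHWC)."""
    return ((n * H + h) * W + w) * C + c


# Small illustrative shape: 4 channels, 8x8 spatial extent.
C, H, W = 4, 8, 8

# Distance in memory between channel 0 and channel 1 of the same pixel.
channel_stride_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
channel_stride_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

print("channel stride, NCHW:", channel_stride_nchw)  # H * W = 64
print("channel stride, NHWC:", channel_stride_nhwc)  # 1
```

Operations that read every channel of a pixel together touch contiguous memory in NHWC, so keeping a whole network in one layout avoids paying for repeated conversions between the two.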

Common Pitfalls

1. Failing to optimize memory layout can lead to significant performance bottlenecks in model training.
   Many practitioners overlook the importance of memory access patterns, which can severely impact the efficiency of GPU computations, especially in large-scale models.
2. Neglecting the benefits of advanced techniques like FlashAttention-2 may result in suboptimal training speeds.
   Without implementing these optimizations, users may miss out on substantial performance gains that can drastically reduce training times.
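To give a sense of why FlashAttention-style kernels help, the sketch below shows the block-wise "online softmax" idea they build on: attention for one query is accumulated over blocks of keys and values, so the full score row never has to be stored. This is a plain-Python illustration under simplifying assumptions (a single query, no masking, no GPU tiling). It is not FlashAttention-2 itself, and the function names are hypothetical.

```python
import math


def scaled_scores(q, keys):
    """Dot-product scores of one query against keys, scaled by 1/sqrt(d)."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = scaled_scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = scaled_scores(q, k_blk)
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = naive_attention(q, keys, values)
blocked = blockwise_attention(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(reference, blocked)))
```

Because each block only needs the running maximum, running sum, and accumulator, a GPU kernel built on this idea can keep its working set in fast on-chip memory instead of writing the full attention matrix out to device memory.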

Related Concepts

Large Language Models (LLMs)
Generative AI
Diffusion Models
High-Performance Computing (HPC)