Reproducing NVIDIA MLPerf v5.0 Training Scores for LLM Benchmarks

The previous post, NVIDIA Blackwell Delivers up to 2.6x Higher Performance in MLPerf Training v5.0, explains how the NVIDIA platform delivered the fastest time…

Michał Marcinkiewicz

Overview

This article explains how to reproduce NVIDIA's MLPerf Training v5.0 scores for two LLM benchmarks: Llama 2 70B LoRA fine-tuning and Llama 3.1 405B pretraining. It covers the prerequisites, cluster setup, and the steps to run each benchmark, including building the container, downloading the dataset, and parsing the logs.

What You'll Learn

1. How to reproduce NVIDIA MLPerf v5.0 training scores for Llama 2 70B LoRA fine-tuning
2. How to set up a SLURM cluster for running MLPerf benchmarks
3. How to download and preprocess datasets for LLM training

Prerequisites & Requirements

  • Docker
  • Hugging Face access token
  • Understanding of SLURM and cluster management (optional)
  • Experience with NVIDIA GPUs and MLPerf benchmarks (optional)

Key Questions Answered

What are the hardware requirements for running Llama 2 70B LoRA benchmarks?
To run the Llama 2 70B LoRA benchmark, you need an NVIDIA DGX B200 or NVIDIA GB200 NVL72 system, or multiple GB200 NVL72 systems connected with InfiniBand for larger scales. The smallest NVIDIA submission for this benchmark uses eight GPUs.
How do you download and preprocess datasets for Llama 2 70B LoRA?
To download and preprocess datasets for Llama 2 70B LoRA, create a directory for the data, run a Docker container, and execute scripts to download the GovReport dataset and the Hugging Face model checkpoint. Ensure you have a Hugging Face token for the model download.
What steps are involved in launching benchmarks on NVIDIA MLPerf?
Launching an NVIDIA MLPerf benchmark involves building the Docker container, downloading the dataset and model, configuring the SLURM job files, and submitting the job with the sbatch command to start training. The resulting log files are then parsed for performance metrics.
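
As a rough illustration of the final submission step, the sketch below assembles an sbatch command line in Python. The script name `run_benchmark.sub`, the node count, and the time limit are hypothetical placeholders, not values from the NVIDIA submission; the real job script and settings come from the README in the submission repository.

```python
import shlex


def build_sbatch_command(num_nodes, walltime_minutes, job_script):
    """Return the argv list for submitting a Slurm batch job."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if walltime_minutes < 1:
        raise ValueError("walltime_minutes must be at least 1")
    return [
        "sbatch",
        f"--nodes={num_nodes}",
        f"--time={walltime_minutes}",  # Slurm reads a bare integer as minutes
        job_script,  # hypothetical script name, see the lead-in
    ]


# Placeholder values for illustration only.
cmd = build_sbatch_command(num_nodes=1, walltime_minutes=30,
                           job_script="run_benchmark.sub")
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` later without shell quoting problems; printing it first makes it easy to review before anything is submitted.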

Key Statistics & Figures

Disk space required for Llama 3.1 405B
At least 2.5 TB
This space is needed to store the dataset and the model.
Minimum number of GPUs for Llama 2 70B LoRA
8 GPUs
This is the smallest NVIDIA submission configuration for this benchmark.
Minimum number of GPUs for Llama 3.1 405B
256 GPUs
This is required for the smallest NVIDIA submission for this benchmark.
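
The minimum GPU counts above translate into a node count once the number of GPUs per node is known. The following is a minimal sketch of that arithmetic; the dictionary keys are made-up labels, and the `gpus_per_node` value is an assumption you should replace with the figure for your own systems.

```python
import math

# Minimum GPU counts quoted above for the smallest NVIDIA submissions.
# The keys are illustrative labels, not official benchmark identifiers.
MIN_GPUS = {
    "llama2_70b_lora": 8,
    "llama31_405b": 256,
}


def nodes_needed(benchmark, gpus_per_node):
    """Return how many nodes are needed to reach the minimum GPU count."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return math.ceil(MIN_GPUS[benchmark] / gpus_per_node)


# Assumption: eight GPUs per node. Adjust for your hardware.
for name in MIN_GPUS:
    print(name, nodes_needed(name, gpus_per_node=8))
```

With eight GPUs per node, the LoRA benchmark fits on a single node while the 405B pretraining benchmark needs 32 nodes, which is why the latter depends on a multi-node cluster and a job scheduler.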

Technologies & Tools

Containerization
Docker
Used for building and running containers to download and preprocess datasets.
Job Scheduling
Slurm
Used to manage and schedule jobs on the compute nodes during benchmarking.
Machine Learning Framework
NVIDIA NeMo
Framework used for model training and evaluation.

Key Actionable Insights

1. Ensure your system meets the hardware requirements before attempting to run benchmarks.
   Having the correct hardware setup is crucial for successful benchmarking and achieving optimal performance. This includes having the right number of GPUs and systems connected via InfiniBand.
2. Utilize the provided README files in the submission repositories for detailed instructions.
   These README files contain essential information and scripts that can streamline the process of reproducing the benchmarks, saving time and reducing errors.
3. Monitor the MLPerf logs closely for performance metrics during training.
   The logs provide valuable insights into the training process, including initialization times and evaluation accuracy, which are critical for understanding model performance.
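
To make the log-monitoring advice concrete, here is a minimal sketch of a parser for MLPerf log output. It relies on the MLPerf logging convention that each event is a line containing `:::MLLOG` followed by a JSON object with fields such as `key`, `value`, and `time_ms`. The sample lines are synthetic and invented for illustration; they are not taken from an NVIDIA run.

```python
import json

MLLOG_PREFIX = ":::MLLOG"


def parse_mllog_lines(lines):
    """Extract the JSON events from lines that contain the MLLOG marker."""
    events = []
    for line in lines:
        idx = line.find(MLLOG_PREFIX)  # launchers may prepend a rank prefix
        if idx == -1:
            continue
        payload = line[idx + len(MLLOG_PREFIX):].strip()
        try:
            events.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # skip truncated or interleaved lines
    return events


def time_to_train_minutes(events):
    """Return minutes elapsed between the run_start and run_stop events."""
    start = next((e["time_ms"] for e in events if e.get("key") == "run_start"), None)
    stop = next((e["time_ms"] for e in events if e.get("key") == "run_stop"), None)
    if start is None or stop is None:
        raise ValueError("log is missing run_start or run_stop")
    return (stop - start) / 60000.0


# Synthetic sample lines, made up for this example.
sample = [
    "some unrelated launcher output",
    ':::MLLOG {"namespace": "", "time_ms": 1000, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {}}',
    ':::MLLOG {"namespace": "", "time_ms": 601000, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.92, "metadata": {}}',
    ':::MLLOG {"namespace": "", "time_ms": 661000, "event_type": "INTERVAL_END", "key": "run_stop", "value": null, "metadata": {"status": "success"}}',
]

events = parse_mllog_lines(sample)
print(time_to_train_minutes(events))
print([e["value"] for e in events if e["key"] == "eval_accuracy"])
```

The same `events` list can be filtered for any other key of interest, such as initialization markers, without re-reading the log file.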

Common Pitfalls

1. Failing to allocate sufficient disk space can lead to benchmark failures.
   Ensure you have at least 2.5 TB of disk space for Llama 3.1 405B and 300 GB for LoRA fine-tuning to avoid interruptions during data processing.
2. Not following the correct SLURM job configuration can result in inefficient resource usage.
   Properly configuring SLURM job files is essential for optimizing performance and ensuring that resources are utilized effectively during training.
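
The disk-space pitfall is easy to check before downloading anything. The sketch below compares free space against the figures quoted above; it assumes decimal units (1 TB = 10^12 bytes), and the dictionary keys and function name are hypothetical.

```python
import shutil

# Disk requirements quoted above, in bytes (decimal units assumed).
REQUIRED_BYTES = {
    "llama2_70b_lora": 300 * 10**9,   # 300 GB
    "llama31_405b": 2500 * 10**9,     # 2.5 TB
}


def has_enough_space(path, benchmark, free_bytes=None):
    """Return True if the filesystem holding `path` has enough free space.

    `free_bytes` can be passed explicitly to test the logic without
    touching the real filesystem.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= REQUIRED_BYTES[benchmark]


# Check the current directory; point this at your data directory instead.
print(has_enough_space(".", "llama2_70b_lora"))
```

Running this check against the target data directory before starting the download avoids discovering the shortfall partway through preprocessing.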

Related Concepts

NVIDIA MLPerf
Large Language Models
Benchmarking Techniques