NVIDIA Contributes NVIDIA GB200 NVL72 Designs to Open Compute Project

Amr Elmeleegy

During the 2024 OCP Global Summit, NVIDIA announced that it has contributed the NVIDIA GB200 NVL72 rack and compute and switch tray liquid cooled designs to the…

NVIDIA


Overview

NVIDIA has contributed its GB200 NVL72 rack, compute tray, and switch tray liquid-cooled designs to the Open Compute Project (OCP). The contribution is intended to help data centers meet the compute-density demands of AI applications while reducing the cost and implementation time of new facilities.

What You'll Learn

1. How to leverage NVIDIA GB200 NVL72 designs for high compute density data centers

2. Why NVLink and NVSwitch are critical for GPU communication in AI applications

3. How to implement liquid cooling solutions in high-performance computing environments

Key Questions Answered

What are the benefits of the NVIDIA GB200 NVL72 designs for data centers?

The NVIDIA GB200 NVL72 designs raise compute density by connecting up to 72 NVIDIA Blackwell GPUs in a single NVLink domain, with each GPU communicating at 1.8 TB/s. This improves training and inference performance for large AI models, and building on the contributed designs reduces the cost and implementation time of new data centers.

How does the new reference architecture with Vertiv improve data center deployment?

The new joint reference architecture with Vertiv reduces implementation time for deploying NVIDIA Blackwell clusters by up to 50%. Data centers no longer need to develop their own designs from scratch and can instead build on Vertiv's expertise in efficient power and cooling solutions.

What modifications were made to the rack for the GB200 NVL72?

NVIDIA implemented several modifications to the rack, including adding over 100 lbs of steel reinforcements for stability, adapting designs to support 19” EIA gear, and incorporating blind mate slide rails for easier maintenance. These changes optimize space utilization and enhance structural integrity.

What is the significance of NVLink and NVSwitch in GPU communication?

NVLink and NVSwitch are designed to enhance GPU-to-GPU communication, reducing idle time and increasing throughput. The GB200 NVL72 design allows for a communication speed of 1.8 TB/s per GPU, which is crucial for efficiently training large AI models across GPU clusters.
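To make the link between interconnect bandwidth and idle time concrete, here is a minimal sketch that estimates how long a gradient all-reduce would take across the 72-GPU domain. It uses the textbook ring all-reduce cost model, which is an assumption made for illustration and not necessarily the algorithm that runs on GB200 NVL72. The function name `ring_allreduce_seconds`, the 10 GB payload, and the 0.1 TB/s comparison link are all hypothetical, and treating the full 1.8 TB/s as usable one-way bandwidth is optimistic.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           per_gpu_bw_bytes_per_s: float) -> float:
    """Ideal, bandwidth-only time for a ring all-reduce.

    In a ring all-reduce each GPU sends (and receives) roughly
    2 * (N - 1) / N times the payload size. Latency, protocol
    overhead, and overlap with compute are ignored.
    """
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / per_gpu_bw_bytes_per_s


NUM_GPUS = 72            # from the article: GPUs in one NVLink domain
NVLINK_BW = 1.8e12       # from the article: 1.8 TB/s per GPU
SLOWER_LINK_BW = 0.1e12  # hypothetical slower interconnect, for contrast
PAYLOAD = 10e9           # hypothetical 10 GB gradient buffer

fast = ring_allreduce_seconds(PAYLOAD, NUM_GPUS, NVLINK_BW)
slow = ring_allreduce_seconds(PAYLOAD, NUM_GPUS, SLOWER_LINK_BW)

print(f"NVLink estimate:      {fast * 1e3:.2f} ms")
print(f"Slower link estimate: {slow * 1e3:.2f} ms")
print(f"Speedup from bandwidth alone: {slow / fast:.1f}x")
```

Under this model the communication time scales inversely with per-GPU bandwidth, which is why a faster interconnect directly shortens the time GPUs spend waiting on each other.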

Key Statistics & Figures

GPU communication speed: 1.8 TB/s

This speed is achieved per GPU in the NVLink domain of the GB200 NVL72 design.

Maximum number of GPUs in NVLink domain: 72

The GB200 NVL72 design allows for up to 72 NVIDIA Blackwell GPUs to be interconnected.

Reduction in implementation time: up to 50%

This reduction is achieved through the new joint reference architecture with Vertiv.

Aggregate AllReduce bandwidth: 260 TB/s

This bandwidth is facilitated by the NVLink cartridges in the GB200 NVL72 design.
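The figures above can be cross-checked with simple arithmetic. The sketch below multiplies the per-GPU speed by the GPU count and compares the product with the quoted AllReduce figure. The helper name `aggregate_bandwidth_tbps` is hypothetical. The summary does not explain how the 260 TB/s figure is derived, so the code only reports the ratio and does not claim a reason for it.

```python
def aggregate_bandwidth_tbps(num_gpus: int, per_gpu_tbps: float) -> float:
    """Sum of per-GPU bandwidth across the whole NVLink domain, in TB/s."""
    return num_gpus * per_gpu_tbps


NUM_GPUS = 72                 # from the article
PER_GPU_TBPS = 1.8            # from the article
QUOTED_ALLREDUCE_TBPS = 260   # from the article

total = aggregate_bandwidth_tbps(NUM_GPUS, PER_GPU_TBPS)
ratio = QUOTED_ALLREDUCE_TBPS / total

print(f"72 GPUs x 1.8 TB/s = {total:.1f} TB/s")
print(f"Quoted AllReduce bandwidth is about {ratio:.2f}x that product")
```

The product comes to roughly 130 TB/s, and the quoted AllReduce bandwidth is close to twice that value.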

Technologies & Tools

Hardware

NVIDIA GB200 NVL72

Used for high compute density in AI data centers.

Interconnect Technology

NVLink

Facilitates high-speed communication between GPUs.

Interconnect Technology

NVSwitch

Enhances GPU-to-GPU communication efficiency.

Cooling Technology

Liquid Cooling

Manages thermal demands for high-performance computing.

Key Actionable Insights

1. Consider adopting NVIDIA's GB200 NVL72 designs to enhance your data center's compute density and efficiency.
With the increasing demands of AI workloads, utilizing these designs can significantly improve your infrastructure's performance and reduce costs associated with scaling.

2. Implement liquid cooling solutions as outlined in the GB200 NVL72 design to manage thermal demands effectively.
As AI models grow in size and complexity, maintaining optimal temperatures is crucial for the performance and longevity of hardware.

3. Leverage the new reference architecture with Vertiv to streamline your data center's deployment process.
This collaboration can help reduce setup time and improve energy efficiency, allowing for quicker scaling of AI capabilities.

Common Pitfalls

1. Neglecting the importance of GPU interconnectivity can lead to inefficiencies in AI model training.

Without proper interconnect solutions like NVLink and NVSwitch, GPUs may remain idle, waiting for data to be communicated, which increases the total cost of ownership.

2. Overlooking thermal management can result in hardware failures.

As compute density increases, so do thermal demands. Implementing robust cooling solutions is essential to maintain performance and prevent overheating.
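To give a sense of scale for the thermal point above, the following sketch applies the standard steady-state heat balance Q = m_dot * c_p * delta_T to estimate how much coolant flow a liquid-cooled rack would need. Every number in it is an assumption for illustration: the 100 kW heat load and the 10 K coolant temperature rise do not come from the article, the coolant is modeled as plain water, and the helper name `required_coolant_flow_kg_per_s` is hypothetical. Real deployments often use a glycol mixture with a different specific heat.

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K), approximate value for liquid water


def required_coolant_flow_kg_per_s(heat_load_w: float, delta_t_k: float,
                                   specific_heat: float = WATER_SPECIFIC_HEAT) -> float:
    """Mass flow needed to carry away heat_load_w with a delta_t_k temperature rise.

    Rearranges Q = m_dot * c_p * delta_T for m_dot.
    """
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_load_w / (specific_heat * delta_t_k)


HEAT_LOAD_W = 100_000.0  # hypothetical rack heat load, not from the article
DELTA_T_K = 10.0         # hypothetical coolant temperature rise

flow = required_coolant_flow_kg_per_s(HEAT_LOAD_W, DELTA_T_K)
# Water is close to 1 kg per litre, so kg/s * 60 approximates litres per minute.
print(f"Required flow: {flow:.2f} kg/s (about {flow * 60:.0f} L/min of water)")
```

Halving the allowed temperature rise doubles the required flow, which is one reason cooling capacity has to be planned together with compute density.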

Related Concepts

High Compute Density Data Centers

AI Model Training Techniques

GPU Interconnect Technologies

Liquid Cooling Solutions in Data Centers