NVIDIA GH200 Grace Hopper Superchip Delivers Outstanding Performance in MLPerf Inference v4.1

Amr Elmeleegy

In the latest round of MLPerf Inference – a suite of standardized, peer-reviewed inference benchmarks – the NVIDIA platform delivered outstanding performance…

NVIDIA

Overview

This article covers the results of the NVIDIA GH200 Grace Hopper Superchip in the MLPerf Inference v4.1 benchmarks and the architecture behind them, which combines a Grace CPU and a Hopper GPU. It highlights the gains in AI performance and efficiency that make the GH200 a strong option for generative AI workloads.

What You'll Learn

1

How to leverage the NVIDIA GH200 for high-performance AI inference

2

Why the architecture of GH200 improves memory access and performance

3

When to use multiple GH200 Superchips for demanding AI workloads

Key Questions Answered

What performance improvements does the NVIDIA GH200 offer over the H100?

The NVIDIA GH200 delivers up to 1.4x more performance per accelerator compared to the H100 Tensor Core GPU across demanding benchmarks like Mixtral 8x7B and Llama 2 70B. On the GPT-J benchmark, a single GH200 Superchip also delivers up to 22x the throughput of the best two-socket, CPU-only submissions.

How does the GH200 architecture enhance memory efficiency?

The GH200 architecture allows the CPU and GPU to share a single per-process page table, so all CPU and GPU threads can access system-allocated memory without copying data back and forth. The NVLink-C2C interconnect between the Grace CPU and the Hopper GPU provides 900 GB/s of bandwidth to the GPU, which keeps this shared memory access efficient.
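
As a rough illustration of what the 900 GB/s figure means in practice, the sketch below estimates the ideal time to move a buffer over that link and over a link one-seventh as fast (the 7x comparison quoted in the statistics section). The buffer size and the function name are hypothetical, and the calculation assumes peak bandwidth, which real transfers do not reach.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move size_gb gigabytes at a given peak bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


GH200_LINK_GB_PER_S = 900.0                     # CPU-to-GPU bandwidth quoted above
BASELINE_GB_PER_S = GH200_LINK_GB_PER_S / 7.0   # assumption: a link 7x slower
BUFFER_GB = 90.0                                # hypothetical buffer size

fast = transfer_seconds(BUFFER_GB, GH200_LINK_GB_PER_S)
slow = transfer_seconds(BUFFER_GB, BASELINE_GB_PER_S)
print(f"GH200 link: {fast:.2f} s, baseline link: {slow:.2f} s")
```

The ratio between the two times is the same 7x by construction; the point is only to show how the gap scales with the amount of data a workload has to move.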

What are the key features of the GH200 NVL2?

The GH200 NVL2 connects two GH200 Superchips using NVLink, delivering 8 petaflops of AI performance in a single node. It features 144 Arm Neoverse cores, up to 960GB of LPDDR5X memory, and 288GB of HBM3e memory with 10TB/s bandwidth, enhancing scalability and performance for AI applications.
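
To relate these node-level figures to a single Superchip, the sketch below divides each total by two. It assumes the resources are split evenly between the two Superchips, which the summary implies but does not state; the dictionary keys and function name are illustrative only.

```python
# Node-level figures quoted above for the GH200 NVL2 (two Superchips).
# The LPDDR5X value is the quoted "up to" maximum.
NVL2_NODE = {
    "ai_petaflops": 8.0,
    "arm_neoverse_cores": 144,
    "lpddr5x_gb": 960,
    "hbm3e_gb": 288,
    "hbm3e_bandwidth_tb_per_s": 10.0,
}
SUPERCHIPS_PER_NODE = 2


def per_superchip(node: dict, count: int = SUPERCHIPS_PER_NODE) -> dict:
    """Divide each node-level total evenly across the Superchips (assumption)."""
    return {name: total / count for name, total in node.items()}


share = per_superchip(NVL2_NODE)
for name, value in share.items():
    print(f"{name}: {value:g}")
```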

What are the implications of using GH200 for real-time AI services?

The GH200 maintains server scenario performance within 5% of its offline performance, making it suitable for real-time AI services. In contrast, CPU-only submissions can see performance degradation of up to 55% under similar conditions, highlighting the GH200's advantage in latency-sensitive applications.
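
The difference between those two percentages is easier to see with numbers. The sketch below applies each degradation bound to a normalized offline throughput of 100; that baseline is a hypothetical unit chosen for illustration, not a benchmark result.

```python
def server_throughput(offline: float, degradation: float) -> float:
    """Throughput left in the server scenario after a fractional degradation."""
    if not 0.0 <= degradation <= 1.0:
        raise ValueError("degradation must be a fraction between 0 and 1")
    return offline * (1.0 - degradation)


OFFLINE_BASELINE = 100.0  # hypothetical normalized offline throughput

gh200_floor = server_throughput(OFFLINE_BASELINE, 0.05)  # "within 5%"
cpu_floor = server_throughput(OFFLINE_BASELINE, 0.55)    # "up to 55%" loss

print(f"GH200 keeps at least {gh200_floor:.0f} of {OFFLINE_BASELINE:.0f}")
print(f"CPU-only may keep as little as {cpu_floor:.0f} of {OFFLINE_BASELINE:.0f}")
```

Under these bounds, a system sized from its offline numbers would retain most of that capacity on the GH200 once latency constraints apply, while a CPU-only system could lose more than half.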

Key Statistics & Figures

Bandwidth to GPU

900 GB/s

This is about 7x the bandwidth of the x16 PCIe Gen5 links commonly used in current servers, which makes memory access between the CPU and GPU more efficient.

AI performance

8 petaflops

This performance is achieved by fusing two Grace CPUs and two Hopper GPUs in the GH200 NVL2 architecture.

Performance uplift on Llama 2 70B

1.4x

This uplift is compared to the H100 Tensor Core GPU, showcasing the GH200's capabilities in generative AI benchmarks.

Throughput increase on GPT-J benchmark

22x

This increase is observed when comparing a single GH200 Superchip to the best two-socket, CPU-only submissions.

Technologies & Tools

Hardware

NVIDIA GH200

Used for high-performance AI inference and generative AI applications.

Interconnect

NVIDIA NVLink

Facilitates high-bandwidth, low-latency communication between the CPU and GPU.

Key Actionable Insights

1
Utilize the NVIDIA GH200 for deploying generative AI applications to achieve superior performance and efficiency.
With its advanced architecture, the GH200 is designed to handle demanding AI workloads, making it an ideal choice for organizations looking to enhance their AI capabilities.

2
Consider the GH200 NVL2 for applications requiring high throughput and low latency.
The NVL2 links two GH200 Superchips over NVLink within a single node, providing more compute and memory for complex AI models in production environments.

Common Pitfalls

1

Overlooking the importance of memory architecture in AI performance.

Many developers may focus solely on processing power without considering how memory access and bandwidth affect overall system efficiency. Understanding the GH200's shared memory architecture is crucial for optimizing AI workloads.

Related Concepts

Generative AI

Machine Learning Performance Benchmarks

NVIDIA Hopper GPU Architecture