Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async

Today’s leading-edge high performance computing (HPC) systems contain tens of thousands of GPUs. In NVIDIA systems, GPUs are connected on nodes through the…

Pak Markthub
13 min read · advanced

Overview

The article discusses how NVIDIA Magnum IO NVSHMEM and InfiniBand GPUDirect Async (IBGDA) enhance network performance in high-performance computing (HPC) systems by enabling efficient GPU-to-GPU communication. It highlights the limitations of traditional CPU proxy methods and presents IBGDA as a solution that improves throughput and reduces latency for small message transfers.

What You'll Learn

1

How to utilize InfiniBand GPUDirect Async for efficient GPU communication

2

Why NVSHMEM is crucial for strong scaling in HPC applications

3

When to implement IBGDA to optimize small message transfers

Prerequisites & Requirements

  • Understanding of high-performance computing concepts
  • Familiarity with NVIDIA Magnum IO and NVSHMEM (optional)

Key Questions Answered

How does InfiniBand GPUDirect Async improve GPU communication?
InfiniBand GPUDirect Async allows GPUs to submit communication requests directly to the NIC, bypassing the CPU, which significantly enhances throughput and reduces latency for small message transfers. This direct interaction enables efficient data transfers, especially in applications requiring strong scaling.
What are the performance benefits of using NVSHMEM with IBGDA?
Using NVSHMEM with IBGDA results in up to 9.5x higher throughput for block-put operations with message sizes less than 1 KiB. This improvement is particularly beneficial for applications that need to scale efficiently across many GPUs.
What limitations does the CPU proxy method impose on communication?
The CPU proxy method introduces significant bottlenecks, limiting the throughput for fine-grain transfers due to the CPU's slower processing rate compared to the NIC. This results in inefficiencies, especially when scaling to larger numbers of GPUs.
How does IBGDA affect latency in all-to-all communication?
IBGDA provides a consistent latency of around 64 microseconds for message sizes less than 8 KiB, while the CPU proxy method shows latencies that fluctuate between 128 and 256 microseconds. This consistency is crucial for performance in HPC applications.
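
To put the all-to-all latency figures above side by side, here is a minimal sketch of the arithmetic. It uses only the numbers quoted in the answer (about 64 microseconds for IBGDA, 128 to 256 microseconds for the CPU proxy). The constant and function names are illustrative, and because the source figures are approximate, the resulting ratios should be read as rough.

```python
# Latency figures quoted in the answer above, in microseconds (approximate).
IBGDA_LATENCY_US = 64.0
PROXY_LATENCY_RANGE_US = (128.0, 256.0)


def latency_ratio(proxy_us: float, ibgda_us: float) -> float:
    """Return how many times longer the CPU proxy latency is than the IBGDA latency."""
    return proxy_us / ibgda_us


# Ratio at the low and high ends of the quoted CPU proxy range.
low, high = (latency_ratio(p, IBGDA_LATENCY_US) for p in PROXY_LATENCY_RANGE_US)
print(f"CPU proxy latency is roughly {low:.0f}x to {high:.0f}x the IBGDA latency")
```

With the quoted figures, the CPU proxy latency comes out at roughly 2x to 4x the IBGDA latency for messages under 8 KiB.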

Key Statistics & Figures

Throughput improvement with IBGDA
up to 9.5x
For NVSHMEM block-put operations with message sizes less than 1 KiB.
Latency for all-to-all communication with IBGDA
around 64 microseconds
For message sizes less than 8 KiB.

Technologies & Tools

Software
NVIDIA Magnum IO
Provides an architecture for parallel, asynchronous, and intelligent data center IO.
Communication Library
NVSHMEM
Enables efficient GPU communication in HPC systems.
Networking
InfiniBand
Used for high-speed data transfers between nodes.
Technology
GPUDirect Async
Allows direct communication between GPU and NIC, bypassing the CPU.

Key Actionable Insights

1
Implement InfiniBand GPUDirect Async in your HPC applications to enhance communication efficiency.
This approach allows GPUs to communicate directly with the NIC, reducing CPU overhead and improving throughput for small message sizes, which is critical for applications requiring strong scaling.
2
Consider using NVSHMEM for applications that demand high performance in distributed computing environments.
NVSHMEM's architecture is designed for efficient GPU communication, making it suitable for workloads that require fine-grain data access and low latency.
3
Optimize your code to leverage IBGDA's capabilities for small message transfers.
By doing so, you can achieve higher throughput and better performance, particularly as your application scales across multiple GPUs.

Common Pitfalls

1
Relying on CPU proxy methods can severely limit communication efficiency.
This occurs because the CPU becomes a bottleneck, unable to keep up with the high request rates generated by GPUs, especially in high-performance computing scenarios.
2
Not optimizing for small message transfers can lead to suboptimal performance.
As workloads scale, smaller messages become more common, and failing to address their efficient transfer can hinder overall application performance.
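
The CPU-proxy bottleneck described in the first pitfall can be pictured as a pipeline whose sustained message rate is capped by its slowest stage. The sketch below is a toy model under that assumption, not a description of how NVSHMEM is implemented. All rates are hypothetical placeholders rather than measurements, and the names are invented for illustration.

```python
def sustained_rate(stage_rates: dict) -> float:
    """Sustained message rate of a pipeline, bounded by its slowest stage."""
    return min(stage_rates.values())


# Hypothetical placeholder rates in million messages per second.
# These are NOT measured values; they only illustrate the shape of the problem.
GPU_ISSUE_RATE = 100.0  # how fast GPU threads can generate requests
CPU_PROXY_RATE = 10.0   # how fast a CPU proxy thread can forward them
NIC_RATE = 80.0         # how fast the NIC can process them

# CPU proxy path: GPU -> CPU proxy -> NIC.
proxy_path = {"gpu": GPU_ISSUE_RATE, "cpu_proxy": CPU_PROXY_RATE, "nic": NIC_RATE}
# IBGDA path: GPU -> NIC, with the CPU stage removed.
ibgda_path = {"gpu": GPU_ISSUE_RATE, "nic": NIC_RATE}

proxy_rate = sustained_rate(proxy_path)
ibgda_rate = sustained_rate(ibgda_path)
print(f"CPU proxy path: {proxy_rate} M msgs/s, IBGDA path: {ibgda_rate} M msgs/s")
```

In this model, removing the CPU stage moves the limit from the proxy to the next-slowest stage. The ratio between the two results depends entirely on the placeholder values and should not be compared with the throughput figures reported in the article.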

Related Concepts

High-performance Computing
GPU Communication Strategies
Data Transfer Optimization Techniques