Benchmarking GPUDirect RDMA on Modern Server Platforms

Davide Rossetti

NVIDIA GPUDirect RDMA is a technology which enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI…

NVIDIA


Overview

The article discusses NVIDIA GPUDirect RDMA, a technology that facilitates direct data exchange between GPUs and third-party devices via PCI Express. It provides insights into performance benchmarks across different hardware platforms, focusing on latency and bandwidth metrics for GPU-accelerated systems.

What You'll Learn

1. How to optimize data transfer between GPUs and third-party devices using GPUDirect RDMA
2. Why understanding InfiniBand performance metrics is crucial for high-performance computing
3. When to use dual-rail configurations for enhanced bandwidth in GPU-accelerated systems

Prerequisites & Requirements

Understanding of GPU architectures and PCI Express technology
Experience with high-performance computing environments (optional)

Key Questions Answered

What is GPUDirect RDMA and how does it work?

GPUDirect RDMA is a technology that allows direct data transfer between NVIDIA GPUs and third-party devices without involving the CPU. This is achieved through PCI Express, enabling high-performance data exchanges for applications in fields like healthcare and high-energy physics.

What are the latency and bandwidth performance metrics for GPUDirect RDMA?

The article reports that GPUDirect RDMA achieves a latency consistently below 2 microseconds, with bandwidth of up to 9.8 GB/s for host-to-GPU transfers and 11.6 GB/s in dual-rail configurations, a significant improvement over traditional methods that stage data through host memory.

How does InfiniBand impact GPU data transfer performance?

InfiniBand provides a high-performance, low-latency interconnect for GPU data transfers. The article highlights that InfiniBand can achieve link speeds of 40 Gb/s to 56 Gb/s, significantly enhancing data throughput in GPU-accelerated applications.
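To relate the quoted link speeds to the GB/s figures used elsewhere in this summary, the sketch below converts a signaling rate into a usable data rate. It assumes the 40 Gb/s figure refers to a QDR 4x link with 8b/10b encoding and the 56 Gb/s figure to an FDR 4x link with 64b/66b encoding; the summary does not name the link generations, and the function and constant names are illustrative only. Protocol overhead above the line encoding is ignored, so real transfers will land below these ceilings.

```python
def data_rate_gbytes(signal_gbits, payload_bits, coded_bits):
    """Usable data rate in GB/s for a link with the given line encoding.

    signal_gbits : raw signaling rate of the whole link, in Gb/s
    payload_bits : data bits carried per encoded block (e.g. 8 or 64)
    coded_bits   : bits on the wire per encoded block (e.g. 10 or 66)
    """
    if signal_gbits <= 0 or payload_bits <= 0 or coded_bits < payload_bits:
        raise ValueError("invalid link parameters")
    return signal_gbits * payload_bits / coded_bits / 8.0


def link_efficiency(measured_gbytes, peak_gbytes):
    """Fraction of the encoded-link ceiling that a measurement reaches."""
    return measured_gbytes / peak_gbytes


# Assumed link generations (not stated in the summary).
QDR_4X_GBYTES = data_rate_gbytes(40.0, 8, 10)    # 8b/10b encoding
FDR_4X_GBYTES = data_rate_gbytes(56.0, 64, 66)   # 64b/66b encoding

if __name__ == "__main__":
    print(f"QDR 4x ceiling: {QDR_4X_GBYTES:.2f} GB/s")
    print(f"FDR 4x ceiling: {FDR_4X_GBYTES:.2f} GB/s")
```

Under these assumptions a single QDR 4x link tops out at 4 GB/s and a single FDR 4x link at roughly 6.8 GB/s, which gives a reference point when judging whether a measured bandwidth is limited by the network or by something else in the path.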

What are common performance bottlenecks when using GPUDirect RDMA?

Common bottlenecks include PCIe architectural limitations and NUMA-like effects, which can restrict the achievable bandwidth. The article emphasizes the importance of understanding server topology to mitigate these issues and optimize performance.
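One quick first check on server topology is whether the GPU and the network adapter report the same NUMA node. The sketch below reads the `numa_node` attribute that Linux exposes in sysfs for each PCI device. It assumes a Linux host; the PCI addresses in the usage section are hypothetical placeholders, and the helper names are not from the article. Sharing a NUMA node is only a coarse indicator: it does not show whether the two devices sit behind the same PCIe switch or root port.

```python
from pathlib import Path

# Standard Linux sysfs location for PCI devices.
SYSFS_PCI = Path("/sys/bus/pci/devices")


def numa_node(bdf, root=SYSFS_PCI):
    """Return the NUMA node of a PCI device, or None if unknown.

    bdf  : PCI address such as "0000:02:00.0"
    root : sysfs directory to read from (overridable for testing)
    """
    try:
        value = int((Path(root) / bdf / "numa_node").read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when it has no NUMA information for the device.
    return value if value >= 0 else None


def same_numa_node(bdf_a, bdf_b, root=SYSFS_PCI):
    """True only if both devices report the same known NUMA node."""
    node_a = numa_node(bdf_a, root)
    node_b = numa_node(bdf_b, root)
    return node_a is not None and node_a == node_b


if __name__ == "__main__":
    gpu_bdf = "0000:02:00.0"   # hypothetical GPU address
    hca_bdf = "0000:03:00.0"   # hypothetical adapter address
    print("GPU node:", numa_node(gpu_bdf))
    print("HCA node:", numa_node(hca_bdf))
    print("same node:", same_numa_node(gpu_bdf, hca_bdf))
```

If the two devices report different nodes, peer-to-peer traffic has to cross the inter-socket link, which is one place the NUMA-like effects described above can show up.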

Key Statistics & Figures

Host-to-Host Latency

1.3 microseconds

Measured using the ibv_ud_pingpong benchmark on the tested platform.

Host-to-GPU Bandwidth

9.8 GB/s

Achieved on Ivy Bridge Xeon systems when writing to GPU memory.

GPU-to-Host Bandwidth

3.7 GB/s

This bandwidth was observed when boosting the GPU clock to 875 MHz.

GPU-to-GPU Latency

1.9 microseconds

This latency is achieved using GPUDirect RDMA for small message sizes.
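Figures like these are typically derived from raw timings. The sketch below shows the arithmetic, assuming the common ping-pong convention that one-way latency is half the average round-trip time; whether the article's numbers follow exactly this convention is not stated in the summary. The function names and the sample timings are illustrative, not measured values.

```python
def one_way_latency_us(elapsed_s, round_trips):
    """One-way latency in microseconds from a ping-pong run.

    elapsed_s   : wall-clock time for the whole run, in seconds
    round_trips : number of completed ping-pong exchanges
    Assumes one-way latency is half of the average round trip.
    """
    if elapsed_s <= 0 or round_trips <= 0:
        raise ValueError("elapsed time and round trips must be positive")
    return elapsed_s / round_trips / 2.0 * 1e6


def bandwidth_gbytes(total_bytes, elapsed_s):
    """Sustained bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if total_bytes < 0 or elapsed_s <= 0:
        raise ValueError("bytes must be non-negative and time positive")
    return total_bytes / elapsed_s / 1e9


if __name__ == "__main__":
    # Illustrative inputs chosen to reproduce the scale of the figures above.
    print(f"{one_way_latency_us(0.0026, 1000):.2f} us one-way")
    print(f"{bandwidth_gbytes(9.8e9, 1.0):.2f} GB/s")
```

Reporting the iteration count and message size alongside each number makes results from different platforms easier to compare.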

Technologies & Tools

Technology

GPUDirect RDMA

Enables direct data transfers between GPUs and third-party devices.

Networking

InfiniBand

Provides a high-performance, low-latency interconnect for GPU data transfers.

Software

CUDA

Used for programming GPU-accelerated applications.

Key Actionable Insights

1
To maximize the performance of GPU-accelerated applications, leverage GPUDirect RDMA for direct data transfers, reducing latency significantly compared to traditional methods.
This is particularly beneficial in environments where low-latency communication is critical, such as in healthcare or high-energy physics applications.

2
Regularly benchmark your InfiniBand network performance to identify potential bottlenecks and optimize configurations for bandwidth and latency.
Understanding your network's performance can help in making informed decisions about hardware upgrades or configuration changes to enhance overall system performance.

3
Consider using dual-rail InfiniBand configurations to achieve higher bandwidth and reduce the risk of bottlenecks in data-intensive applications.
This setup is particularly useful for applications that require high throughput, as it can significantly enhance data transfer rates between GPUs.
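One way a dual-rail setup can raise throughput is by striping a large message across both links so that each rail carries part of the payload. The sketch below shows only the bookkeeping for such a split. It is an illustration under that assumption, not the method used in the article, and `split_across_rails` is a hypothetical helper; in practice the communication library usually handles rail selection and striping.

```python
def split_across_rails(total_bytes, rails=2):
    """Split a message into one contiguous (offset, length) chunk per rail.

    Chunks are as even as possible; any remainder bytes go to the
    lowest-numbered rails. Rails that would get zero bytes are omitted.
    """
    if total_bytes < 0 or rails <= 0:
        raise ValueError("total_bytes must be >= 0 and rails must be > 0")
    base, remainder = divmod(total_bytes, rails)
    chunks = []
    offset = 0
    for rail in range(rails):
        length = base + (1 if rail < remainder else 0)
        if length == 0:
            continue
        chunks.append((offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    for rail, (offset, length) in enumerate(split_across_rails(1 << 20)):
        print(f"rail {rail}: offset={offset} length={length}")
```

Striping pays off mainly for large messages; for small ones the per-transfer overhead tends to dominate, so sending on a single rail is often the simpler choice.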

Common Pitfalls

1

Assuming that all server configurations will yield optimal performance with GPUDirect RDMA can lead to subpar results.

Performance can vary significantly based on the underlying hardware architecture and PCIe topology. It's crucial to analyze and optimize server configurations to achieve the best results.

2

Neglecting to benchmark InfiniBand performance regularly may result in undetected bottlenecks.

Without regular benchmarking, performance issues may go unnoticed, leading to inefficient data transfers and potential delays in data-intensive applications.

Related Concepts

High-Performance Computing

PCI Express Architecture

GPU Memory Management

Data Transfer Optimization Techniques
