Benchmarking GPUDirect RDMA on Modern Server Platforms

NVIDIA GPUDirect RDMA is a technology which enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI…

Davide Rossetti
12 min read · intermediate

Overview

The article discusses NVIDIA GPUDirect RDMA, a technology that facilitates direct data exchange between GPUs and third-party devices via PCI Express. It provides insights into performance benchmarks across different hardware platforms, focusing on latency and bandwidth metrics for GPU-accelerated systems.

What You'll Learn

1. How to optimize data transfer between GPUs and third-party devices using GPUDirect RDMA
2. Why understanding InfiniBand performance metrics is crucial for high-performance computing
3. When to use dual-rail configurations for enhanced bandwidth in GPU-accelerated systems

Prerequisites & Requirements

  • Understanding of GPU architectures and PCI Express technology
  • Experience with high-performance computing environments (optional)

Key Questions Answered

What is GPUDirect RDMA and how does it work?
GPUDirect RDMA is a technology that allows direct data transfer between NVIDIA GPUs and third-party devices, such as network adapters, without staging the data in host memory. The transfer uses standard PCI Express features, enabling high-performance data exchange for applications in fields like healthcare and high-energy physics.
What are the latency and bandwidth performance metrics for GPUDirect RDMA?
The article reports that GPUDirect RDMA achieves a latency consistently below 2 microseconds and offers bandwidth performance of up to 9.8 GB/s for host-to-GPU transfers and 11.6 GB/s in dual-rail configurations, showcasing significant improvements over traditional methods.
How does InfiniBand impact GPU data transfer performance?
InfiniBand provides a high-performance, low-latency interconnect for GPU data transfers. The article highlights that InfiniBand can achieve link speeds of 40 Gb/s to 56 Gb/s, significantly enhancing data throughput in GPU-accelerated applications.
What are common performance bottlenecks when using GPUDirect RDMA?
Common bottlenecks include PCIe architectural limitations and NUMA-like effects, which can restrict the achievable bandwidth. The article emphasizes the importance of understanding server topology to mitigate these issues and optimize performance.
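
To put the 40 Gb/s and 56 Gb/s link speeds mentioned above in the same units as the GB/s bandwidth figures, the line encoding has to be taken into account. The following is a minimal sketch, assuming those speeds correspond to 4x QDR (8b/10b encoding) and 4x FDR (64b/66b encoding) links; the source does not name the generations, and the function name is illustrative. It ignores packet headers and other protocol overhead, so real payload rates are lower.

```python
# Per-lane signaling rate (Gb/s) and line-encoding efficiency.
# Assumption: 40 Gb/s means 4x QDR and 56 Gb/s means 4x FDR.
IB_RATES = {
    "QDR": (10.0, 8 / 10),       # 8b/10b encoding
    "FDR": (14.0625, 64 / 66),   # 64b/66b encoding
}


def ib_peak_gbytes_per_s(generation, lanes=4, rails=1):
    """Peak data rate in GB/s (10^9 bytes/s) after line encoding.

    Protocol headers are not subtracted, so this is an upper bound.
    """
    lane_gbps, efficiency = IB_RATES[generation]
    return lane_gbps * lanes * efficiency * rails / 8


if __name__ == "__main__":
    for generation in ("QDR", "FDR"):
        for rails in (1, 2):
            rate = ib_peak_gbytes_per_s(generation, rails=rails)
            print(f"{generation} 4x, {rails} rail(s): {rate:.2f} GB/s")
```

Under these assumptions a single FDR rail tops out a little below 7 GB/s, which is why a second rail is needed before the network stops being the limiting factor for the larger bandwidth figures.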

Key Statistics & Figures

Host-to-Host Latency: 1.3 microseconds
Measured using the ibv_ud_pingpong benchmark on the tested platform.
Host-to-GPU Bandwidth: 9.8 GB/s
Achieved on Ivy Bridge Xeon systems when writing to GPU memory.
GPU-to-Host Bandwidth: 3.7 GB/s
Observed when boosting the GPU clock to 875 MHz.
GPU-to-GPU Latency: 1.9 microseconds
Achieved using GPUDirect RDMA for small message sizes.
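
A measured bandwidth is easier to judge against the raw ceiling of the PCI Express link it crosses. The sketch below computes that ceiling from the per-lane transfer rate and line encoding of PCIe Gen2 and Gen3, then expresses a measurement as a fraction of it. The Gen3 x16 slot used in the example is an assumption, since the summary does not state the link width or generation of the tested devices, and the function names are illustrative. Transaction-layer packet overhead is ignored, so achievable rates are always below the value returned.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),       # 8b/10b encoding
    3: (8.0, 128 / 130),    # 128b/130b encoding
}


def pcie_raw_gbytes_per_s(generation, lanes):
    """Raw one-direction link bandwidth in GB/s (10^9 bytes/s)."""
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt * efficiency * lanes / 8


def fraction_of_ceiling(measured_gbytes_per_s, generation, lanes):
    """Measured bandwidth as a fraction of the raw link bandwidth."""
    return measured_gbytes_per_s / pcie_raw_gbytes_per_s(generation, lanes)


if __name__ == "__main__":
    # Assumption: the measured device sits in a Gen3 x16 slot.
    ceiling = pcie_raw_gbytes_per_s(3, 16)
    print(f"Gen3 x16 raw ceiling: {ceiling:.2f} GB/s")
    for label, measured in (("host-to-GPU", 9.8), ("GPU-to-host", 3.7)):
        share = fraction_of_ceiling(measured, 3, 16)
        print(f"{label}: {measured} GB/s = {share:.0%} of ceiling")
```

A large gap between the two directions, with both well under the link ceiling, points at the platform's handling of peer-to-peer traffic rather than at the link itself.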

Technologies & Tools

Technology: GPUDirect RDMA
Enables direct data transfers between GPUs and third-party devices.
Networking: InfiniBand
Provides a high-performance, low-latency interconnect for GPU data transfers.
Software: CUDA
Used for programming GPU-accelerated applications.

Key Actionable Insights

1. To maximize the performance of GPU-accelerated applications, leverage GPUDirect RDMA for direct data transfers, reducing latency significantly compared to traditional methods.
This is particularly beneficial in environments where low-latency communication is critical, such as in healthcare or high-energy physics applications.
2. Regularly benchmark your InfiniBand network performance to identify potential bottlenecks and optimize configurations for bandwidth and latency.
Understanding your network's performance can help in making informed decisions about hardware upgrades or configuration changes to enhance overall system performance.
3. Consider using dual-rail InfiniBand configurations to achieve higher bandwidth and reduce the risk of bottlenecks in data-intensive applications.
This setup is particularly useful for applications that require high throughput, as it can significantly enhance data transfer rates between GPUs.
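
Regular benchmarking is easier to act on when each run is reduced to the same few numbers. The sketch below is one possible way to do that, not the method used in the article: it summarizes a list of per-message latency samples and converts a timed transfer into GB/s. The function names and the sample values are hypothetical and only illustrate the calculation.

```python
import math
import statistics


def summarize_latencies(samples_us):
    """Return min, median, 99th percentile and max of latency samples (µs)."""
    ordered = sorted(samples_us)
    # Nearest-rank percentile: smallest sample with at least 99% at or below it.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }


def bandwidth_gbytes_per_s(message_bytes, iterations, elapsed_s):
    """Average bandwidth in GB/s (10^9 bytes/s) over a timed transfer loop."""
    return message_bytes * iterations / elapsed_s / 1e9


if __name__ == "__main__":
    # Hypothetical samples, not measurements from the article.
    samples = [1.9, 2.0, 1.8, 2.1, 5.0]
    print(summarize_latencies(samples))
    print(bandwidth_gbytes_per_s(1_000_000, 1000, 0.5), "GB/s")
```

Tracking the median alongside a high percentile helps separate a general slowdown from occasional outliers, which a single average would blur together.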

Common Pitfalls

1. Assuming that all server configurations will yield optimal performance with GPUDirect RDMA can lead to subpar results.
Performance can vary significantly based on the underlying hardware architecture and PCIe topology. It's crucial to analyze and optimize server configurations to achieve the best results.
2. Neglecting to benchmark InfiniBand performance regularly may result in undetected bottlenecks.
Without regular benchmarking, performance issues may go unnoticed, leading to inefficient data transfers and potential delays in data-intensive applications.
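
A quick first check on server topology is whether the GPU and the network adapter hang off the same CPU socket. On Linux, each PCI device exposes a `numa_node` attribute under `/sys/bus/pci/devices/<address>/`, and the sketch below compares that value for two devices. The helper names and the device addresses in the usage line are hypothetical. Sharing a NUMA node does not guarantee that two devices share a PCIe switch or root port, so treat this as a coarse filter rather than a full topology analysis.

```python
import sys
from pathlib import Path

SYSFS_PCI_DEVICES = "/sys/bus/pci/devices"


def numa_node(pci_address, sysfs_root=SYSFS_PCI_DEVICES):
    """Return the NUMA node of a PCI device, or None if it is unknown."""
    path = Path(sysfs_root) / pci_address / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when it has no NUMA information for the device.
    return node if node >= 0 else None


def same_numa_node(address_a, address_b, sysfs_root=SYSFS_PCI_DEVICES):
    """True or False if both nodes are known, None if either is unknown."""
    node_a = numa_node(address_a, sysfs_root)
    node_b = numa_node(address_b, sysfs_root)
    if node_a is None or node_b is None:
        return None
    return node_a == node_b


if __name__ == "__main__":
    # Usage (addresses are placeholders):
    #   python check_numa.py 0000:02:00.0 0000:03:00.0
    if len(sys.argv) != 3:
        sys.exit("usage: check_numa.py <gpu_pci_address> <nic_pci_address>")
    result = same_numa_node(sys.argv[1], sys.argv[2])
    print({True: "same NUMA node", False: "different NUMA nodes",
           None: "NUMA node unknown"}[result])
```

If the two devices report different nodes, peer-to-peer traffic has to cross the inter-socket link, which is one of the NUMA-like effects that can limit achievable bandwidth.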

Related Concepts

High-Performance Computing
PCI Express Architecture
GPU Memory Management
Data Transfer Optimization Techniques