Multinode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale

cuFFTMp is a multi-node, multi-process extension to cuFFT that enables scientists and engineers to solve challenging problems on exascale platforms.

Leopold Cambier

Overview

NVIDIA has released cuFFTMp, a multi-node, multi-process extension to cuFFT, designed to enhance the performance of Fast Fourier Transforms (FFTs) across exascale platforms. This article discusses the capabilities of cuFFTMp, its performance metrics, and its application in solving complex scientific problems, particularly in fluid dynamics.

What You'll Learn

1. How to implement cuFFTMp for distributed FFTs across multiple nodes
2. Why NVSHMEM enhances communication efficiency in cuFFTMp
3. How to transition existing applications from cuFFT to cuFFTMp

Prerequisites & Requirements

  • Understanding of Fast Fourier Transforms and CUDA programming
  • Familiarity with MPI and the NVIDIA HPC SDK (optional)

Key Questions Answered

What performance metrics does cuFFTMp achieve on the Selene cluster?
cuFFTMp can reach over 1.8 PFlop/s, utilizing more than 70% of the peak machine bandwidth while transforming over 4 trillion complex data points using 4096 A100 80GB GPUs on the Selene cluster.
How does cuFFTMp improve upon traditional FFT implementations?
cuFFTMp utilizes NVSHMEM for efficient communication, reducing synchronization costs and allowing for kernel-initiated communications, which enhances performance compared to traditional MPI implementations.
What are the strong scaling results for cuFFTMp?
With an unchanged problem size, cuFFTMp reduces the single-precision time from 78 ms with 8 GPUs (1 node) to 4 ms with 2048 GPUs (256 nodes), demonstrating effective strong scaling.
What is the significance of the Fluid3D application in turbulent flow simulation?
Fluid3D, which applies direct numerical simulation of the Navier-Stokes equations, benefits significantly from cuFFTMp, allowing for faster iterations and enabling simulations of high Reynolds number flows in reasonable timeframes.
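
The strong-scaling figures quoted above can be turned into a speedup and a parallel efficiency with a few lines of arithmetic. The sketch below uses only the numbers stated in this summary (78 ms on 8 GPUs, 4 ms on 2048 GPUs); the variable names are illustrative, and the efficiency is measured against the single-node run rather than against any ideal baseline the original authors may have used.

```python
# Strong-scaling figures quoted above (single precision, fixed problem size).
t_base_ms, gpus_base = 78.0, 8         # 1 node
t_scaled_ms, gpus_scaled = 4.0, 2048   # 256 nodes

speedup = t_base_ms / t_scaled_ms          # how much faster the run became
resource_ratio = gpus_scaled / gpus_base   # how many times more GPUs were used
efficiency = speedup / resource_ratio      # fraction of ideal (linear) speedup

print(f"speedup: {speedup:.1f}x on {resource_ratio:.0f}x the GPUs")
print(f"parallel efficiency vs. the 1-node run: {efficiency:.1%}")
```

Treat the efficiency value as a rough indicator only: the 8-GPU baseline stays inside one node, while the 2048-GPU run must also communicate between nodes, so the two runs are not limited by the same interconnect.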

Key Statistics & Figures

Peak performance of cuFFTMp
1.8 PFlop/s
Achieved using 4096 A100 80GB GPUs on the Selene cluster.
Strong scaling time reduction
From 78 ms to 4 ms
Single-precision time with 8 GPUs (1 node) versus 2048 GPUs (256 nodes), for an unchanged problem size.
Memory bandwidth for GPUs
2000 GB/s/GPU
Figure used for bidirectional global memory bandwidth.
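
To get a feel for what 1.8 PFlop/s means for a transform with over 4 trillion points, the following back-of-the-envelope sketch estimates the flop count and the implied time per transform. Two things here are assumptions, not statements from the source: the grid size of 16384^3 (picked only because it is a power-of-two cube with just over 4 trillion points) and the conventional 5 N log2(N) flop estimate for a complex FFT of N points.

```python
import math

# Assumption: a 16384^3 complex grid, chosen only because it is a power-of-two
# cube with "over 4 trillion" points; the source does not state the exact size.
n_per_dim = 16384
n_points = n_per_dim ** 3

# Assumption: the conventional 5 * N * log2(N) flop estimate for a complex FFT.
flops = 5 * n_points * math.log2(n_points)

rate = 1.8e15  # 1.8 PFlop/s, the figure quoted above
seconds = flops / rate

print(f"points: {n_points:.3e}")
print(f"estimated flops: {flops:.3e}")
print(f"implied time per transform: {seconds:.3f} s")
```

Under these assumptions a single transform of that size would take roughly half a second. The real benchmark may have used a different size or flop convention, so the result is an order-of-magnitude check, not a reproduction of the published measurement.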

Technologies & Tools

Library
cuFFT
Used for Fast Fourier Transforms in NVIDIA's GPU computing.
Communication Library
NVSHMEM
Facilitates efficient communication between GPUs in a distributed environment.
Programming Model
CUDA
Used for parallel computing on NVIDIA GPUs.
Communication Protocol
MPI
Used for managing data distributions in multi-node applications.

Key Actionable Insights

1. Leverage cuFFTMp to enhance the performance of FFTs in your scientific applications.
   By utilizing cuFFTMp, you can achieve significant performance improvements, particularly in applications requiring high computational power, such as fluid dynamics simulations.
2. Consider transitioning existing applications to cuFFTMp for better scalability.
   The transition process is straightforward, as cuFFTMp extends the existing cuFFT library, allowing for easier adaptation of current multi-GPU applications.
3. Utilize NVSHMEM to optimize communication in distributed systems.
   By adopting NVSHMEM, you can reduce communication overhead and improve the efficiency of data access across GPUs, which is crucial for high-performance computing.

Common Pitfalls

1. Underestimating the importance of communication efficiency in distributed FFTs.
   Communication overhead is easy to overlook, yet it can significantly affect performance. Using NVSHMEM can mitigate this issue.
2. Failing to properly initialize MPI before using cuFFTMp.
   A cuFFTMp plan is typically attached to an MPI communicator, so MPI must be initialized first; skipping this step can lead to runtime errors.
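
The communication overhead named in the first pitfall comes from the global data exchange that sits between the local transforms of a distributed FFT. The sketch below simulates this in a single process with plain Python: a small 2D grid is split into row slabs across hypothetical ranks, each rank transforms its own rows, an all-to-all transpose redistributes the data, and each rank transforms its new rows. The slab layout, the function names, and the naive DFT are all illustrative assumptions; this is not cuFFTMp's API or its actual implementation.

```python
import cmath

def dft(x):
    """Naive O(n^2) 1D DFT; stands in for a local FFT on one GPU."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def all_to_all_transpose(slabs):
    """Simulated global transpose: in a real run every rank would exchange a
    block with every other rank. slabs[r] holds the rows owned by rank r."""
    rows = [row for slab in slabs for row in slab]
    cols = [list(col) for col in zip(*rows)]
    per_rank = len(cols) // len(slabs)
    return [cols[r * per_rank:(r + 1) * per_rank] for r in range(len(slabs))]

def distributed_dft2(grid, ranks):
    """2D DFT of an n x n grid split into row slabs (n divisible by ranks)."""
    per_rank = len(grid) // ranks
    slabs = [grid[r * per_rank:(r + 1) * per_rank] for r in range(ranks)]
    slabs = [[dft(row) for row in slab] for slab in slabs]  # local, no comms
    slabs = all_to_all_transpose(slabs)                     # the costly step
    slabs = [[dft(col) for col in slab] for slab in slabs]  # local, no comms
    return slabs  # output is distributed by columns: entry [k2][k1]

def direct_dft2(grid, k1, k2):
    """Reference value of one 2D DFT coefficient, straight from the definition."""
    n = len(grid)
    return sum(grid[i][j] * cmath.exp(-2j * cmath.pi * (i * k1 + j * k2) / n)
               for i in range(n) for j in range(n))

grid = [[complex(4 * i + j) for j in range(4)] for i in range(4)]
out = [col for slab in distributed_dft2(grid, ranks=2) for col in slab]
max_err = max(abs(out[k2][k1] - direct_dft2(grid, k1, k2))
              for k1 in range(4) for k2 in range(4))
print(f"max difference vs. direct 2D DFT: {max_err:.2e}")
```

Only the transpose step touches data owned by other ranks, which is why the speed of that exchange tends to dominate at scale and why an efficient communication layer matters.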

Related Concepts

Fast Fourier Transforms (FFTs)
CUDA Programming
High-Performance Computing (HPC)
Fluid Dynamics Simulations