Fusing Communication and Compute with New Device API and Copy Engine Collectives in NVIDIA NCCL 2.28

The latest release of the NVIDIA Collective Communications Library (NCCL) introduces a groundbreaking fusion of communication and computation for higher…

Sylvain Jeaugey
9 min read · advanced

Overview

The article discusses the release of NVIDIA Collective Communications Library (NCCL) 2.28, which introduces a fusion of communication and computation to enhance throughput, reduce latency, and maximize GPU utilization. Key features include GPU-initiated networking, device APIs for communication-compute fusion, copy-engine-based collectives, and improved developer experience through expanded APIs and tooling.

What You'll Learn

1. How to implement GPU-initiated networking in NCCL
2. Why copy engine-based collectives improve performance in multi-GPU systems
3. How to utilize the NCCL Inspector for profiling and observability

Prerequisites & Requirements

  • Understanding of CUDA programming and GPU architecture
  • Familiarity with NVIDIA NCCL and its APIs (optional)

Key Questions Answered

What are the new features introduced in NCCL 2.28?
NCCL 2.28 introduces GPU-initiated networking, device APIs for communication-compute fusion, and copy-engine-based collectives. These features enhance performance by reducing latency and maximizing GPU utilization across multi-GPU systems.
How does the NCCL device API enable direct kernel communication?
The NCCL device API lets CUDA kernels initiate data movement directly, so communication can be fused with compute inside the same kernel. It supports three operation modes: Load/Store Accessible (LSA), Multimem, and GPU-Initiated Networking (GIN). Fusing the two reduces synchronization overhead and increases throughput.
What benefits do copy engine-based collectives provide?
Copy engine-based collectives offload communication tasks from streaming multiprocessors (SMs) to dedicated hardware, achieving zero-SM operation. This reduces contention for compute resources and allows communication and computation to occur concurrently, improving overall application performance.
How can the NCCL Inspector assist developers?
The NCCL Inspector provides low-overhead profiling and observability for NCCL communication patterns. It tracks performance metrics and generates structured JSON output, enabling developers to analyze and debug collective operations effectively during distributed workloads.
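
Structured JSON output of this kind lends itself to simple post-processing. The following is a minimal sketch, assuming a hypothetical one-record-per-line layout with the fields `coll`, `msg_size_bytes`, and `exec_time_us`. These field names are illustrative only and are not taken from the NCCL documentation, so check the real Inspector schema before adapting it. The sketch groups records by collective and reports an aggregate bandwidth for each.

```python
import json
from collections import defaultdict

# Hypothetical records: the field names are assumptions for illustration,
# not the actual NCCL Inspector output schema.
SAMPLE_LINES = [
    '{"coll": "AllGather", "msg_size_bytes": 1000000000, "exec_time_us": 2000}',
    '{"coll": "AllGather", "msg_size_bytes": 1000000000, "exec_time_us": 4000}',
    '{"coll": "AllReduce", "msg_size_bytes": 500000000, "exec_time_us": 1000}',
]


def average_bandwidth_gbps(lines):
    """Return {collective: GB/s}, computed as total bytes over total time."""
    total_bytes = defaultdict(int)
    total_us = defaultdict(int)
    for line in lines:
        rec = json.loads(line)
        total_bytes[rec["coll"]] += rec["msg_size_bytes"]
        total_us[rec["coll"]] += rec["exec_time_us"]
    # 1 byte per microsecond equals 1e6 bytes/s, i.e. 1e-3 GB/s.
    return {c: total_bytes[c] / total_us[c] / 1e3 for c in total_bytes}


if __name__ == "__main__":
    for coll, gbps in sorted(average_bandwidth_gbps(SAMPLE_LINES).items()):
        print(f"{coll}: {gbps:.1f} GB/s")
```

Dividing total bytes by total time, rather than averaging per-record bandwidths, keeps a few small and slow operations from skewing the result.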

Key Statistics & Figures

Peak bandwidth for CE-based AllGather
780 GBps
This performance is achieved at larger message sizes, demonstrating a significant advantage over traditional SM-based implementations.
Bandwidth gain for CE Multicast over SM Symmetric
1.25×
This gain is observed at 4 GB message sizes, highlighting the efficiency of copy engine-based operations.
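
To put these figures in terms of wall-clock time, the small sketch below converts a bandwidth gain into the fraction of transfer time saved and estimates how long a message takes at a given bandwidth. The helper names are made up for this illustration. Pairing a 4 GB payload with the 780 GBps peak is an assumption made only to get a concrete number, since the summary does not say that both figures come from the same measurement.

```python
def time_saved_fraction(bandwidth_gain):
    """Fraction of transfer time saved when bandwidth improves by `bandwidth_gain`x."""
    if bandwidth_gain <= 0:
        raise ValueError("bandwidth_gain must be positive")
    # Time is inversely proportional to bandwidth for a fixed message size.
    return 1.0 - 1.0 / bandwidth_gain


def transfer_time_ms(message_bytes, bandwidth_gb_per_s):
    """Milliseconds needed to move `message_bytes` at `bandwidth_gb_per_s` (1 GB = 1e9 bytes)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth_gb_per_s must be positive")
    return message_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    print(f"time saved at 1.25x: {time_saved_fraction(1.25):.0%}")
    # Assumed pairing of message size and bandwidth, for illustration only.
    print(f"4 GB at 780 GB/s: {transfer_time_ms(4e9, 780):.2f} ms")
```

A 1.25× bandwidth gain therefore corresponds to roughly 20% less transfer time for the same payload, and a 4 GB message at 780 GB/s would take about 5.1 ms under these assumptions.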

Technologies & Tools

Library
NVIDIA Collective Communications Library (NCCL)
Used for optimizing communication in multi-GPU and multi-node systems.
Framework
CUDA
Provides the programming model for developing applications that utilize NCCL.

Key Actionable Insights

1. Leverage the NCCL device API to enhance your CUDA applications by integrating communication directly into kernels.
This integration can significantly reduce overhead and improve throughput, especially in applications that require high-performance data movement.
2. Utilize copy engine-based collectives to optimize resource allocation in multi-GPU setups.
By offloading communication tasks to copy engines, you can free up SM resources for computation, leading to better overall performance in distributed applications.
3. Implement the NCCL Inspector in your production workloads for continuous performance monitoring.
The Inspector's low overhead allows for real-time analysis of communication patterns, which can help identify bottlenecks and optimize performance without impacting application efficiency.

Common Pitfalls

1. Failing to utilize the NCCL Inspector can lead to missed performance insights during distributed workloads.
Without profiling, developers may overlook critical bottlenecks in communication patterns, which can hinder application performance.
2. Not taking advantage of copy engine-based collectives may result in resource contention between communication and computation.
This contention can degrade overall application performance, especially in high-demand multi-GPU environments.
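
The contention described in the second pitfall can be illustrated with a toy cost model. This is a sketch under simplifying assumptions, not a description of how NCCL schedules work: it assumes communication and compute fully overlap, that compute time scales linearly with the number of SMs left available, and that a copy-engine collective uses no SMs at all. The SM counts and timings are hypothetical.

```python
def step_time_sm_collective(compute_ms, comm_ms, total_sms, comm_sms):
    """Overlapped step time when the collective occupies `comm_sms` SMs.

    Assumption: compute slows down in proportion to the SMs it loses.
    """
    if not 0 <= comm_sms < total_sms:
        raise ValueError("comm_sms must be in [0, total_sms)")
    slowed_compute_ms = compute_ms * total_sms / (total_sms - comm_sms)
    return max(slowed_compute_ms, comm_ms)


def step_time_ce_collective(compute_ms, comm_ms):
    """Overlapped step time when copy engines move the data (zero SMs used)."""
    return max(compute_ms, comm_ms)


if __name__ == "__main__":
    # Hypothetical workload: 10 ms of compute, 6 ms of communication,
    # 100 SMs in total, 20 of them taken by an SM-based collective.
    sm_time = step_time_sm_collective(10.0, 6.0, total_sms=100, comm_sms=20)
    ce_time = step_time_ce_collective(10.0, 6.0)
    print(f"SM-based collective: {sm_time:.1f} ms per step")
    print(f"CE-based collective: {ce_time:.1f} ms per step")
```

In this model the SM-based step takes 12.5 ms, because compute is squeezed onto 80 SMs, while the copy-engine step stays at 10 ms. Real workloads will differ, which is why measuring with a profiler matters more than the model itself.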

Related Concepts

CUDA Programming
Multi-GPU Architecture
Performance Optimization Techniques
Profiling Tools For Distributed Systems