Networking Reliability and Observability at Scale with NCCL 2.24

The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multinode (MGMN) communication primitives optimized for NVIDIA GPUs and networking.

Ben Williams

Overview

This article covers the features and improvements introduced in NCCL 2.24, with a focus on networking reliability and observability at scale for multi-GPU and multinode communication. Highlights include the new RAS subsystem, user buffer registration for multinode collectives, and support for FP8 data types.

What You'll Learn

1. How to use the RAS subsystem to diagnose application crashes and hangs in NCCL jobs
2. Why user buffer registration matters for optimizing multinode collective operations
3. How NIC Fusion handles systems with multiple NICs per GPU
4. When to use optional receive completions to reduce overhead in NCCL
5. How the stricter enforcement of NCCL_ALGO and NCCL_PROTO affects performance tuning

Prerequisites & Requirements

  • Understanding of multi-GPU and multinode communication concepts
  • Familiarity with NCCL and its APIs (optional)

Key Questions Answered

What is the purpose of the RAS subsystem in NCCL 2.24?
The RAS subsystem in NCCL 2.24 helps diagnose application crashes and hangs by providing a low-overhead infrastructure to monitor the health of NCCL jobs. It establishes TCP/IP connections among NCCL processes to exchange keep-alive messages, allowing users to identify unresponsive nodes or processes during execution.
How does user buffer registration improve performance in NCCL?
User buffer registration allows NCCL to optimize data transfers by enabling direct access to buffers, reducing overhead associated with control flow and buffering. This leads to better performance, especially for operations like AllReduce and AllGather, as it allows the use of special hardware optimizations.
What are the benefits of NIC Fusion in NCCL 2.24?
NIC Fusion allows NCCL to handle systems with multiple NICs per GPU more effectively by merging NICs into logical devices. This prevents crashes and optimizes resource usage, ensuring better performance and load balancing across available network interfaces.
When should optional receive completions be used in NCCL?
Optional receive completions should be used when leveraging LL or LL128 protocols, as these protocols allow for inherent synchronization that reduces the need for explicit polling on network receive completions. This can lower overhead and improve performance in large-scale applications.
What changes were made to NCCL_ALGO and NCCL_PROTO enforcement in version 2.24?
In NCCL 2.24, the enforcement of NCCL_ALGO and NCCL_PROTO has been made stricter, meaning that users will receive an error if they specify unsupported algorithms or protocols, rather than silently falling back to defaults. This change aims to reduce confusion during benchmarking and tuning.
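
The keep-alive idea described in the RAS answer above can be illustrated with a toy model. This is a minimal sketch of the general technique only, not NCCL's implementation: the `PeerMonitor` class, its method names, and the timeout value are all hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Toy liveness tracker (hypothetical, not part of NCCL)."""

    timeout_s: float                               # silence longer than this marks a peer unresponsive
    last_seen: dict = field(default_factory=dict)  # peer name -> time of last keep-alive

    def record_keepalive(self, peer: str, now: float) -> None:
        # Remember when we last heard from this peer.
        self.last_seen[peer] = now

    def unresponsive(self, now: float) -> list:
        # A peer is flagged once it has been silent for more than timeout_s.
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

With a 5-second timeout, a peer last heard from at t=0 is reported at t=6, while a peer heard from at t=4 is not. A real monitor would also have to tolerate clock skew and transient network delays, which this sketch ignores.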

Key Statistics & Figures

Peak bandwidth improvement for AllGather and Broadcast operations with user buffer registration
5%
Observed for runs with eight GPUs per node.

Technologies & Tools

Software Library
NVIDIA Collective Communications Library (NCCL)
Used for multi-GPU and multinode communication in deep learning training.
Data Type
FP8
Supported for native reductions in NCCL 2.24.

Key Actionable Insights

1. Use the RAS subsystem to monitor the health of your NCCL jobs, especially in large-scale deployments.
   Proactive monitoring helps identify unresponsive nodes or processes early, which shortens time to resolution.
2. Register user buffers with ncclCommRegister to unlock optimizations and improve data transfer efficiency.
   Registration lets NCCL take advantage of hardware capabilities such as NVSwitch, which can significantly improve performance in multi-GPU setups.
3. Rely on NIC Fusion on systems with multiple NICs per GPU to avoid crashes and optimize resource usage.
   This matters most in high-performance computing environments where network reliability is critical.
4. Consider optional receive completions to reduce overhead when your application uses the LL or LL128 protocols.
   This can improve performance, especially at large scale.
5. Be aware of the stricter enforcement of NCCL_ALGO and NCCL_PROTO in version 2.24 to avoid runtime errors.
   Unsupported values now produce an error instead of a silent fallback, which makes benchmarking and tuning results more predictable.
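
The idea of merging several physical NICs into one logical device, as NIC Fusion does, can be pictured with a small grouping sketch. This is an illustration under stated assumptions, not NCCL's algorithm: the function `fuse_nics`, the NIC names, and the `topology_key` values are hypothetical stand-ins for whatever topology information decides which NICs belong together.

```python
from collections import defaultdict


def fuse_nics(nics):
    """Group physical NICs into logical devices by a shared topology key.

    `nics` is a list of (nic_name, topology_key) pairs. Both fields are
    hypothetical; NCCL derives its grouping from the real system topology.
    """
    groups = defaultdict(list)
    for name, key in nics:
        groups[key].append(name)
    # One logical device per key, with member NICs in a stable order.
    return {key: sorted(members) for key, members in groups.items()}
```

For example, two NICs that share the key `"gpu0"` end up in one logical device, while a NIC keyed to `"gpu1"` forms its own. Treating each group as a single device is what lets traffic be balanced across its member NICs.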

Common Pitfalls

1. Failing to register user buffers can lead to higher overhead and suboptimal performance.
   Without registration, NCCL has to manage more control flow and buffering, which can consume additional GPU resources and slow down data transfers.
2. Ignoring the stricter enforcement of NCCL_ALGO and NCCL_PROTO can result in runtime errors.
   Jobs that request unsupported algorithms or protocols now fail instead of silently falling back to defaults.
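
One way to catch the second pitfall before a job starts is a pre-flight check in the launcher. The sketch below is an assumption-laden example rather than anything NCCL provides: `check_nccl_tuning` is a hypothetical helper, and the allow-lists reflect names commonly listed in NCCL's documentation, so verify them against the version you run. It only understands the simple comma-separated form with an optional leading `^`, and skips any value containing `:` instead of guessing at extended syntax.

```python
# Assumed allow-lists for illustration; accepted values depend on the NCCL version and build.
KNOWN_ALGOS = {"ring", "tree", "collnetdirect", "collnetchain", "nvls", "nvlstree", "pat"}
KNOWN_PROTOS = {"ll", "ll128", "simple"}


def check_nccl_tuning(env):
    """Return a list of problems found in NCCL_ALGO / NCCL_PROTO (hypothetical helper)."""
    problems = []
    for var, known in (("NCCL_ALGO", KNOWN_ALGOS), ("NCCL_PROTO", KNOWN_PROTOS)):
        value = env.get(var)
        if value is None or ":" in value:
            # Unset, or an extended form this sketch does not try to parse.
            continue
        for token in value.split(","):
            # A leading '^' marks an exclusion; strip it before comparing names.
            name = token.strip().lstrip("^").lower()
            if name not in known:
                problems.append(f"{var}: unrecognized value {token.strip()!r}")
    return problems
```

Calling `check_nccl_tuning(os.environ)` from a launch script and aborting when the returned list is non-empty surfaces typos early. It cannot tell whether a recognized algorithm is actually supported for a given collective or topology; only NCCL itself can report that at run time.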

Related Concepts

Multi-GPU Communication
NVIDIA Magnum IO
Collective Communication Algorithms