Disaggregated Scheduled Fabric: Scaling Meta’s AI Journey

Disaggregated Scheduled Fabric (DSF) is Meta’s next-generation network fabric technology for AI training networks that addresses the challenges of existing Clos-based networks. We’re sharing the cha…

Ron He
15 min read, intermediate

Overview

The article discusses Meta's Disaggregated Scheduled Fabric (DSF), a next-generation network fabric technology designed to enhance AI training networks by overcoming the limitations of traditional Clos-based architectures. It details the challenges faced with existing IP fabrics, the innovative architecture of DSF, and its implications for scaling AI workloads.

What You'll Learn

1. How to implement Disaggregated Scheduled Fabric for AI training networks
2. Why packet spraying improves load balancing in network fabrics
3. When to use Input Balanced Mode to manage traffic during link failures

Prerequisites & Requirements

  • Understanding of network fabric architectures and AI workloads
  • Experience with high-performance networking technologies (optional)

Key Questions Answered

What challenges does Disaggregated Scheduled Fabric address?
Disaggregated Scheduled Fabric addresses elephant flows, low entropy, and suboptimal fabric utilization, all of which arise when traditional IP fabrics carry AI workloads. These problems cause congestion and inefficient bandwidth usage, which DSF mitigates with packet spraying and credit-based congestion control.
How does DSF improve network performance for AI applications?
DSF enhances network performance by utilizing a two-domain architecture that separates Ethernet and fabric domains, allowing for packet spraying and credit-based congestion control. This design ensures high-speed traffic distribution and optimal load balancing across available paths, which is crucial for handling the demands of AI workloads.
What is the role of Input Balanced Mode in DSF?
Input Balanced Mode keeps each device's input bandwidth at or below its output bandwidth, which prevents oversubscription when a remote link fails. It adjusts traffic flow dynamically to avoid congestion and maintain performance across the network.
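
The Input Balanced Mode rule described above (input bandwidth never exceeds output bandwidth) can be pictured with a small model. This is only a sketch under stated assumptions: the function name `input_balanced_caps`, the proportional scaling policy, and the per-link Gb/s lists are hypothetical illustrations, not the mechanism the DSF devices actually use.

```python
def input_balanced_caps(input_links_gbps, output_links_gbps, failed_outputs):
    """Cap each input link so that total input <= surviving output bandwidth.

    Hypothetical model: when output links fail, every input link is scaled
    down by the same factor. Real devices may rebalance differently.
    """
    # Sum the bandwidth of output links that are still up.
    surviving = sum(
        bw for idx, bw in enumerate(output_links_gbps) if idx not in failed_outputs
    )
    total_in = sum(input_links_gbps)

    # Already balanced: input does not exceed output, nothing to adjust.
    if total_in <= surviving:
        return list(input_links_gbps)

    # Oversubscribed: shrink every input link proportionally.
    scale = surviving / total_in
    return [bw * scale for bw in input_links_gbps]


if __name__ == "__main__":
    # Four 800G inputs and four 800G outputs, with output link 0 failed.
    print(input_balanced_caps([800] * 4, [800] * 4, failed_outputs={0}))
```

In this model, losing one of four equal output links reduces each input cap to three quarters of its original rate, so the device never accepts more traffic than it can forward.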

Key Statistics & Figures

Interconnected GPU scale
18K x 800G GPUs
This scale is achieved through the DSF Dual-Stage Fabric architecture, enabling large-scale AI applications.

Technologies & Tools

Network Operating System
FBOSS
FBOSS is used to control the distributed system of Disaggregated Scheduled Fabric, enabling real-time state synchronization across nodes.
Network Standard
OCP-SAI
OCP-SAI is the open standard that powers the VOQ-based system of DSF, facilitating modular architecture and optimized load balancing.
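
To make the VOQ-based, credit-controlled idea more concrete, here is a toy simulation of one egress port handing out credits to ingress virtual output queues (VOQs). Everything in it is an assumption for illustration: the function `run_credit_scheduler`, the round-robin grant order, and the one-credit-per-packet granularity are hypothetical and are not taken from FBOSS or the OCP-SAI API.

```python
from collections import deque


def run_credit_scheduler(voq_depths, credits_per_tick, ticks):
    """Simulate one egress port granting credits to ingress VOQs round-robin.

    voq_depths: packets waiting in each ingress VOQ for this egress port.
    credits_per_tick: how many packets the egress port can accept per tick.
    Returns (delivered_per_tick, remaining_depths).
    """
    depths = list(voq_depths)
    order = deque(range(len(depths)))  # round-robin order of ingress VOQs
    delivered_per_tick = []

    for _ in range(ticks):
        delivered = 0
        # The egress port never grants more credits than it can absorb.
        for _ in range(credits_per_tick):
            # Look for the next non-empty VOQ, visiting each at most once.
            for _ in range(len(order)):
                ingress = order[0]
                order.rotate(-1)
                if depths[ingress] > 0:
                    depths[ingress] -= 1  # one credit releases one packet
                    delivered += 1
                    break
        delivered_per_tick.append(delivered)

    return delivered_per_tick, depths


if __name__ == "__main__":
    per_tick, remaining = run_credit_scheduler([5, 3], credits_per_tick=2, ticks=5)
    print(per_tick, remaining)
```

Because ingress devices send only when they hold a credit, the egress port in this model is never offered more than `credits_per_tick` packets in a tick, which is the property that keeps congestion out of the fabric.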

Key Actionable Insights

1. Implementing Disaggregated Scheduled Fabric can significantly enhance the scalability of AI training networks, allowing for the interconnection of thousands of GPUs.
   This is particularly beneficial in environments where high-performance and low-latency connections are critical for training large AI models.
2. Utilizing packet spraying in DSF can lead to near-optimal load balancing across network paths, improving overall bandwidth utilization.
   This method is essential for managing the heavy traffic patterns typical of AI workloads, ensuring efficient data flow and reducing congestion.
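
The load-balancing benefit of packet spraying can be illustrated by comparing it with per-flow hashing on a handful of large flows. This is a simplified sketch, not DSF's implementation: the flow names, the CRC32-based hash, and the round-robin spraying are assumptions chosen for illustration, and the model ignores packet reordering at the receiver.

```python
import zlib
from itertools import cycle


def ecmp_loads(flows, num_links):
    """Per-flow hashing: every packet of a flow follows the same link."""
    loads = [0] * num_links
    for flow_id, packets in flows.items():
        link = zlib.crc32(flow_id.encode()) % num_links
        loads[link] += packets
    return loads


def spray_loads(flows, num_links):
    """Packet spraying: packets are spread over all links in round-robin order."""
    loads = [0] * num_links
    next_link = cycle(range(num_links))
    for packets in flows.values():
        for _ in range(packets):
            loads[next(next_link)] += 1
    return loads


if __name__ == "__main__":
    # Four elephant flows (low entropy) competing for eight equal links.
    flows = {f"gpu{i}->gpu{i + 8}": 1000 for i in range(4)}
    print("per-flow hashing:", ecmp_loads(flows, 8))
    print("packet spraying: ", spray_loads(flows, 8))
```

With only four flows and eight links, per-flow hashing leaves at least four links idle and places a full flow (or several) on each busy link, while spraying spreads the same packets evenly across all eight.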

Common Pitfalls

1. Relying solely on traditional IP fabric can lead to performance bottlenecks due to elephant flows and low entropy traffic patterns.
   These issues can cause congestion and inefficient bandwidth usage, which are detrimental to the performance of AI training workloads.

Related Concepts

Network Fabric Architectures
AI Workload Management
Congestion Control Techniques