OCP Summit 2025: The Open Future of Networking Hardware for AI

At the Open Compute Project (OCP) Summit 2025, we’re sharing details about the direction of next-generation network fabrics for our AI training clusters. We’ve expanded our network hardware portfolio a…

Jasmeet Bagga
8 min read · advanced

Overview

The article discusses the advancements presented at the Open Compute Project (OCP) Summit 2025, focusing on the evolution of networking hardware for AI applications. Key highlights include the introduction of new disaggregated network platforms, the evolution of Disaggregated Scheduled Fabric (DSF), and the launch of the Ethernet for Scale-Up Networking (ESUN) initiative.

What You'll Learn

1. How to implement Disaggregated Scheduled Fabric (DSF) for AI clusters
2. Why Non-Scheduled Fabric (NSF) architecture is essential for large AI clusters
3. How to leverage Ethernet for Scale-Up Networking (ESUN) in AI applications
4. When to use 2x400G FR4 LITE optics in data center environments

Prerequisites & Requirements

  • Understanding of networking concepts and AI infrastructure
  • Experience with data center operations and network management (optional)

Key Questions Answered

What is the purpose of the Disaggregated Scheduled Fabric (DSF)?
The Disaggregated Scheduled Fabric (DSF) is designed to support scale-out interconnect for large AI clusters that can span entire data center buildings. It utilizes a VOQ-based system powered by the OCP-SAI standard and FBOSS, enabling flexible and efficient networking for AI workloads.
What are the key features of Non-Scheduled Fabrics (NSF)?
Non-Scheduled Fabrics (NSF) are based on shallow-buffer OCP Ethernet switches, delivering low round-trip latency and supporting adaptive routing for effective load-balancing. This architecture serves as a foundational building block for large-scale AI clusters like Prometheus.
How does Meta contribute to the Ethernet for Scale-Up Networking (ESUN) initiative?
Meta is a founding participant in the ESUN initiative, collaborating with industry leaders to advance Ethernet technology for AI applications. Its contributions include defining technical requirements and ensuring interoperability with open standards, which helps build robust networking solutions for AI clusters.
What advancements were made in optics for data centers?
Meta introduced 2x400G FR4 LITE optics optimized for intra-data center use cases, supporting fiber links up to 500 meters. This initiative aims to reduce costs while maintaining performance, alongside the introduction of 400G DR4 OSFP-RHS optics for AI host-side NIC connectivity.
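
To make the "VOQ-based system" in the DSF answer above more concrete, here is a minimal toy sketch of virtual output queueing with credit-style scheduling. It is an illustration under stated assumptions, not FBOSS or OCP-SAI code: the names `VoqIngress`, `send_with_credits`, and `schedule_egress` are hypothetical, and a real fabric schedules cells in hardware rather than Python objects. The idea shown is that each ingress keeps a separate queue per egress port and only sends when the egress side grants a credit, so a backlog toward one congested port does not block traffic for another.

```python
from collections import deque


class VoqIngress:
    """Toy ingress device holding one virtual output queue (VOQ) per egress port."""

    def __init__(self, egress_ports):
        self.voqs = {port: deque() for port in egress_ports}

    def enqueue(self, egress_port, packet):
        # Packets wait at the ingress, sorted by their destination port.
        self.voqs[egress_port].append(packet)

    def send_with_credits(self, egress_port, credits):
        """Dequeue at most `credits` packets destined for `egress_port`."""
        sent = []
        queue = self.voqs[egress_port]
        while credits > 0 and queue:
            sent.append(queue.popleft())
            credits -= 1
        return sent


def schedule_egress(ingresses, egress_port, capacity):
    """Grant up to `capacity` credits, one at a time, round-robin across ingresses."""
    delivered = []
    remaining = capacity
    while remaining > 0:
        progress = False
        for ingress in ingresses:
            if remaining == 0:
                break
            sent = ingress.send_with_credits(egress_port, 1)
            if sent:
                delivered.extend(sent)
                remaining -= 1
                progress = True
        if not progress:  # no ingress has backlog for this port
            break
    return delivered


if __name__ == "__main__":
    a = VoqIngress(["e0", "e1"])
    b = VoqIngress(["e0", "e1"])
    for i in range(3):
        a.enqueue("e0", f"a{i}")
    a.enqueue("e1", "a-e1")
    for i in range(2):
        b.enqueue("e0", f"b{i}")
    print(schedule_egress([a, b], "e0", capacity=4))
    print(schedule_egress([a, b], "e1", capacity=4))
```

In this toy run the egress port `e0` is oversubscribed, yet the packet queued for `e1` is still delivered on its own schedule, which is the head-of-line-blocking property that per-destination queues are meant to provide.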

Key Statistics & Figures

Number of XPUs supported by the dual-stage DSF architecture
18,432
This capacity allows for the construction of extensive AI clusters that can span multiple regions.
Throughput of the Minipack3N switch
51 Tbps
This switch is part of the new portfolio designed to support next-generation AI fabrics.
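
As a rough feel for what the switch throughput figure above means in ports, the sketch below divides switch capacity by port speed. It assumes the quoted 51 Tbps is the rounded form of a nominal 51.2 Tbps, and it ignores any real front-panel or breakout constraints; the constant and function names are hypothetical.

```python
# Assumption: "51 Tbps" is shorthand for a nominal 51.2 Tbps (51,200 Gbps).
SWITCH_CAPACITY_GBPS = 51_200


def max_ports(capacity_gbps, port_speed_gbps):
    """Upper bound on full-rate ports a switch of this capacity can serve."""
    if port_speed_gbps <= 0:
        raise ValueError("port speed must be positive")
    return capacity_gbps // port_speed_gbps


if __name__ == "__main__":
    for speed in (400, 800):
        print(f"{speed}G ports: {max_ports(SWITCH_CAPACITY_GBPS, speed)}")
```

Under that assumption the capacity works out to at most 64 ports at 800G or 128 at 400G. A 2x400G optical module carries 800G in total, so it would occupy one 800G port in this arithmetic.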

Technologies & Tools

Network Architecture
Disaggregated Scheduled Fabric (DSF)
Supports scale-out interconnect for large AI clusters.
Network Architecture
Non-Scheduled Fabric (NSF)
Delivers low latency and adaptive routing for AI workloads.
Network Operating System
FBOSS
Runs on OCP switches and supports large-scale network management.
Network Standard
OCP-SAI
Provides a standard for onboarding new network fabrics and switch hardware.
Optics
2x400G FR4 LITE Optics
Optimized for intra-data center applications to reduce costs.
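
The question of when to use 2x400G FR4 LITE optics largely comes down to link length, so a small reach check can illustrate it. The sketch below is hypothetical: the 500 m limit comes from the summary above, while the 2 km figure for a standard FR4 module is an assumption based on the usual FR4 reach class, and "shortest sufficient reach" is used only as a stand-in for "lowest cost".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Optic:
    name: str
    max_reach_m: int


FR4_LITE = Optic("2x400G FR4 LITE", 500)   # reach stated in the summary above
FR4_STANDARD = Optic("2x400G FR4", 2000)   # assumption: typical 2 km FR4 reach


def pick_optic(link_length_m, candidates=(FR4_LITE, FR4_STANDARD)):
    """Return the shortest-reach optic that still covers the link, or None."""
    fitting = [o for o in candidates if o.max_reach_m >= link_length_m]
    if not fitting:
        return None
    return min(fitting, key=lambda o: o.max_reach_m)


if __name__ == "__main__":
    for length in (300, 1200, 3000):
        choice = pick_optic(length)
        print(length, "m ->", choice.name if choice else "no fitting optic")
```

Links inside a single data center building would mostly fall under the 500 m limit in this model, which is the intra-data center case the LITE variant targets.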

Key Actionable Insights

1. Implementing Disaggregated Scheduled Fabric (DSF) can significantly enhance the scalability of your AI clusters.
By adopting DSF, organizations can interconnect a larger number of GPUs, enabling them to meet the increasing demands of AI workloads effectively.
2. Consider utilizing Non-Scheduled Fabrics (NSF) for applications requiring low latency and high performance.
NSF's adaptive routing capabilities ensure optimal load balancing, which is crucial for maintaining performance in large-scale AI environments.
3. Engage with the Ethernet for Scale-Up Networking (ESUN) initiative to stay at the forefront of AI networking solutions.
Participating in ESUN allows organizations to collaborate on best practices and standards that will shape the future of networking in AI applications.
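
The load-balancing benefit attributed to NSF's adaptive routing in the second insight can be sketched with a toy comparison against static hashing. This is a simplified model under stated assumptions, not how any particular switch ASIC implements it: real adaptive routing reacts to live congestion signals per packet or flowlet, whereas here each flow is simply placed on the currently least-loaded uplink. All function names are hypothetical.

```python
import zlib


def static_hash_uplink(flow_id, num_uplinks):
    """ECMP-style choice: the uplink depends only on a hash of the flow ID."""
    return zlib.crc32(flow_id.encode()) % num_uplinks


def adaptive_uplink(loads):
    """Adaptive choice: pick the uplink with the least load right now."""
    return min(range(len(loads)), key=loads.__getitem__)


def place_flows(flows, num_uplinks, adaptive):
    """Place (flow_id, size) pairs on uplinks and return per-uplink load."""
    loads = [0] * num_uplinks
    for flow_id, size in flows:
        if adaptive:
            idx = adaptive_uplink(loads)
        else:
            idx = static_hash_uplink(flow_id, num_uplinks)
        loads[idx] += size
    return loads


if __name__ == "__main__":
    flows = [(f"f{i}", 10) for i in range(8)]
    print("static  :", place_flows(flows, 4, adaptive=False))
    print("adaptive:", place_flows(flows, 4, adaptive=True))
```

With eight equal flows over four uplinks, the adaptive placement always ends perfectly even, while the hashed placement can leave some uplinks carrying more than others. AI training traffic tends to consist of a small number of large flows, which is where hash collisions hurt most.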

Common Pitfalls

1. Failing to adopt open hardware standards can lead to vendor lock-in and reduced flexibility.
Organizations that do not embrace open standards may find themselves constrained by proprietary technologies, limiting their ability to innovate and scale effectively.

Related Concepts

Open Compute Project (OCP)
AI/ML Infrastructure
Networking Solutions for Data Centers
Disaggregation in Data Center Technologies