Watch Meta’s engineers discuss QUIC and TCP innovations for our network

With more than 75 percent of our internet traffic set to use QUIC and HTTP/3 together, QUIC is slowly moving to become the de facto protocol used for internet communication at Meta. For Meta’s data…

Tanuja Ingale

Overview

The article discusses innovations in QUIC and TCP protocols at Meta, highlighting how these advancements improve network performance, efficiency, and reliability across their data centers. It features insights from engineers who presented their work at the Networking @Scale 2022 conference.

What You'll Learn

1. How to implement direct server return (DSR) using QUIC at the CDN layer
2. How to reduce startup delays in high-BDP links using QUIC Jump Start
3. How to handle sustained congestion in data centers using DCTCP
4. How to utilize a BPF-based platform for network tuning at scale
5. Why a host-based traffic admission system is essential for WAN resource management

Key Questions Answered

What is the purpose of QUIC in Meta's network?
QUIC is being adopted as the de facto protocol for internet communication at Meta, enhancing performance and reliability across various network layers. It allows for improved congestion management and platform extensibility, particularly in the context of their CDN and data center layers.
How does QUIC Jump Start reduce transfer times?
QUIC Jump Start reduces transfer times by caching congestion control state, allowing new connections to quickly utilize available bandwidth without the typical startup delay. This is particularly beneficial for small transfers that would otherwise spend significant time probing for bandwidth.
What challenges does Meta face with data center congestion?
Meta's data center network faces challenges related to increased user demand, necessitating solutions that ensure high reliability and performance. The engineers discussed using DCTCP and receiver window tuning to manage sustained congestion and bursts effectively.
What is NetEdit and how does it help Meta?
NetEdit is a BPF-based network feature platform designed for fine-grained tuning across millions of servers. It allows for extensive monitoring and observability, ensuring that large-scale network changes can be made without disrupting production traffic.
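
The Jump Start answer above describes caching congestion control state so that a new connection does not have to probe for bandwidth from scratch. The following is a minimal sketch of that idea, not Meta's implementation. The names `CongestionStateCache`, `initial_cwnd`, and `round_trips_to_send`, the 10-packet default window, the 5-minute expiry, and the choice to halve the cached window are all assumptions made for illustration.

```python
import time
from typing import Dict, Optional, Tuple

# Assumption: default initial window of 10 packets of 1200 bytes each.
DEFAULT_INIT_CWND_BYTES = 10 * 1200


class CongestionStateCache:
    """Hypothetical per-peer cache of the last observed congestion window."""

    def __init__(self, max_age_s: float = 300.0) -> None:
        self.max_age_s = max_age_s  # assumption: entries expire after 5 minutes
        self._entries: Dict[str, Tuple[int, float]] = {}  # peer -> (cwnd, saved_at)

    def save(self, peer: str, cwnd_bytes: int, now: Optional[float] = None) -> None:
        """Record the window a connection had reached when it finished."""
        now = time.monotonic() if now is None else now
        self._entries[peer] = (cwnd_bytes, now)

    def initial_cwnd(self, peer: str, now: Optional[float] = None) -> int:
        """Return a starting window for a new connection to `peer`."""
        now = time.monotonic() if now is None else now
        entry = self._entries.get(peer)
        if entry is None or now - entry[1] > self.max_age_s:
            return DEFAULT_INIT_CWND_BYTES  # no usable state: start from the default
        cached_cwnd, _ = entry
        # Assumption: start at half the cached window as a safety margin.
        return max(DEFAULT_INIT_CWND_BYTES, cached_cwnd // 2)


def round_trips_to_send(size_bytes: int, init_cwnd_bytes: int) -> int:
    """Round trips for a transfer if the window doubles every RTT (idealized, no loss)."""
    sent, cwnd, rtts = 0, init_cwnd_bytes, 0
    while sent < size_bytes:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts
```

Under these assumptions, a 1 MB transfer that starts from the default window takes `round_trips_to_send(1_000_000, DEFAULT_INIT_CWND_BYTES)`, which is 7 round trips, while the same transfer seeded with a 2 MB window from the cache completes in 1. On a high-BDP link each round trip is expensive, which is why small transfers benefit most.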

Technologies & Tools

QUIC (protocol): Used to enhance network performance and reduce transfer times in Meta's backbone network.
TCP (protocol): Remains the primary transport protocol supporting thousands of services in Meta's data center network.
DCTCP (protocol): Used to handle sustained congestion in the data center network.
BPF (tool): Used in the development of NetEdit for network tuning and monitoring.
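
DCTCP, listed above, handles sustained congestion by reacting to the fraction of packets that switches have marked with Explicit Congestion Notification (ECN), rather than halving the window on any single signal. The sketch below follows the sender-side update described in RFC 8257. The class name `DctcpSender`, the method name, and the two-segment window floor are assumptions for illustration, and nothing here reflects how Meta configures DCTCP in production.

```python
class DctcpSender:
    """Sketch of the DCTCP sender-side window update from RFC 8257."""

    def __init__(self, cwnd_bytes: float, alpha: float = 1.0, g: float = 1.0 / 16) -> None:
        self.cwnd_bytes = cwnd_bytes
        self.alpha = alpha  # estimate of the marked fraction; RFC 8257 initializes it to 1
        self.g = g          # estimation gain; RFC 8257 recommends 1/16
        self.min_cwnd_bytes = 2 * 1460  # assumption: floor of two 1460-byte segments

    def on_window_acked(self, acked_bytes: int, marked_bytes: int) -> None:
        """Call once per window of data (roughly once per RTT)."""
        if acked_bytes <= 0:
            return
        marked_fraction = marked_bytes / acked_bytes
        # Exponentially weighted moving average of the marked fraction.
        self.alpha = (1 - self.g) * self.alpha + self.g * marked_fraction
        if marked_bytes > 0:
            # Reduce in proportion to the extent of congestion, at most once per window.
            reduced = self.cwnd_bytes * (1 - self.alpha / 2)
            self.cwnd_bytes = max(self.min_cwnd_bytes, reduced)
```

With `alpha` near 0 (few marks) the window barely shrinks, and with `alpha` at 1 (every byte marked) the reduction matches the classic halving, which is what lets DCTCP keep queues short without giving up throughput.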

Key Actionable Insights

1. Implementing QUIC's direct server return (DSR) can significantly enhance the efficiency of content delivery networks. By bypassing multiple hops in the CDN architecture, DSR reduces CPU cycle usage and improves bandwidth, making it a valuable strategy for high-traffic applications.
2. Utilizing QUIC Jump Start can dramatically decrease startup delays for new connections in high-bandwidth-delay product (BDP) links. This approach is particularly useful for small data transfers that would otherwise struggle to utilize available bandwidth effectively, thereby optimizing overall transfer times.
3. Adopting DCTCP for managing sustained congestion can enhance data center reliability. This technique leverages Explicit Congestion Notification (ECN) signals to dynamically adjust to network conditions, ensuring consistent performance even during high demand.
4. Developing a modular BPF-based platform like NetEdit allows for safe and efficient network changes at scale. This approach ensures that network modifications can be validated and tested thoroughly, minimizing risks to production traffic.
5. Implementing a host-based traffic admission system can improve WAN resource management. This system allows for better sharing of network resources among high-demand services, ensuring that peak demands are met without overprovisioning.
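
The summary does not say how the host-based traffic admission system decides what to admit. As one common way to enforce a per-service share on the sending host, here is a minimal token-bucket sketch. The token bucket itself is a swapped-in technique rather than the mechanism from the talk, and the class name `TokenBucket`, its parameters, and the example rates are hypothetical.

```python
class TokenBucket:
    """Hypothetical per-service admission check run on the sending host."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate_bytes_per_s = rate_bytes_per_s  # the service's sustained WAN share
        self.burst_bytes = burst_bytes            # how far it may briefly exceed that share
        self.tokens = burst_bytes                 # start with a full bucket
        self.last_refill_s = 0.0

    def admit(self, size_bytes: int, now_s: float) -> bool:
        """Return True if `size_bytes` fits within the service's share at time `now_s`."""
        elapsed = max(0.0, now_s - self.last_refill_s)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_refill_s = now_s
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False  # over the share: the caller could delay, drop, or deprioritize


# Example with made-up numbers: 1000 B/s sustained, 500 B burst.
bucket = TokenBucket(rate_bytes_per_s=1000.0, burst_bytes=500.0)
first = bucket.admit(400, now_s=0.0)   # fits in the initial burst
second = bucket.admit(400, now_s=0.0)  # only 100 tokens left, so it is rejected
third = bucket.admit(400, now_s=0.5)   # bucket has refilled to its cap by now
```

Making the decision on the host, before traffic reaches the WAN, is what allows several high-demand services to share capacity without the network being provisioned for the sum of their peaks.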

Common Pitfalls

1. Failing to adequately test network changes can lead to disruptions in production traffic. Without thorough validation, even minor adjustments can cause significant issues, especially in large-scale environments where uptime is critical.