Curbing Connection Churn in Zuul

Netflix Technology Blog
11 min read, advanced

Overview

The article describes the problem of connection churn in Zuul, Netflix's gateway service, and the strategies used to mitigate it. The key improvements are HTTP/2 multiplexing and a subsetting algorithm, which together optimize connection management and reduce the overall connection count.

What You'll Learn

  1. How to implement HTTP/2 multiplexing in Zuul
  2. Why subsetting can reduce connection churn in service architectures
  3. When to apply the Ringsteady algorithm for load balancing

Prerequisites & Requirements

  • Understanding of HTTP/2 and multiplexing concepts
  • Experience with service mesh architectures (optional)

Key Questions Answered

How does Zuul handle connection churn during traffic spikes?
Zuul mitigates connection churn by implementing HTTP/2 multiplexing, which allows multiple streams to be sent over a single connection. This significantly reduces the number of connections required, especially during traffic spikes, by reusing existing connections instead of establishing new ones.
What is the impact of subsetting on connection management in Zuul?
Subsetting allows Zuul to partition origin servers into smaller groups, which reduces the total number of connections while maintaining throughput. This strategy improves load balancing and minimizes connection churn, especially when origins scale up or down.
What improvements were observed after implementing the new connection strategies in Zuul?
After implementing the new strategies, Zuul saw a 10x reduction in total connections at peak traffic and an 8x reduction in churn (the number of TCP connections it opens per second), which improved overall system stability and performance.
How does the Ringsteady algorithm improve load balancing?
The Ringsteady algorithm creates a balanced distribution of servers, allowing Zuul to allocate traffic evenly across subsets. This method ensures that adding or removing servers does not disrupt the overall load balancing, maintaining stability in the system.
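
The ring-based idea in the last answer can be illustrated with a small sketch. It is not Zuul's implementation: it assumes servers are placed on a unit ring at positions taken from the base-2 van der Corput sequence (a low-discrepancy sequence), and that each gateway node takes the next `subset_size` servers clockwise from its own position. The function names `van_der_corput`, `build_ring`, and `pick_subset` are hypothetical.

```python
from bisect import bisect_left


def van_der_corput(n: int) -> float:
    """Return the n-th term of the base-2 van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while n > 0:
        denom *= 2.0
        n, bit = divmod(n, 2)
        value += bit / denom
    return value


def build_ring(server_count: int) -> list[tuple[float, int]]:
    """Place server i at position van_der_corput(i); return (position, server) sorted by position.

    Assumption for illustration: a server's position depends only on its own
    index, so adding a server never moves the servers already on the ring.
    """
    return sorted((van_der_corput(i), i) for i in range(server_count))


def pick_subset(ring: list[tuple[float, int]], client_index: int, subset_size: int) -> list[int]:
    """Walk clockwise from the client's ring position and take subset_size servers."""
    positions = [pos for pos, _ in ring]
    start = bisect_left(positions, van_der_corput(client_index))
    size = min(subset_size, len(ring))
    return [ring[(start + k) % len(ring)][1] for k in range(size)]


if __name__ == "__main__":
    ring = build_ring(8)
    for client in range(4):
        print(client, pick_subset(ring, client, 3))
```

Because each position is derived from the server's index alone, growing the ring from 8 to 9 servers only inserts one new point. Subsets that do not span that point stay the same, which is the stability property the answer above refers to.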

Key Statistics & Figures

Total connections reduction
10x
This reduction was observed at peak traffic across all deployment regions.
Churn reduction
8x
This improvement was noted in the number of TCP connections Zuul opens per second.
CPU utilization reduction
~4%
This reduction in CPU usage was a direct result of the implemented connection strategies.
Heap usage reduction
~15%
The reduction in heap usage contributed to overall system efficiency.
Latency reduction
~3%
This improvement in latency was observed post-implementation of the new connection management strategies.

Technologies & Tools

Backend
Zuul
Used as a gateway service to manage connections and traffic.
Protocol
HTTP/2
Implemented for multiplexing to reduce connection overhead.
Service Discovery
Eureka
Integrated with Zuul to manage origin instances and support the new connection strategies.

Key Actionable Insights

1
Implementing HTTP/2 multiplexing can drastically reduce connection overhead in service architectures.
This is particularly useful for applications experiencing high traffic spikes, as it allows for multiple requests to be handled over a single connection, thus minimizing the need for new connections.
2
Utilizing subsetting can enhance load balancing and reduce connection churn.
By partitioning origin servers into subsets, you can manage connections more efficiently, especially as the number of servers scales. This strategy helps maintain performance and stability.
3
Regularly review and adjust the replication factor of subsets based on traffic patterns.
This ensures that you maintain optimal performance without introducing unnecessary churn, especially as the number of origin nodes changes over time.
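
To see why multiplexing and subsetting compound, a back-of-the-envelope model helps. The sketch below is an assumption-laden simplification, not a description of Zuul's connection pools: it treats an HTTP/1.1 connection as carrying one in-flight request at a time, an HTTP/2 connection as carrying up to `streams_per_connection` concurrent streams, and every gateway as holding the same number of connections to each origin in its subset. All numbers are hypothetical.

```python
import math


def connections_per_origin(concurrent_requests: int, streams_per_connection: int) -> int:
    """Connections one gateway needs to one origin to carry the given concurrency.

    streams_per_connection=1 models HTTP/1.1 without pipelining; larger values
    model HTTP/2 multiplexing.
    """
    if concurrent_requests <= 0:
        return 0
    return math.ceil(concurrent_requests / streams_per_connection)


def total_connections(gateway_nodes: int, origins_per_gateway: int, conns_per_origin: int) -> int:
    """Total gateway-to-origin connections across the whole fleet."""
    return gateway_nodes * origins_per_gateway * conns_per_origin


if __name__ == "__main__":
    gateways, origins = 10, 200  # hypothetical fleet sizes

    # Full mesh over HTTP/1.1, 8 concurrent requests per gateway-origin pair.
    print(total_connections(gateways, origins, connections_per_origin(8, 1)))

    # Full mesh over HTTP/2 with an illustrative limit of 100 streams.
    print(total_connections(gateways, origins, connections_per_origin(8, 100)))

    # HTTP/2 plus subsetting: 50 origins per gateway, so each pair now carries
    # 4x the concurrency (32), which still fits on one multiplexed connection.
    print(total_connections(gateways, 50, connections_per_origin(32, 100)))
```

In this toy model the three configurations need 16000, 2000, and 500 connections respectively. The figures only illustrate the direction of the effect and are unrelated to the measurements reported in the article.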

Common Pitfalls

1
Failing to properly implement HTTP/2 multiplexing can lead to underutilization of connections.
If requests are not actually multiplexed over shared connections, the expected reduction in connection count and the related performance gains will not materialize, and connection counts will stay high.
2
Not adjusting the replication factor of subsets can cause imbalances in load distribution.
If the replication factor is not regularly reviewed and adjusted, it can lead to uneven traffic distribution and potential hot-spotting on certain origin servers.
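
One way to spot the hot-spotting described in the second pitfall is to count how many gateway nodes hold connections to each origin and compare the extremes. The sketch below assumes you can list, for each gateway node, the origin indices in its subset. The helper names `coverage` and `imbalance` are hypothetical and not part of Zuul or Eureka.

```python
from collections import Counter


def coverage(subsets: list[list[int]], origin_count: int) -> list[int]:
    """Return, for each origin index, how many gateway subsets include it."""
    counts = Counter(origin for subset in subsets for origin in set(subset))
    return [counts.get(i, 0) for i in range(origin_count)]


def imbalance(subsets: list[list[int]], origin_count: int) -> float:
    """Ratio of the most-covered to the least-covered origin.

    1.0 means perfectly even coverage; infinity means at least one origin
    receives no connections at all.
    """
    per_origin = coverage(subsets, origin_count)
    if not per_origin:
        return 1.0
    low = min(per_origin)
    return float("inf") if low == 0 else max(per_origin) / low


if __name__ == "__main__":
    even = [[0, 1], [1, 2], [2, 3], [3, 0]]
    skewed = [[0, 1], [0, 1], [2, 3]]
    print(coverage(even, 4), imbalance(even, 4))
    print(coverage(skewed, 4), imbalance(skewed, 4))
```

A ratio that drifts well above 1.0 as origins or gateway nodes are added suggests the subset size is due for review.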

Related Concepts

HTTP/2 Multiplexing
Load Balancing Strategies
Service Mesh Architectures
Connection Management Techniques