Rethinking Netflix’s Edge Load Balancing

Netflix Technology Blog
19 min read · advanced

Overview

The article discusses Netflix's improvements in edge load balancing, focusing on the implementation of a new load balancing strategy that combines client-side and server-side data to enhance performance and reduce errors. It details the guiding principles, the load-balancing approach, and the operational impacts of these changes.

What You'll Learn

1. How to implement a choice-of-2 algorithm for load balancing
2. Why combining client-side and server-side data improves load balancing
3. How to apply probation and server-age mechanisms to prevent overload

Prerequisites & Requirements

  • Understanding of load balancing concepts and algorithms
  • Experience with distributed systems and microservices architecture (optional)

Key Questions Answered

How does Netflix's new load balancing approach reduce errors?
The new load balancing approach reduces errors by utilizing a combination of client-side latency data and server-side utilization metrics. This dual data source allows for better decision-making in routing requests, leading to a significant decrease in load-related errors and improved service availability.
What are the guiding principles for Netflix's load balancing improvements?
The guiding principles include working within existing frameworks, applying learnings from other teams, avoiding distributed state, and minimizing client-side configuration. These principles ensure that the new load balancing strategies are efficient, reusable, and maintainable across different systems.
What operational impacts did the new load balancing strategy have?
The new load balancing strategy resulted in wider request distribution and ensured that slower servers received less traffic. This change improved overall system resilience and reduced the chances of overload during peak times, although it also introduced challenges in monitoring and alerting.
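
As a rough illustration of how client-side and server-side data might be combined, the sketch below folds both into a single score per server. It is not the implementation from the article: the field names, the weights, and the neutral fallback for servers that have not yet reported are all assumptions made for the example. Each load balancer keeps only its own local view, in line with the principle of avoiding distributed state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerStats:
    """One load balancer's local view of one server (hypothetical fields)."""
    in_flight: int = 0        # client-side: requests this balancer has outstanding
    recent_errors: int = 0    # client-side: recent failures seen by this balancer
    # server-side: utilization in [0.0, 1.0] as last reported by the server,
    # or None if the server has not reported anything yet
    reported_utilization: Optional[float] = None


def score(stats: ServerStats,
          error_weight: float = 5.0,
          util_weight: float = 100.0) -> float:
    """Return a load score where lower is better.

    The weights are illustrative placeholders, not tuned values.
    """
    client_part = stats.in_flight + error_weight * stats.recent_errors
    # Assumed fallback: treat an unreported server as moderately loaded.
    if stats.reported_utilization is None:
        utilization = 0.5
    else:
        utilization = stats.reported_utilization
    return client_part + util_weight * utilization
```

Two servers that look identical from the client side (no requests in flight, no errors) still receive different scores when one of them reports higher utilization, which is the gap that client-only data cannot close.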

Key Statistics & Figures

Reduction in load-shedding and connection errors
Orders of magnitude reduction
Measured against the previous round-robin load balancing implementation.
Improvement in average and long-tail latency
3x reduction
This significant improvement was observed with the new load balancing features enabled.
Error reduction contribution from server-utilization feature
One order of magnitude
This feature alone provided substantial value in reducing errors.

Technologies & Tools

Backend
Zuul
Used as the load balancer framework for Netflix's microservices architecture.
Backend
Ribbon
The load balancer used previously, with a round-robin algorithm, before these improvements were made.

Key Actionable Insights

1. Implement a choice-of-2 algorithm to enhance load distribution across servers.
   This approach mitigates the herding problem seen in traditional load balancing methods, ensuring that requests are more evenly spread across available resources.
2. Utilize server-reported utilization metrics to improve decision-making in load balancing.
   By relying on real-time data from servers, you can avoid the inaccuracies of client-side metrics, leading to better performance and reduced error rates.
3. Incorporate probation mechanisms for newly launched servers to prevent overload.
   This practice helps ensure that new servers are gradually introduced into the traffic flow, allowing them to stabilize before handling full loads.
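
The choice-of-2 idea in the first insight can be sketched in a few lines: pick two servers at random and send the request to whichever looks less loaded. The function below is a minimal, generic version under stated assumptions. The name `choose_of_two` and the caller-supplied `score` callback (lower means less loaded) are hypothetical and not taken from Zuul or Ribbon.

```python
import random
from typing import Callable, Optional, Sequence, TypeVar

S = TypeVar("S")


def choose_of_two(servers: Sequence[S],
                  score: Callable[[S], float],
                  rng: Optional[random.Random] = None) -> S:
    """Sample two distinct servers at random and return the lower-scoring one."""
    if not servers:
        raise ValueError("no servers available")
    if len(servers) == 1:
        return servers[0]
    # Use the caller's generator if given, else the module-level one.
    sampler = rng if rng is not None else random
    first, second = sampler.sample(list(servers), 2)
    return first if score(first) <= score(second) else second
```

Because only two candidates are compared per request, independent load balancers are unlikely to all converge on the same "best" server at once, yet the single worst server is never chosen whenever at least two servers are available.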

Common Pitfalls

1. Relying solely on client-side metrics can lead to inaccurate load balancing decisions.
   This occurs because client-side views may not represent the actual utilization of servers, especially in large clusters with low traffic, leading to inefficient request routing.
2. Overloading newly launched servers during their warm-up phase.
   Without proper probation mechanisms, new servers may receive too much traffic too quickly, causing them to fail before they can stabilize.
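
One way to guard against the second pitfall is sketched below, combining a probation rule with an age-based ramp. This summary does not give the exact rules or thresholds, so everything here is an assumption for illustration: the class name `NewServerGuard`, the single-probe probation rule, the linear ramp, and the 90-second default.

```python
from dataclasses import dataclass


@dataclass
class NewServerGuard:
    """Tracks one server's warm-up state (all thresholds are assumptions)."""
    launched_at: float              # launch time, in seconds
    warmup_seconds: float = 90.0    # illustrative ramp length
    has_responded: bool = False     # set once any response has come back
    in_flight: int = 0              # requests currently outstanding

    def on_probation(self) -> bool:
        return not self.has_responded

    def traffic_share(self, now: float) -> float:
        """Fraction of normal traffic allowed, rising linearly from 0 to 1 with age."""
        age = max(0.0, now - self.launched_at)
        return min(1.0, age / self.warmup_seconds)

    def may_send(self, now: float, draw: float) -> bool:
        """Decide whether to route a request; draw is uniform in [0, 1)."""
        if self.on_probation():
            # Allow a single probe until the server proves it can answer.
            return self.in_flight == 0
        return draw < self.traffic_share(now)
```

A caller would pass `random.random()` as `draw`, increment `in_flight` when sending, and set `has_responded` on the first successful response. Halfway through the ramp, roughly half of the candidate requests are let through.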

Related Concepts

Distributed Systems
Microservices Architecture
Load Balancing Algorithms
Error Handling In Cloud Environments