Rethinking Netflix’s Edge Load Balancing

Netflix Technology Blog

Netflix

Load Balancer

Overview

The article discusses Netflix's improvements in edge load balancing, focusing on the implementation of a new load balancing strategy that combines client-side and server-side data to enhance performance and reduce errors. It details the guiding principles, the load-balancing approach, and the operational impacts of these changes.

What You'll Learn

1. How to implement a choice-of-2 algorithm for load balancing
2. Why combining client-side and server-side data improves load balancing
3. How to apply probation and server-age mechanisms to prevent overload

Prerequisites & Requirements

Understanding of load balancing concepts and algorithms
Experience with distributed systems and microservices architecture (optional)

Key Questions Answered

How does Netflix's new load balancing approach reduce errors?

The new load balancing approach reduces errors by utilizing a combination of client-side latency data and server-side utilization metrics. This dual data source allows for better decision-making in routing requests, leading to a significant decrease in load-related errors and improved service availability.
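
To make the idea of two data sources concrete, here is a minimal Python sketch of a per-server score that blends a client-side latency average with a server-reported utilization figure. The summary does not give Netflix's actual formula, so the `ServerStats` fields, the additive form of `score`, and the weight and smoothing values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """One load balancer's view of one server (hypothetical fields)."""
    ewma_latency_ms: float = 0.0       # client-side: smoothed latency seen by this balancer
    reported_utilization: float = 0.0  # server-side: 0.0 to 1.0, as last reported by the server


def record_latency(stats: ServerStats, latency_ms: float, alpha: float = 0.2) -> None:
    """Fold a new latency sample into an exponentially weighted moving average."""
    if stats.ewma_latency_ms == 0.0:
        stats.ewma_latency_ms = latency_ms  # first sample seeds the average
    else:
        stats.ewma_latency_ms = alpha * latency_ms + (1 - alpha) * stats.ewma_latency_ms


def score(stats: ServerStats, utilization_weight: float = 100.0) -> float:
    """Lower is better. The weight is illustrative, not a tuned value."""
    return stats.ewma_latency_ms + utilization_weight * stats.reported_utilization


# A server that looks fast from this client but reports high utilization
# scores worse than a slightly slower server with plenty of headroom.
hot = ServerStats(ewma_latency_ms=20.0, reported_utilization=0.9)
cool = ServerStats(ewma_latency_ms=30.0, reported_utilization=0.1)
preferred = min((hot, cool), key=score)
```

In this sketch the client-side term captures what only the caller can see (network path and observed response time), while the server-side term captures load coming from every other caller.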

What are the guiding principles for Netflix's load balancing improvements?

The guiding principles include working within existing frameworks, applying learnings from other teams, avoiding distributed state, and minimizing client-side configuration. These principles ensure that the new load balancing strategies are efficient, reusable, and maintainable across different systems.

What operational impacts did the new load balancing strategy have?

The new load balancing strategy resulted in wider request distribution and ensured that slower servers received less traffic. This change improved overall system resilience and reduced the chances of overload during peak times, although it also introduced challenges in monitoring and alerting.

Key Statistics & Figures

Reduction in load-shedding and connection errors: orders of magnitude, compared with the previous round-robin load balancing implementation.

Improvement in average and long-tail latency: 3x reduction, observed with the new load balancing features enabled.

Error reduction contributed by the server-utilization feature: one order of magnitude from this feature alone.

Technologies & Tools

Zuul (backend): Used as the load balancer framework for Netflix's microservices architecture.

Ribbon (backend): The previously used load balancer, which applied a round-robin algorithm before the improvements were made.

Key Actionable Insights

1. Implement a choice-of-2 algorithm to improve load distribution across servers.
   This approach mitigates the herding problem, in which many load balancers send traffic at the same moment to whichever server looks least loaded, so requests are spread more evenly across available servers.

2. Use server-reported utilization metrics to improve load balancing decisions.
   Supplementing client-side statistics with utilization reported by the servers themselves compensates for the inaccuracies of a purely client-side view, leading to better performance and lower error rates.

3. Apply probation mechanisms to newly launched servers to prevent overload.
   New servers are introduced into the traffic flow gradually, which lets them stabilize before handling full load.
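
The choice-of-2 selection in the first insight can be sketched in a few lines of Python. This is a generic illustration of the technique, not Netflix's Zuul code: `choose_server` and its `score` callback are hypothetical names, and the score is assumed to be any "lower is better" number such as in-flight requests or reported utilization.

```python
import random


def choose_server(servers, score, rng=random):
    """Sample two distinct servers at random and return the lower-scoring one.

    `servers` is a non-empty list; `score` maps a server to a number
    where lower means less loaded (an assumption of this sketch).
    """
    if not servers:
        raise ValueError("no servers available")
    if len(servers) == 1:
        return servers[0]
    first, second = rng.sample(servers, 2)
    return first if score(first) <= score(second) else second


# Usage: a toy load table standing in for real statistics.
load = {"a": 5, "b": 1, "c": 9}
servers = list(load)
rng = random.Random(0)  # seeded only to make the example repeatable
picks = [choose_server(servers, load.get, rng) for _ in range(200)]
```

Because each decision compares only two random candidates, the most loaded server never wins a comparison, yet traffic is not funneled entirely onto the single best-looking server, which is what makes the approach tolerant of slightly stale data.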

Common Pitfalls

1. Relying solely on client-side metrics can lead to inaccurate load balancing decisions.
   This occurs because client-side views may not represent the actual utilization of servers, especially in large clusters with low traffic, leading to inefficient request routing.

2. Overloading newly launched servers during their warm-up phase.
   Without proper probation mechanisms, new servers may receive too much traffic too quickly, causing them to fail before they can stabilize.
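
One way to guard against the second pitfall is sketched below in Python. It assumes two separate mechanisms: a probation rule that caps outstanding requests to a server until the balancer has seen a response from it, and an age-based penalty that fades as the server warms up. The function names, the limit of one outstanding request, the 90-second window, and the penalty size are illustrative assumptions, not values taken from this summary.

```python
def in_probation(responses_received: int) -> bool:
    """A server is on probation until this balancer has seen a response from it."""
    return responses_received == 0


def may_send(in_flight: int, responses_received: int, probation_limit: int = 1) -> bool:
    """During probation, allow at most `probation_limit` outstanding requests."""
    if in_probation(responses_received):
        return in_flight < probation_limit
    return True


def age_penalty(age_seconds: float,
                warmup_seconds: float = 90.0,
                max_penalty: float = 100.0) -> float:
    """Score penalty that decays linearly from max_penalty to 0 over the warm-up window."""
    if age_seconds >= warmup_seconds:
        return 0.0
    remaining = 1.0 - max(age_seconds, 0.0) / warmup_seconds
    return max_penalty * remaining
```

Adding `age_penalty` to a server's "lower is better" score makes a young server lose most comparisons at first and gradually compete on equal terms, while `may_send` keeps a server that has not yet answered from accumulating a backlog.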

Related Concepts

Distributed Systems

Microservices Architecture

Load Balancing Algorithms

Error Handling In Cloud Environments