Hodor: Detecting and addressing overload in LinkedIn microservices

Bryan Barkley
17 min read, advanced

Overview

The article discusses Hodor, a framework developed by LinkedIn to detect and address service overload in their microservices architecture. It outlines the challenges faced with service overload and details the mechanisms Hodor employs to maintain quality of service by intelligently shedding load during peak traffic.

What You'll Learn

1. How to implement Holistic Overload Detection in microservices
2. Why adaptive concurrency limits are effective in managing service overload
3. How to utilize load shedding strategies to maintain service health

Prerequisites & Requirements

  • Understanding of microservices architecture and overload scenarios
  • Familiarity with Java and JVM performance metrics (optional)

Key Questions Answered

How does Hodor detect service overload in LinkedIn's microservices?
Hodor employs overload detectors that monitor CPU availability and other resource limits within the JVM. By measuring the ability to obtain CPU time and analyzing performance indicators like latency and garbage collection activity, it determines when a service is overloaded and needs to shed traffic.
What strategies does Hodor use to shed load during overload situations?
Hodor uses an adaptive algorithm that sets concurrency limits based on real-time feedback from the overload detectors. The system adjusts the number of concurrent requests it handles dynamically, so that only as much traffic is dropped as is needed to keep the service healthy.
What are the common causes of overload in LinkedIn's microservices?
Common causes of overload include CPU and memory exhaustion, I/O limits for network and disk access, and increased latencies from downstream services. Hodor is designed to address these various overload scenarios effectively.
How does Hodor ensure clients can safely retry rejected requests?
Hodor rejects requests with a specific HTTP status code, such as 503, before any application logic is executed. Because a rejected request has had no side effects, clients can retry it safely without unintended consequences during overload situations.
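
The load-shedding answers above can be illustrated with a minimal sketch of an adaptive concurrency limiter. This is not Hodor's actual algorithm, which the summary does not specify: the class name `AdaptiveLimiter`, its methods, and the halve-on-overload, add-one-on-recovery (AIMD) policy are all assumptions chosen for illustration. A request that fails `tryAcquire()` would be answered with a 503 before any application logic runs.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a concurrency limit adjusted by detector feedback. */
final class AdaptiveLimiter {
    private final int minLimit;
    private final int maxLimit;
    private final AtomicInteger limit;
    private final AtomicInteger inFlight = new AtomicInteger();

    AdaptiveLimiter(int minLimit, int maxLimit) {
        this.minLimit = minLimit;
        this.maxLimit = maxLimit;
        this.limit = new AtomicInteger(maxLimit); // start fully open
    }

    /**
     * Returns true if the request is admitted. The caller must call release()
     * when the request finishes. A false result means "reject with 503".
     */
    boolean tryAcquire() {
        while (true) {
            int current = inFlight.get();
            if (current >= limit.get()) {
                return false; // at or above the limit: shed this request
            }
            if (inFlight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    void release() {
        inFlight.decrementAndGet();
    }

    /**
     * Assumed policy (AIMD): halve the limit while the detector reports
     * overload, otherwise raise it by one, staying within [minLimit, maxLimit].
     */
    void onDetectorSignal(boolean overloaded) {
        limit.updateAndGet(l -> overloaded
                ? Math.max(minLimit, l / 2)
                : Math.min(maxLimit, l + 1));
    }

    int currentLimit() {
        return limit.get();
    }
}
```

Checking the limit before any work is done is what keeps a rejection free of side effects; requests already in flight are left to finish even after the limit drops.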

Key Statistics & Figures

CPU overload detection threshold: 99th percentile over 55 ms
A measurement window whose 99th-percentile value exceeds 55 ms counts as a violation of CPU availability.
Consecutive violation windows for overload: 8
When 8 consecutive windows are in violation, the service is classified as overloaded.
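
The two figures above combine into a simple windowed check, sketched below under stated assumptions. The class `WindowedOverloadDetector` and its methods are hypothetical names, each sample is assumed to be a measured delay in milliseconds, and the nearest-rank percentile is one common definition, not necessarily the one Hodor uses. How the samples are collected is outside this sketch.

```java
import java.util.Arrays;

/** Hypothetical sketch of the "p99 over 55 ms for 8 consecutive windows" rule. */
final class WindowedOverloadDetector {
    private final double thresholdMs;   // 55 ms in the figures above
    private final int windowsRequired;  // 8 consecutive windows in the figures above
    private int consecutiveViolations = 0;

    WindowedOverloadDetector(double thresholdMs, int windowsRequired) {
        this.thresholdMs = thresholdMs;
        this.windowsRequired = windowsRequired;
    }

    /** Nearest-rank 99th percentile of one window's samples. */
    static double p99(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("empty window");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Integer form of ceil(0.99 * n), as a 1-based rank.
        int rank = (99 * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    /**
     * Feeds one completed window of samples. Returns true once the required
     * number of consecutive windows have violated the threshold. A single
     * healthy window resets the count.
     */
    boolean recordWindow(double[] samplesMs) {
        if (p99(samplesMs) > thresholdMs) {
            consecutiveViolations++;
        } else {
            consecutiveViolations = 0;
        }
        return consecutiveViolations >= windowsRequired;
    }
}
```

Requiring consecutive violations, and not just one bad window, keeps a brief spike from triggering load shedding.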

Technologies & Tools

Backend
Java
Used for developing LinkedIn's microservices and the Hodor framework.
Framework
Rest.li
Primary framework used for communication in LinkedIn's services.

Key Actionable Insights

1. Implementing an adaptive load shedding strategy can significantly improve service reliability during peak traffic.
   By dynamically adjusting concurrency limits based on real-time overload detection, services can maintain performance without manual tuning.
2. Regularly monitor JVM performance metrics to identify potential overload scenarios before they impact users.
   Using tools to track CPU availability and garbage collection activity can help preemptively address issues that may lead to service degradation.
3. Encourage clients to implement retry logic for handling rejected requests to enhance user experience.
   Providing clear guidelines on retry strategies can mitigate the impact of traffic shedding during overload situations.
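
As an illustration of the third insight, here is a minimal client-side retry sketch. It is an assumption-laden example and not LinkedIn's or Rest.li's client code: the class `RetryOn503`, the `IntSupplier` stand-in for "send the request and return its HTTP status", and the exponential backoff schedule are all hypothetical.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch: retry a request only while the server answers 503. */
final class RetryOn503 {
    /**
     * Calls the request up to maxAttempts times, sleeping baseDelayMs, then
     * 2x, 4x, ... between attempts. Returns the last status code observed.
     */
    static int callWithRetry(IntSupplier request, int maxAttempts, long baseDelayMs) {
        int status = request.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && status == 503; attempt++) {
            try {
                Thread.sleep(baseDelayMs << (attempt - 1)); // exponential backoff
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                return status;
            }
            status = request.getAsInt();
        }
        return status;
    }
}
```

A call such as `RetryOn503.callWithRetry(() -> sendRequest(), 3, 50L)` would try at most three times, where `sendRequest` is a placeholder for the real client call. Capping the attempts and backing off matter because unbounded retries add load to a service that is already shedding traffic.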

Common Pitfalls

1. Failing to adjust concurrency limits dynamically can lead to unnecessary service overload.
   Static limits may not account for changing traffic patterns, leading to either under-utilization or overload.
2. Ignoring JVM performance metrics can result in undetected overload scenarios.
   Without monitoring, services may experience degradation before issues are identified and addressed.

Related Concepts

Microservices Architecture
Overload Detection Techniques
Load Shedding Strategies
JVM Performance Optimization