Hodor: Overload scenarios and the evolution of their detection and handling

LinkedIn Engineering Team
13 min read · advanced

Overview

The article discusses the Hodor framework developed by LinkedIn for detecting and handling overload scenarios in microservices. It outlines the evolution of overload detection methods, introduces new tools for identifying garbage collection and application threadpool overloads, and emphasizes the importance of maintaining high availability for LinkedIn services.

What You'll Learn

1. How to detect different types of overloads in real-time using the Hodor framework
2. Why traffic tiering is essential for prioritizing requests during overload scenarios
3. How to implement garbage collection overload detection in Java microservices
4. When to apply load shedding strategies to maintain service availability

Prerequisites & Requirements

  • Understanding of microservices architecture and overload scenarios
  • Familiarity with Java and the Java Virtual Machine (JVM)

Key Questions Answered

What are the main goals of the Hodor framework?
The Hodor framework aims to detect various types of overloads in real-time, mitigate these overloads to improve resilience, function as an out-of-the-box solution for all LinkedIn services, and ensure a net positive impact on member experience.
How does the garbage collection (GC) detector work?
The GC detector monitors the overhead of garbage collection events in Java microservices. It calculates the GC overhead percentage and assigns it to an overhead tier. It signals a GC overload when the overhead stays in a tier for longer than that tier's violation period, which varies with the tier's severity.
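
The tier-and-violation-period idea can be sketched roughly as follows. This is a minimal illustration, not Hodor's implementation: the class name `GcOverloadSketch`, the three tier thresholds, and the violation periods are all hypothetical, and the sketch assumes that more severe tiers get shorter violation periods. The only real API used is the JVM's `GarbageCollectorMXBean`, which reports cumulative collection time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical sketch; tier thresholds and violation periods are illustrative assumptions.
final class GcOverloadSketch {
    // Tiers ordered from least to most severe: minimum GC overhead % and violation period.
    // Assumption: more severe tiers are given shorter violation periods.
    private static final double[] MIN_OVERHEAD_PERCENT = {10.0, 25.0, 50.0};
    private static final long[] VIOLATION_PERIOD_MS = {30_000L, 10_000L, 3_000L};

    // Time at which overhead first reached each tier, or -1 if currently below it.
    private final long[] tierEnteredAtMs = {-1L, -1L, -1L};

    /** Cumulative GC time reported by the JVM across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime(); // -1 when the collector does not report time
            if (t > 0) total += t;
        }
        return total;
    }

    /** Share of a sampling window spent in GC, as a percentage. */
    static double overheadPercent(long gcMillisInWindow, long windowMillis) {
        return windowMillis <= 0 ? 0.0 : 100.0 * gcMillisInWindow / windowMillis;
    }

    /** Feeds one sample; returns true once any tier has been violated for its full period. */
    boolean onSample(long nowMs, double overheadPercent) {
        boolean overloaded = false;
        for (int i = 0; i < MIN_OVERHEAD_PERCENT.length; i++) {
            if (overheadPercent >= MIN_OVERHEAD_PERCENT[i]) {
                if (tierEnteredAtMs[i] < 0) tierEnteredAtMs[i] = nowMs;
                if (nowMs - tierEnteredAtMs[i] >= VIOLATION_PERIOD_MS[i]) overloaded = true;
            } else {
                tierEnteredAtMs[i] = -1L; // dropped below this tier: reset its timer
            }
        }
        return overloaded;
    }
}
```

A caller would sample `totalGcMillis()` periodically, pass the delta between two samples to `overheadPercent`, and feed the result to `onSample` together with the current time.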
What is traffic tiering and how is it implemented?
Traffic tiering categorizes requests into three tiers: optional, degradable, and non-degradable. This prioritization allows the system to drop lower priority requests first, ensuring that critical user requests are processed even during overload situations.
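
Dropping lower-priority tiers first might look something like the sketch below. The tier names come from the description above, but everything else is an assumption made for illustration: the `TieredShedderSketch` class, the `escalate`/`relax` methods, and in particular the policy that non-degradable traffic is never shed.

```java
// Hypothetical sketch of tier-based load shedding; names and escalation policy are assumptions.
enum TrafficTier { OPTIONAL, DEGRADABLE, NON_DEGRADABLE }

final class TieredShedderSketch {
    // Number of lowest-priority tiers currently being dropped (0 = admit everything).
    private int shedLevel = 0;

    /** Called while an overload signal persists: start dropping the next tier up. */
    void escalate() {
        // Assumption: this sketch never sheds NON_DEGRADABLE traffic.
        shedLevel = Math.min(shedLevel + 1, TrafficTier.NON_DEGRADABLE.ordinal());
    }

    /** Called once the overload clears: re-admit the most recently dropped tier. */
    void relax() {
        shedLevel = Math.max(shedLevel - 1, 0);
    }

    /** A request is admitted when its tier ranks at or above the current shed level. */
    boolean admit(TrafficTier tier) {
        return tier.ordinal() >= shedLevel;
    }
}
```

Because the enum is declared in priority order, the admission check reduces to a single ordinal comparison, and optional requests are always the first to be rejected.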
What is the purpose of the latency confirmation filter?
The latency confirmation filter reduces false positives from the overload detectors by using a moving average crossover algorithm. It confirms an overload signal only when latency has also increased significantly.
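
One common form of a moving average crossover compares a short-window average against a long-window baseline. The sketch below takes that approach with simple moving averages. Whether Hodor uses simple or exponential averages is not stated here, so the class `LatencyCrossoverSketch`, its window sizes, and the `ratio` threshold are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: confirm an overload signal only when recent latency
// has risen well above its longer-term baseline.
final class LatencyCrossoverSketch {
    private final Deque<Double> shortWindow = new ArrayDeque<>();
    private final Deque<Double> longWindow = new ArrayDeque<>();
    private final int shortSize;
    private final int longSize;
    private final double ratio; // assumed threshold, e.g. 1.5 = short average 50% above baseline
    private double shortSum = 0.0;
    private double longSum = 0.0;

    LatencyCrossoverSketch(int shortSize, int longSize, double ratio) {
        this.shortSize = shortSize;
        this.longSize = longSize;
        this.ratio = ratio;
    }

    // Appends a value, evicts the oldest one if the window is over capacity,
    // and returns the net change to the window's running sum.
    private static double push(Deque<Double> window, int capacity, double value) {
        window.addLast(value);
        double evicted = window.size() > capacity ? window.removeFirst() : 0.0;
        return value - evicted;
    }

    /** Records one latency observation in both windows. */
    void record(double latencyMs) {
        shortSum += push(shortWindow, shortSize, latencyMs);
        longSum += push(longWindow, longSize, latencyMs);
    }

    /** Confirms a detector signal only if the short average has crossed above the baseline. */
    boolean confirms(boolean detectorSignal) {
        if (!detectorSignal || longWindow.size() < longSize) {
            return false; // no signal to confirm, or no baseline established yet
        }
        double shortAvg = shortSum / shortWindow.size();
        double longAvg = longSum / longWindow.size();
        return shortAvg > ratio * longAvg;
    }
}
```

With windows of 3 and 10 samples and a ratio of 1.5, ten samples of 100 ms followed by three of 400 ms give a short average of 400 ms against a baseline of 190 ms, so a detector signal at that point would be confirmed.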

Key Statistics & Figures

Availability goal for LinkedIn services
99.9%
Meeting this goal requires approximately 1,000 services to work in tandem.
Number of microservices running Hodor
1000+
Hodor has successfully prevented hundreds of overloads in LinkedIn's production systems.

Key Actionable Insights

1. Implement the Hodor framework to enhance overload detection and remediation in your microservices architecture.
   By adopting Hodor, you can proactively manage overload scenarios, ensuring high availability and performance of your services, which is crucial for user satisfaction.
2. Utilize traffic tiering to prioritize critical requests during peak loads.
   This approach helps maintain service quality by ensuring that essential user requests are processed first, reducing the risk of downtime and enhancing user experience.
3. Monitor garbage collection activity to identify potential performance bottlenecks.
   Understanding GC overhead can help you optimize your Java microservices, leading to better resource management and improved application performance.

Common Pitfalls

1. Relying solely on queue length for overload detection can lead to inaccurate assessments.
   Queue length varies based on service design and may not correlate with user experience, making it an unreliable metric for determining overload states.

Related Concepts

Microservices Architecture
Overload Detection And Handling
Traffic Management Strategies