Scaling LinkedIn's Hadoop YARN cluster beyond 10,000 nodes

Keqiu H.
21 min readadvanced
--
View Original

Overview

This article discusses the challenges and solutions LinkedIn faced while scaling its Hadoop YARN cluster beyond 10,000 nodes. It covers performance issues, the development of monitoring tools like DynoYARN, and the introduction of Robin, a load balancer designed to facilitate horizontal scaling.

What You'll Learn

1

How to identify and mitigate performance bottlenecks in a Hadoop YARN cluster

2

Why monitoring and forecasting tools like DynoYARN are essential for large-scale systems

3

How to implement a load balancer like Robin for horizontal scaling of YARN clusters

Prerequisites & Requirements

  • Understanding of Hadoop and YARN architecture
  • Familiarity with monitoring tools and metrics(optional)

Key Questions Answered

What performance issues did LinkedIn encounter when scaling YARN?
LinkedIn faced significant delays in job scheduling as the cluster grew, with users experiencing hours-long delays despite available resources. The primary issues were linked to the resource manager's single-threaded scheduling mechanism and inefficient queue management.
How does DynoYARN help in forecasting YARN scalability?
DynoYARN simulates YARN clusters of arbitrary size and replays production workloads to project future performance. It allows LinkedIn to understand how the resource manager will handle increased workloads, helping to plan for scaling before performance issues arise.
What is Robin and how does it facilitate horizontal scaling?
Robin is a load balancer developed by LinkedIn to distribute YARN applications across multiple clusters. It enables applications to stay within a single sub-cluster while dynamically routing jobs, which helps manage the growing compute demands effectively.

Key Statistics & Figures

Container allocation speed pre-merge
500 containers per second for the primary cluster
This was the average throughput before merging clusters.
Container allocation speed post-merge
600 containers per second
This was the aggregated average allocation speed after merging, which often dropped to as low as 50 containers per second.
p95 application delay at 11,443 nodes
10.278 minutes
This delay is just above the target of 10 minutes for application delays.

Technologies & Tools

Backend
Hadoop
Used as the backbone for big data analytics and machine learning at LinkedIn.
Tool
Dynoyarn
A tool developed to simulate YARN clusters and forecast scalability.
Tool
Robin
A load balancer designed to distribute YARN applications across multiple clusters.

Key Actionable Insights

1
Implementing a load balancer like Robin can significantly improve resource utilization in large-scale YARN environments.
By dynamically routing jobs to the appropriate cluster, Robin mitigates delays caused by resource contention and ensures efficient workload distribution.
2
Regularly monitoring key performance metrics is crucial for maintaining the health of a YARN cluster.
Metrics such as container allocation throughput and apps pending can provide early warnings of performance degradation, allowing teams to address issues proactively.
3
Utilizing simulation tools like DynoYARN can help predict how system changes will affect performance.
By simulating various workloads, teams can make informed decisions about scaling and optimizations before implementing changes in production.

Common Pitfalls

1
Assuming that YARN's single-threaded scheduling mechanism can indefinitely handle growth without issues.
This misconception led to significant performance degradation as the cluster size increased, highlighting the importance of understanding the limitations of the underlying architecture.
2
Neglecting to monitor performance metrics, which can result in delayed responses to emerging issues.
Without proper monitoring, teams may miss early signs of performance degradation, leading to more severe problems down the line.

Related Concepts

Hadoop Architecture
Yarn Scheduling Mechanisms
Load Balancing In Distributed Systems
Performance Monitoring Tools