Scaling Uber’s Apache Hadoop Distributed File System for Growth

Ang Zhang, Wei Yan
13 min readadvanced
--
View Original

Overview

The article discusses Uber's experience in scaling its Apache Hadoop Distributed File System (HDFS) to accommodate exponential data growth while maintaining performance. It outlines the challenges faced, the solutions implemented, and the lessons learned during this process.

What You'll Learn

1

How to scale HDFS using ViewFs for multiple namespaces

2

Why garbage collection tuning is essential for HDFS performance

3

How to manage small files in HDFS to improve efficiency

Prerequisites & Requirements

  • Understanding of Hadoop and HDFS architecture
  • Familiarity with Apache Hadoop tools and configurations(optional)

Key Questions Answered

What challenges did Uber face while scaling HDFS?
Uber faced challenges such as high NameNode RPC queue times, the need for rapid data access, and the management of small files that increased memory pressure on the NameNode. These issues arose due to the exponential growth of data and the demands of thousands of users running queries.
How did Uber improve HDFS performance during scaling?
Uber implemented several strategies including upgrading HDFS versions, tuning garbage collection, and creating a dedicated HDFS load management service. These improvements allowed them to scale their infrastructure over 400 percent while enhancing overall performance.
What is the purpose of the Observer NameNode?
The Observer NameNode is designed as a read-only replica of the active NameNode, aimed at reducing the load on the active NameNode cluster. This is particularly beneficial as more than half of HDFS RPC volume comes from read-only queries, which can help scale throughput significantly.
What strategies did Uber use to control the number of small files in HDFS?
Uber built new ingestion pipelines using the Hoodie library to generate larger files and created a tool to merge small files into larger ones. They also enforced strict namespace quotas to encourage users to optimize file sizes, which helped reduce the number of small files.

Key Statistics & Figures

Data growth on HDFS
400 percent
In 2017 alone, the amount of data stored on HDFS grew more than 400 percent due to increased business demands.
NameNode RPC queue time
500 milliseconds
At times, NameNode queue time could exceed 500 milliseconds per request, significantly slowing down HDFS operations.
Reduction in GC pause time
from 13 percent to 1.7 percent
By increasing the young generation size and tuning parameters, the total time spent on GC pause was reduced from 13 percent to 1.7 percent.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend
Apache Hadoop
Used as the primary infrastructure for big data analysis and storage.
Backend
Viewfs
Used to manage multiple namespaces in HDFS.
Backend
Flink
Used in the HDFS load management service for real-time processing.
Backend
Kafka
Used in conjunction with Flink for streaming audit logs from the NameNode.

Key Actionable Insights

1
Implement garbage collection tuning to enhance HDFS performance.
Tuning garbage collection can significantly reduce pause times and improve throughput, which is crucial for maintaining performance as data loads increase.
2
Utilize ViewFs to manage multiple namespaces effectively.
By implementing ViewFs, organizations can split HDFS into multiple physical namespaces while presenting a unified interface to users, thus enhancing scalability.
3
Control the number of small files to optimize HDFS performance.
Encouraging users to create larger files and implementing tools to merge small files can alleviate memory pressure on the NameNode and improve overall system efficiency.

Common Pitfalls

1
Failing to manage small files can lead to significant performance degradation.
Since the NameNode loads all file metadata into memory, an excess of small files increases memory pressure, leading to slower performance. Implementing strategies to merge small files and optimize file sizes is crucial.

Related Concepts

Hadoop Architecture
Hdfs Performance Tuning
Data Ingestion Strategies