Cost-Efficient Open Source Big Data Platform at Uber

Zheng Shao, Mohammad Islam
18 min read · advanced

Overview

The article describes Uber's work to make its Big Data platform more cost-efficient: optimizing file formats, adopting HDFS erasure coding, improving YARN scheduling, and using Apache Hudi for incremental processing. It frames efficiency as a matter of balancing the supply of and demand for data processing resources.

What You'll Learn

  1. How to optimize file formats for better data processing efficiency
  2. Why HDFS erasure coding can significantly reduce storage costs
  3. How to implement dynamic scheduling policies in YARN
  4. When to use Apache Hudi for incremental data processing

Prerequisites & Requirements

  • Understanding of Big Data concepts and technologies
  • Familiarity with the Apache Hadoop ecosystem (optional)

Key Questions Answered

What are the benefits of using ZSTD compression in Parquet files?
At levels 9 and 19, ZSTD reduces Parquet file sizes by 8% and 12%, respectively, compared with GZIP level 6, and it also decompresses faster. Smaller files lower storage costs and faster decompression speeds up reads, which makes ZSTD a preferred choice for Uber's data processing needs.
How does Uber manage YARN cluster capacity effectively?
Uber employs a Dynamic MAX algorithm to adjust YARN queue capacities based on historical usage patterns. This approach allows for better resource allocation during peak times while maintaining overall cluster efficiency, ensuring that users have clear expectations of resource availability.
What challenges does Uber face with YARN scheduling?
Uber must keep utilization of its YARN cluster high while still meeting user expectations for resource availability. Resource demand from individual teams is spiky, which complicates this balance and calls for scheduling approaches that go beyond static allocation.
What is the significance of Apache Hudi in Uber's data processing?
Apache Hudi enables efficient incremental processing of data, allowing Uber to handle late-arriving or modified data without re-scanning entire datasets. This capability significantly reduces computational overhead and improves query performance.
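
The incremental pattern in the last answer can be illustrated with a toy model. This is a minimal sketch and does not use Hudi's API: the `Record` type, the `incremental_pull` and `apply_upserts` helpers, and the trip data are all hypothetical. The idea it shows is that a downstream job remembers the last commit it processed and applies only newer changes, including corrections to old records. Note that this toy filter still walks the whole list; a system like Hudi uses its commit metadata to locate the changed files instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str        # record key (hypothetical, e.g. a trip id)
    value: float    # payload (hypothetical, e.g. a fare)
    commit_ts: int  # when the write was committed, not when the event happened


def incremental_pull(records, last_commit_ts):
    """Return only the records committed after the given checkpoint."""
    return [r for r in records if r.commit_ts > last_commit_ts]


def apply_upserts(derived, changes):
    """Upsert changes into a derived table keyed by record key.

    Returns the highest commit timestamp applied, or None if nothing changed.
    """
    newest = None
    for r in sorted(changes, key=lambda rec: rec.commit_ts):
        derived[r.key] = r.value  # insert a new key or overwrite an old value
        newest = r.commit_ts
    return newest


source = [
    Record("trip-1", 10.0, commit_ts=1),
    Record("trip-2", 20.0, commit_ts=2),
]
derived = {}
checkpoint = 0

# First run: every record is newer than the initial checkpoint.
changes = incremental_pull(source, checkpoint)
checkpoint = apply_upserts(derived, changes) or checkpoint

# Later, a correction to trip-1 and a brand-new trip are committed.
source.append(Record("trip-1", 12.5, commit_ts=3))
source.append(Record("trip-3", 30.0, commit_ts=4))

# Second run: only the two new commits are processed.
changes = incremental_pull(source, checkpoint)
checkpoint = apply_upserts(derived, changes) or checkpoint
print(len(changes), derived, checkpoint)
```

Filtering on commit time rather than event time is what lets the late correction to `trip-1` reach the derived table without reprocessing `trip-2`.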

Key Statistics & Figures

Reduction in Parquet file size with ZSTD Level 9
8%
Compared to GZIP Level 6 compression, ZSTD Level 9 offers significant size reductions.
Reduction in Parquet file size with ZSTD Level 19
12%
This compression level provides even greater efficiency, albeit with slower compression speeds.
Effective HDFS replication factor with erasure coding
1.67x and 1.5x
Erasure coding lowers the effective replication factor from HDFS's default of 3x, cutting the space needed for HDFS files by up to about half.
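
The two replication factors above can be checked with simple arithmetic. The sketch below assumes Reed-Solomon layouts of 3 data + 2 parity blocks and 6 data + 3 parity blocks, which are the layouts that produce 1.67x and 1.5x; the summary itself does not name the schemes. The 3x baseline is HDFS's default replication factor, and the function names are illustrative.

```python
def effective_replication(data_blocks, parity_blocks):
    """Bytes stored per byte of user data under an erasure-coding layout."""
    return (data_blocks + parity_blocks) / data_blocks


def space_saving(new_factor, baseline_factor=3.0):
    """Fraction of raw disk space saved relative to plain replication."""
    return 1.0 - new_factor / baseline_factor


# Assumed layouts: 3 data + 2 parity, and 6 data + 3 parity.
rs_3_2 = effective_replication(3, 2)
rs_6_3 = effective_replication(6, 3)

print(round(rs_3_2, 2), round(space_saving(rs_3_2) * 100, 1))  # 1.67 44.4
print(round(rs_6_3, 2), round(space_saving(rs_6_3) * 100, 1))  # 1.5 50.0
```

Under these assumptions, moving a file from 3x replication to the 6+3 layout halves its raw footprint, while 3+2 saves a little under 45%.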

Technologies & Tools

Backend
Apache Hadoop
Used for distributed storage and processing of large datasets.
Backend
Apache Hudi
Facilitates efficient incremental data processing.
Data Format
Apache Parquet
Columnar storage format used for efficient data processing.
Data Streaming
Apache Kafka
Used for real-time data ingestion and processing.
Data Processing
Apache Spark
Used for large-scale data processing tasks.
Resource Management
Apache Hadoop YARN
Manages resources for distributed applications.

Key Actionable Insights

  1. Implement ZSTD compression for Parquet files to enhance storage efficiency.
     Switching to ZSTD can lead to reduced file sizes and faster decompression, which is crucial for managing large datasets effectively.
  2. Utilize Dynamic MAX scheduling in YARN to optimize resource allocation.
     This approach allows for better management of spiky workloads, ensuring that resources are available when needed while maintaining overall cluster efficiency.
  3. Adopt Apache Hudi for handling incremental data updates.
     Using Hudi can streamline data processing workflows, reducing the need for extensive data scans and improving query response times.
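
The Dynamic MAX insight above says only that queue limits follow historical usage; this summary does not give the formula. As a hedged illustration, the sketch below assumes one rule of that kind: a queue may burst above its guaranteed MIN in the coming hour as long as its average over the 24-hour window stays at or below MIN, that is, `max(MIN, MIN * 24 - usage over the last 23 hours)`. The function name, the units, and the rule itself are assumptions and should be checked against the original article.

```python
def dynamic_max(queue_min, usage_last_23h):
    """Assumed rule: next-hour cap that keeps the 24-hour average at or below MIN.

    queue_min      -- guaranteed capacity of the queue (any unit, e.g. vcores)
    usage_last_23h -- 23 hourly usage samples in the same unit
    """
    if len(usage_last_23h) != 23:
        raise ValueError("expected 23 hourly usage samples")
    budget_24h = queue_min * 24           # total allowance over the window
    already_used = sum(usage_last_23h)    # what the queue has consumed so far
    # Never drop below the guaranteed MIN, even if the queue overspent.
    return max(queue_min, budget_24h - already_used)


quiet_queue = dynamic_max(100, [40] * 23)   # used far less than its MIN
busy_queue = dynamic_max(100, [120] * 23)   # ran above its MIN all day
print(quiet_queue, busy_queue)              # 1480 100
```

In this sketch a queue that has been quiet earns a large burst allowance, while one that has run hot falls back to its guarantee. A real scheduler would also cap the result at the cluster's total capacity.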

Common Pitfalls

  1. Failing to optimize file formats can lead to increased storage costs and slower query performance.
     Many organizations overlook the importance of choosing the right file format, which can significantly impact both performance and cost efficiency.
  2. Ignoring the need for dynamic scheduling can result in resource wastage.
     Static scheduling policies may not adapt well to fluctuating workloads, leading to underutilization of resources during off-peak times.

Related Concepts

Data Processing Optimization Techniques
Cost Efficiency in Big Data Platforms
Resource Management Strategies in Distributed Systems