Uber Case Study: Choosing the Right HDFS File Format for Your Apache Spark Jobs

Scott Short
12 min read · intermediate

Overview

This article discusses Uber's approach to selecting the appropriate HDFS file formats for processing large volumes of imagery and metadata using Apache Spark. It highlights the advantages and disadvantages of different file formats, including SequenceFiles, Avro, and Parquet, and provides insights into optimizing data processing workflows.

What You'll Learn

1. How to choose the right file format for processing large datasets in Apache Spark
2. Why SequenceFiles are preferred for ingesting raw data into HDFS
3. When to use Avro for intermediate data processing in Spark jobs
4. How to optimize final output formats for efficient querying and filtering

Prerequisites & Requirements

  • Understanding of Apache Spark and HDFS
  • Familiarity with file formats such as Avro and Parquet (optional)

Key Questions Answered

What file formats does Uber use for processing imagery and metadata?
Uber utilizes SequenceFiles for ingesting data, Avro for intermediate data processing, and Parquet for final output. Each format is chosen based on its strengths in handling specific types of data and processing requirements.
How does the choice of file format impact data processing performance?
The choice of file format significantly affects performance metrics such as write speed, query execution time, and I/O consumption. For instance, Parquet can reduce I/O by up to 92.5% compared to Avro during queries, enhancing overall efficiency.
When should Avro be used instead of Parquet?
Avro is preferred for intermediate data that is written once and read once without filtering. Avro has lower write overhead than Parquet, whose writes are more resource-intensive for large binary data.
What are the benefits of using Parquet for final output?
Parquet provides efficient querying and filtering capabilities due to its columnar storage format, which allows for better compression and faster read times compared to row-based formats like Avro.
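
The answers above amount to a small decision rule based on where the data sits in the pipeline and how it will be read. A minimal sketch of that rule follows; the function name and its two flags are hypothetical and only illustrate the reasoning, they are not taken from Uber's code.

```python
def choose_format(raw_ingest: bool, filtered_reads: bool) -> str:
    """Pick a file format from the access pattern described in the article.

    raw_ingest:     the data is raw blobs being landed in HDFS.
    filtered_reads: downstream readers query or filter the data.
    """
    if raw_ingest:
        # Ingestion of raw blob data favors efficient writes.
        return "SequenceFile"
    if filtered_reads:
        # Final output that is queried and filtered benefits from columnar storage.
        return "Parquet"
    # Intermediate data: written once, read once, no filtering.
    return "Avro"


print(choose_format(raw_ingest=True, filtered_reads=False))
print(choose_format(raw_ingest=False, filtered_reads=False))
print(choose_format(raw_ingest=False, filtered_reads=True))
```

A real pipeline would weigh more factors (schema evolution, compression codec, downstream tooling), so treat this as a summary of the article's guidance rather than a complete policy.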

Key Statistics & Figures

I/O consumption relative to Avro
7.5%
Parquet queries consumed 7.5% of the I/O required by the equivalent Avro queries, a 92.5% reduction.
Query execution speed relative to Avro
290%
Parquet queries executed almost three times faster than the equivalent Avro queries.
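
The two figures are easier to compare once they are put on the same footing. The short calculation below converts the 7.5% I/O figure into a reduction and the speed figure into a fraction of Avro's query time. It assumes that "290%" means Parquet ran at roughly 2.9 times Avro's speed, which is one reading of "almost three times faster" and not something the article states explicitly.

```python
io_fraction = 0.075  # Parquet I/O as a fraction of Avro I/O (from the article)
speedup = 2.9        # assumption: "290%" read as about 2.9x Avro's query speed

io_reduction = 1 - io_fraction  # share of Avro's I/O that Parquet avoids
time_fraction = 1 / speedup     # Parquet query time as a share of Avro's

print(f"I/O reduction: {io_reduction:.1%}")        # I/O reduction: 92.5%
print(f"Query time vs Avro: {time_fraction:.1%}")  # Query time vs Avro: 34.5%
```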

Technologies & Tools

  • Backend: Apache Spark. Used for processing large volumes of imagery and metadata.
  • Storage: Hadoop Distributed File System (HDFS). Used for storing ingested data and processed outputs.
  • File format: SequenceFile. Used for efficient writes of blob data during ingestion.
  • File format: Avro. Used for intermediate data processing due to its efficient write performance.
  • File format: Parquet. Used for final output to optimize querying and filtering.

Key Actionable Insights

1. Choose SequenceFiles for ingesting large volumes of raw data into HDFS to optimize resource usage and processing speed.
   Using SequenceFiles minimizes memory consumption on NameNodes and enhances the efficiency of Spark jobs, especially when dealing with a high number of files.
2. Utilize Avro for intermediate data processing to take advantage of its efficient write performance and schema support.
   Avro's ability to handle blob data efficiently makes it suitable for scenarios where data is processed in bulk without filtering.
3. Leverage Parquet for final output to enhance query performance and reduce I/O costs.
   Parquet's columnar format allows for faster data retrieval and lower resource consumption, making it ideal for analytics and reporting tasks.
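
Why a columnar layout cuts I/O for filtered queries can be shown with a toy byte count. The sketch below compares how many bytes a filter on one small field must scan when records are stored row by row versus column by column. The record shape, field names, and sizes are all hypothetical, and the model ignores compression, file metadata, and the cost of fetching the blob for rows that match, so the numbers are illustrative only and do not reproduce the article's measurements.

```python
INT_BYTES = 8      # assumption: each integer field is stored in 8 bytes
BLOB_BYTES = 1000  # assumption: hypothetical image payload size

# Hypothetical records: two small fields next to a large binary blob.
records = [
    {"id": i, "score": i % 10, "image": bytes(BLOB_BYTES)}
    for i in range(100)
]


def row_bytes_scanned(rows):
    """Row layout: finding rows by `score` reads every full record."""
    return sum(2 * INT_BYTES + len(r["image"]) for r in rows)


def column_bytes_scanned(rows):
    """Column layout: the same filter reads only the `score` column."""
    return INT_BYTES * len(rows)


row_total = row_bytes_scanned(records)
col_total = column_bytes_scanned(records)
print(row_total, col_total)  # 101600 800
```

The same model also hints at why a row format suits the intermediate stage: when every field of every record is read exactly once, both layouts scan the same bytes, and the simpler row-oriented write path is the cheaper choice.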

Common Pitfalls

1. Using too many small files in HDFS can lead to inefficient resource utilization.
   Small files increase the overhead on NameNodes and can slow down processing. It's better to consolidate data into fewer large files to enhance performance.
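
A back-of-the-envelope estimate shows how quickly small files add up on the NameNode. The sketch below assumes about 150 bytes of NameNode memory per file or block object (a commonly cited rule of thumb, not a figure from the article) and a 128 MiB block size; the file counts and sizes are hypothetical.

```python
import math

BYTES_PER_OBJECT = 150          # assumption: rule-of-thumb NameNode cost per object
BLOCK_SIZE = 128 * 1024 * 1024  # assumption: 128 MiB HDFS block size
MIB = 1024 * 1024
GIB = 1024 * MIB


def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


# Ten million hypothetical 1 MiB files.
small = namenode_bytes(10_000_000, MIB)

# The same total volume consolidated into 1 GiB container files.
total_bytes = 10_000_000 * MIB
large = namenode_bytes(math.ceil(total_bytes / GIB), GIB)

print(small, large)  # 3000000000 13184100
```

Under these assumptions the consolidated layout needs roughly 13 MB of NameNode memory instead of about 3 GB, which is the motivation for packing many small blobs into container files such as SequenceFiles.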