Upgrading Data Warehouse Infrastructure at Airbnb

This post describes Airbnb’s experience upgrading its Data Warehouse infrastructure to Spark and Iceberg.

Ronnie Zhu
10 min read · advanced

Overview

This article discusses Airbnb's upgrade of its Data Warehouse infrastructure to Spark 3 and Iceberg, addressing challenges faced with the previous Hive and S3 setup. It details the motivations for the upgrade, the challenges encountered, and the benefits realized from the new technology stack, particularly in data ingestion processes.

What You'll Learn

  1. How to implement Apache Iceberg for efficient data partitioning
  2. Why Adaptive Query Execution improves Spark performance
  3. How to migrate data ingestion frameworks from Hive to Spark

Prerequisites & Requirements

  • Understanding of data warehousing concepts
  • Familiarity with Apache Spark and Iceberg (optional)

Key Questions Answered

What challenges did Airbnb face with their previous data warehouse infrastructure?
Airbnb faced several challenges including bottlenecks from the Hive Metastore due to increasing partitions, inefficient interactions between Hive and S3, issues with schema evolution across different compute engines, and limitations in partitioning flexibility. These challenges hindered scalability and productivity for users.
How does Apache Iceberg improve data management in data warehouses?
Apache Iceberg enhances data management by allowing flexible partition specifications, eliminating the need for S3 listings, and providing consistent schema evolution across compute engines. This results in reduced load on the Hive Metastore and improved query performance.
What performance improvements were observed after upgrading to Spark 3 and Iceberg?
After the upgrade, Airbnb achieved over 50% savings in compute resources and a 40% reduction in job elapsed time for their data ingestion framework. This demonstrates significant efficiency gains in processing and managing large datasets.
What is Adaptive Query Execution and how does it help in Spark?
Adaptive Query Execution (AQE) is a feature in Spark 3 that optimizes query execution plans based on runtime statistics. It dynamically adjusts partition sizes and join strategies during query execution, leading to improved performance and resource utilization.
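
To make the idea of adjusting partition sizes at runtime more concrete, here is a minimal sketch in plain Python. It is a simplified model, not Spark's actual implementation: the function name `coalesce_partitions` and the sample sizes are hypothetical, and the 64 MB target is only meant to mirror the documented default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`. It greedily merges adjacent small shuffle partitions once their real sizes are known.

```python
from typing import List


def coalesce_partitions(sizes_mb: List[int], target_mb: int = 64) -> List[List[int]]:
    """Group adjacent shuffle partitions so each group stays at or under
    target_mb when possible. A partition larger than the target stays alone.

    Toy model only: real AQE works on shuffle statistics inside Spark.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes_mb:
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Hypothetical post-shuffle partition sizes in MB.
    sizes = [5, 10, 40, 30, 3, 2, 70, 8]
    print(coalesce_partitions(sizes))
    # → [[5, 10, 40], [30, 3, 2], [70], [8]]
```

In this toy run, eight uneven partitions collapse into four tasks. The point is that the grouping is decided from observed sizes rather than from a fixed partition count chosen before the job starts.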

Key Statistics & Figures

  • Compute resource savings: over 50%. This was achieved in the data ingestion framework after migrating to Spark 3 and Iceberg.
  • Job elapsed time reduction: 40%. This reduction was noted in the data ingestion framework following the upgrade.
  • Event messages processed daily: >35 billion. This volume reflects the scale of data handled by Airbnb's data ingestion framework.

Key Actionable Insights

  1. Implementing Apache Iceberg can significantly enhance your data warehouse's performance and flexibility.
     By adopting Iceberg, you can streamline data ingestion processes and improve query performance, especially for large datasets with varying time granularities.
  2. Utilizing Adaptive Query Execution in Spark can optimize resource usage and improve job performance.
     AQE dynamically adjusts execution plans based on real-time data characteristics, which can lead to better handling of variable data sizes and improved overall efficiency.
  3. Migrating from Hive to Spark requires careful tuning of parameters to achieve optimal performance.
     Understanding the differences in how Spark and Hive handle data can help you effectively tune your data ingestion framework for better performance.
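
The "varying time granularities" point in the first insight can be illustrated with Iceberg's time-based partition transforms, which derive the partition value from a timestamp column (in Spark DDL these appear as `days(ts)` and `hours(ts)`). The sketch below is plain Python that mimics what those two transforms compute, assuming UTC timestamps; the function names and the sample events are hypothetical and this is not Iceberg's own code.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01 (UTC), like a day-granularity partition value."""
    return (ts - EPOCH).days


def hours_transform(ts: datetime) -> int:
    """Whole hours since 1970-01-01 00:00 (UTC), like an hour-granularity value."""
    delta = ts - EPOCH
    return delta.days * 24 + delta.seconds // 3600


if __name__ == "__main__":
    # Two hypothetical events on the same day, in different hours.
    morning = datetime(2022, 1, 1, 9, 15, tzinfo=timezone.utc)
    afternoon = datetime(2022, 1, 1, 13, 45, tzinfo=timezone.utc)

    # Daily partitioning puts both events in one partition...
    print(days_transform(morning), days_transform(afternoon))    # → 18993 18993
    # ...while hourly partitioning separates them.
    print(hours_transform(morning), hours_transform(afternoon))  # → 455841 455845
```

Because the partition value is computed from the column rather than stored as a separate string column, the same timestamp can serve a table partitioned by day or by hour, which is one way to read the flexibility described above.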

Common Pitfalls

  1. Underestimating the complexity of migrating from Hive to Spark can lead to performance issues.
     Many organizations may overlook the need for careful tuning of parameters and understanding the differences in execution models, which can result in suboptimal performance.
  2. Failing to adapt to the dynamic nature of data sizes can hinder performance in Spark jobs.
     Without utilizing features like Adaptive Query Execution, Spark jobs may struggle with varying data sizes, leading to inefficient resource utilization and longer job times.
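
As a starting point for avoiding the second pitfall, the AQE-related settings can be kept in one place and passed to a job explicitly. The sketch below is an assumption-laden example, not Airbnb's configuration: the configuration keys are standard Spark 3 SQL options, but the values and the helper name `to_submit_args` are illustrative only and should be tuned per workload.

```python
from typing import Dict, List

# Standard Spark 3 AQE options; the values here are illustrative, not tuned.
AQE_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64MB",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):  # sorted for a stable, reviewable order
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(AQE_CONF)))
```

Keeping these options in a single dictionary makes it easier to review what changed between tuning rounds during a Hive-to-Spark migration.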

Related Concepts

  • Data Warehousing
  • Apache Spark
  • Apache Iceberg
  • Data Ingestion Frameworks