How Uber Migrated from Hive to Spark SQL for ETL Workloads

Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya

Uber


Overview

This article details Uber's migration from Apache Hive to Apache Spark SQL for ETL workloads, highlighting the motivations behind the transition, the architecture involved, and the challenges faced during the migration process. It emphasizes the performance improvements achieved and the strategies implemented to ensure a smooth transition with minimal user intervention.

What You'll Learn

1. How to leverage Apache Spark SQL for improved ETL performance
2. Why shadow testing is crucial during migration processes
3. How to implement automated migration services for legacy systems

Prerequisites & Requirements

Understanding of ETL processes and data querying
Familiarity with Apache Spark and Hive (optional)

Key Questions Answered

What were the main motivations for Uber's migration from Hive to Spark SQL?

Uber migrated from Hive to Spark SQL primarily for improved compute efficiency and modernization. Spark SQL performs better thanks to features such as adaptive query execution and dynamic partition pruning; initial workload results showed up to 4x better performance than Hive.

How did Uber automate the migration of workflows from Hive to Spark SQL?

Uber implemented an Automated Migration Service (AMS) that orchestrated shadow testing by translating Hive queries to Spark SQL and running them in shadow mode. This service ensured that the performance of the queries remained the same or improved, facilitating a seamless transition.
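The shadow-mode decision described above can be sketched as follows. This is a minimal illustration, not Uber's AMS: the names `RunResult`, `timed_run`, `is_safe_to_migrate`, and the `max_slowdown` parameter are hypothetical, and the two query runners are assumed to be plain callables that return rows.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

Row = Tuple  # one result row; values are assumed to be mutually comparable


@dataclass
class RunResult:
    rows: List[Row]
    seconds: float


def timed_run(run_query: Callable[[], List[Row]]) -> RunResult:
    """Run a query callable and record its wall-clock time."""
    start = time.perf_counter()
    rows = run_query()
    return RunResult(rows=rows, seconds=time.perf_counter() - start)


def is_safe_to_migrate(hive: RunResult, spark: RunResult,
                       max_slowdown: float = 1.0) -> bool:
    """Hypothetical gate: the shadow Spark run must produce the same rows
    (order-insensitive) and must not be slower than the Hive run."""
    same_data = sorted(hive.rows) == sorted(spark.rows)
    fast_enough = spark.seconds <= hive.seconds * max_slowdown
    return same_data and fast_enough
```

In this sketch the Spark run is assumed to write to a shadow location, so the production output is still produced by Hive until the gate passes.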

What challenges did Uber face during the migration process?

Uber encountered several challenges, including differences in query syntax between Hive and Spark SQL, handling floating-point arithmetic precision issues, and ensuring data consistency across systems. These challenges required careful planning and the development of custom solutions.

What strategies were used to validate data consistency between Hive and Spark SQL?

To validate data consistency, Uber developed a Data Validation Service that executed a series of assertions comparing datasets generated by Hive and Spark. This included row count validation and row-level checksum comparisons to ensure accuracy.
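As a rough illustration of the two checks named above, the sketch below compares row counts and row-level checksums for two in-memory result sets. It is an assumption-laden stand-in rather than the Data Validation Service itself: `row_checksum`, `validate`, the SHA-256 hash, and the NULL encoding are all choices made for this example.

```python
import hashlib
from collections import Counter
from typing import Iterable, Sequence


def row_checksum(row: Sequence) -> str:
    """Hash one row. Columns are joined with a unit separator and
    NULLs get their own marker so None and "None" hash differently."""
    encoded = "\x1f".join("\\N" if v is None else str(v) for v in row)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def validate(hive_rows: Iterable[Sequence],
             spark_rows: Iterable[Sequence]) -> dict:
    """Compare two datasets by row count and by multiset of row checksums."""
    hive_rows, spark_rows = list(hive_rows), list(spark_rows)
    hive_sums = Counter(row_checksum(r) for r in hive_rows)
    spark_sums = Counter(row_checksum(r) for r in spark_rows)
    return {
        "row_count_match": len(hive_rows) == len(spark_rows),
        "checksum_match": hive_sums == spark_sums,
        # Counter subtraction keeps only positive counts, i.e. unmatched rows.
        "only_in_hive": sum((hive_sums - spark_sums).values()),
        "only_in_spark": sum((spark_sums - hive_sums).values()),
    }
```

Using a multiset of checksums makes the comparison independent of row order and still catches duplicated or missing rows, which a row count alone would not localize.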

Key Statistics & Figures

Monthly queries handled

5 million

This was the volume of queries processed by Hive before the migration to Spark SQL.

Performance improvement

up to 4x

Initial workload results showed that Spark SQL outperformed Hive significantly.

Runtime and resource usage reduction

50%

The migration resulted in a substantial decrease in both runtime and resource consumption.

Technologies & Tools


Data Processing

Apache Hive

Used for ETL workloads prior to migration.

Data Processing

Apache Spark SQL

The new engine adopted for improved ETL performance.

Key Actionable Insights

1. Implement shadow testing when migrating ETL workflows to ensure data consistency and performance.
Shadow testing allows you to run new queries alongside existing ones without impacting production data, providing a safety net during migrations.

2. Utilize automated migration services to reduce developer effort and streamline the transition process.
By automating the migration, Uber minimized the manual workload on developers, allowing them to focus on other critical tasks while ensuring a smooth transition.

3. Leverage community support and open-source contributions to enhance performance and capabilities.
Engaging with the open-source community can provide valuable insights and improvements, as seen with Uber's adaptation of features from Spark's open PRs.

Common Pitfalls

1. Floating point arithmetic can lead to precision errors during data validation.
This issue arises when aggregation functions like SUM or AVG are applied, necessitating the introduction of mismatch tolerances and manual identification of problematic columns.

2. Stringified JSON can cause mismatches during validation.
To mitigate this, a custom UDF was developed to sort JSON keys before computing checksums, ensuring consistency across datasets.
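Both pitfalls can be illustrated with a short sketch. The functions below are plain-Python stand-ins, not the Spark UDF mentioned above: `canonical_json`, `json_checksum`, `floats_match`, and the default tolerance of `1e-9` are assumptions made for this example.

```python
import hashlib
import json
import math


def canonical_json(text: str) -> str:
    """Re-serialize a JSON string with keys sorted (recursively) and
    whitespace removed, so equivalent documents become identical text."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


def json_checksum(text: str) -> str:
    """Checksum computed on the canonical form rather than the raw string."""
    return hashlib.sha256(canonical_json(text).encode("utf-8")).hexdigest()


def floats_match(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Compare aggregates such as SUM or AVG with a relative tolerance,
    since summation order can change the last few bits of the result."""
    return math.isclose(a, b, rel_tol=rel_tol)
```

For example, `json_checksum('{"b": 1, "a": 2}')` equals `json_checksum('{"a":2,"b":1}')` even though the raw strings differ, and `floats_match(0.1 + 0.2, 0.3)` is true although the two values are not exactly equal. The tolerance is a per-column judgment call, which is consistent with the need to identify problematic columns manually.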

Related Concepts

ETL Processes

Data Validation Techniques

Apache Spark Performance Optimization