Kumudini Kakwani, Akshayaprakash Sharma, Nimesh Khandelwal, Aayush Chaturvedi, Chintan Betrabet, Suprit Acharya • 14 min read • intermediate
Overview
This article details Uber's migration from Apache Hive to Apache Spark SQL for ETL workloads, highlighting the motivations behind the transition, the architecture involved, and the challenges faced during the migration process. It emphasizes the performance improvements achieved and the strategies implemented to ensure a smooth transition with minimal user intervention.
What You'll Learn
1. How to leverage Apache Spark SQL for improved ETL performance
2. Why shadow testing is crucial during migration processes
3. How to implement automated migration services for legacy systems
Prerequisites & Requirements
- Understanding of ETL processes and data querying
- Familiarity with Apache Spark and Hive (optional)
Key Questions Answered
What were the main motivations for Uber's migration from Hive to Spark SQL?
Uber migrated from Hive to Spark SQL primarily for improved compute efficiency and modernization. Spark SQL performs better thanks to features such as adaptive query execution and dynamic partition pruning, and initial workload results showed up to a 4x performance benefit over Hive.
How did Uber automate the migration of workflows from Hive to Spark SQL?
Uber implemented an Automated Migration Service (AMS) that orchestrated shadow testing by translating Hive queries to Spark SQL and running them in shadow mode. This service ensured that the performance of the queries remained the same or improved, facilitating a seamless transition.
What challenges did Uber face during the migration process?
Uber encountered several challenges, including differences in query syntax between Hive and Spark SQL, handling floating-point arithmetic precision issues, and ensuring data consistency across systems. These challenges required careful planning and the development of custom solutions.
What strategies were used to validate data consistency between Hive and Spark SQL?
To validate data consistency, Uber developed a Data Validation Service that executed a series of assertions comparing datasets generated by Hive and Spark. This included row count validation and row-level checksum comparisons to ensure accuracy.
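The two assertions named above (row count and row-level checksum) can be sketched in a few lines. This is a minimal illustration in plain Python, not Uber's Data Validation Service: the function names `row_checksum` and `validate`, the choice of SHA-256, and the in-memory lists of tuples standing in for Hive and Spark output are all assumptions made for the example.

```python
import hashlib
from collections import Counter

def row_checksum(row):
    """Hash one row. Values are joined with a unit separator so ('ab', 'c') != ('a', 'bc')."""
    encoded = "\x1f".join("NULL" if v is None else str(v) for v in row)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()

def validate(hive_rows, spark_rows):
    """Return the names of failed assertions; an empty list means the datasets match."""
    failures = []
    # Assertion 1: both engines produced the same number of rows
    if len(hive_rows) != len(spark_rows):
        failures.append("row_count")
    # Assertion 2: compare checksums as multisets, so row order is ignored
    # but duplicated or missing rows are still detected
    hive_sums = Counter(row_checksum(r) for r in hive_rows)
    spark_sums = Counter(row_checksum(r) for r in spark_rows)
    if hive_sums != spark_sums:
        failures.append("row_checksum")
    return failures

# Hypothetical outputs of the same query on each engine (row order differs)
hive_out = [(1, "sf", 10.5), (2, "nyc", None)]
spark_out = [(2, "nyc", None), (1, "sf", 10.5)]
print(validate(hive_out, spark_out))
```

Comparing checksums as a multiset rather than a sorted list keeps the check independent of output ordering, which two engines are not guaranteed to share.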
Key Statistics & Figures
- Monthly queries handled: 5 million. This was the volume of queries processed by Hive before the migration to Spark SQL.
- Performance improvement: up to 4x. Initial workload results showed that Spark SQL outperformed Hive significantly.
- Runtime and resource usage reduction: 50%. The migration resulted in a substantial decrease in both runtime and resource consumption.
Technologies & Tools
- Apache Hive (Data Processing): Used for ETL workloads prior to migration.
- Apache Spark SQL (Data Processing): The new engine adopted for improved ETL performance.
Key Actionable Insights
1. Implement shadow testing when migrating ETL workflows to ensure data consistency and performance. Shadow testing allows you to run new queries alongside existing ones without impacting production data, providing a safety net during migrations.
2. Utilize automated migration services to reduce developer effort and streamline the transition process. By automating the migration, Uber minimized the manual workload on developers, allowing them to focus on other critical tasks while ensuring a smooth transition.
3. Leverage community support and open-source contributions to enhance performance and capabilities. Engaging with the open-source community can provide valuable insights and improvements, as seen with Uber's adaptation of features from Spark's open PRs.
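The shadow-testing loop described in the first insight can be sketched as follows. This is a hypothetical outline, not the design of Uber's Automated Migration Service: `shadow_test`, `ShadowResult`, and the injected `translate`, `run_hive`, and `run_spark` callables are names invented for the example, and the stub runners return canned rows and runtimes instead of executing real queries.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Rows = List[tuple]
Runner = Callable[[str], Tuple[Rows, float]]  # returns (rows, runtime in seconds)

@dataclass
class ShadowResult:
    rows_match: bool
    hive_seconds: float
    spark_seconds: float

    @property
    def safe_to_migrate(self) -> bool:
        # Migrate only if output matches and Spark is the same speed or faster
        return self.rows_match and self.spark_seconds <= self.hive_seconds

def shadow_test(hive_query: str, translate: Callable[[str], str],
                run_hive: Runner, run_spark: Runner) -> ShadowResult:
    spark_query = translate(hive_query)                  # Hive SQL -> Spark SQL
    hive_rows, hive_seconds = run_hive(hive_query)       # existing production path
    spark_rows, spark_seconds = run_spark(spark_query)   # shadow path, output kept separate
    # Order-insensitive comparison; repr gives a sort key that tolerates None values
    rows_match = sorted(hive_rows, key=repr) == sorted(spark_rows, key=repr)
    return ShadowResult(rows_match, hive_seconds, spark_seconds)

# Stub runners standing in for real engine clients (assumed values)
def fake_hive(query: str):
    return [(1, "a"), (2, "b")], 120.0

def fake_spark(query: str):
    return [(2, "b"), (1, "a")], 45.0

result = shadow_test("SELECT id, name FROM t", lambda q: q, fake_hive, fake_spark)
print(result.safe_to_migrate)
```

Passing the translator and runners in as callables keeps the comparison logic testable without a cluster; a real setup would also need the shadow run to write to a location that production consumers never read.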
Common Pitfalls
1. Floating point arithmetic can lead to precision errors during data validation. This issue arises when aggregation functions like SUM or AVG are applied, necessitating the introduction of mismatch tolerances and manual identification of problematic columns.
2. Stringified JSON can cause mismatches during validation. To mitigate this, a custom UDF was developed to sort JSON keys before computing checksums, ensuring consistency across datasets.
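Both mitigations above can be illustrated with the Python standard library. This is a sketch of the general techniques rather than Uber's UDF or tolerance logic: the helper names `canonical_json`, `json_checksum`, and `floats_match`, the SHA-256 hash, and the `1e-6` relative tolerance are assumptions chosen for the example.

```python
import hashlib
import json
import math

def canonical_json(text: str) -> str:
    """Re-serialize a JSON string with sorted keys (nested objects included)
    and fixed separators, so key order and whitespace cannot affect the checksum."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

def json_checksum(text: str) -> str:
    return hashlib.sha256(canonical_json(text).encode("utf-8")).hexdigest()

def floats_match(a: float, b: float, rel_tol: float = 1e-6) -> bool:
    """Treat two aggregates as equal when they differ by at most a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)

# Same JSON document, different key order and spacing
hive_json = '{"a": 1, "b": {"d": 2, "c": 3}}'
spark_json = '{"b":{"c":3,"d":2},"a":1}'
print(json_checksum(hive_json) == json_checksum(spark_json))

# Summation order can change the last bits of a floating-point SUM
print(floats_match(0.1 + 0.2, 0.3))
```

A tolerance should be applied only to columns known to hold floating-point aggregates; applying it everywhere would hide genuine mismatches in exact columns.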
Related Concepts
ETL Processes
Data Validation Techniques
Apache Spark Performance Optimization