Overview
Uber is modernizing its batch data infrastructure by migrating to Google Cloud Platform (GCP) to enhance data analytics and machine learning capabilities. This transition aims to improve user productivity, engineering velocity, cost efficiency, and data governance.
What You'll Learn
1. How to leverage cloud IaaS for seamless migration of batch data stacks
2. Why using open standards like Apache Parquet and Apache Hudi enhances data compatibility
3. How to implement data access proxies for federating query traffic during migration
Prerequisites & Requirements
- Understanding of cloud computing concepts and data management
- Familiarity with GCP services and the Hadoop ecosystem (optional)
Key Questions Answered
What are the core principles for migrating Uber's data infrastructure to GCP?
The core principles include avoiding painful migrations for data users by maintaining existing workflows, enhancing data access proxies to manage traffic across on-prem and cloud, leveraging existing cloud-agnostic infrastructure, and forecasting potential data governance issues from cloud services.
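As a rough illustration of how a data access proxy might split query traffic between the on-prem and cloud stacks, here is a minimal sketch. The `RoutingPolicy` class, the `choose_cluster` function, and the cluster labels are hypothetical names chosen for this example; they do not describe Uber's actual proxy.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    """Hypothetical policy: share of traffic (0.0 to 1.0) sent to the cloud stack."""
    cloud_fraction: float


def choose_cluster(routing_key: str, policy: RoutingPolicy) -> str:
    """Deterministically route a request to 'cloud' or 'on_prem'.

    The routing key (for example a user or workflow id) is hashed into a
    stable bucket in [0, 100), so the same key always lands on the same stack.
    """
    digest = hashlib.sha256(routing_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "cloud" if bucket < policy.cloud_fraction * 100 else "on_prem"


if __name__ == "__main__":
    policy = RoutingPolicy(cloud_fraction=0.25)
    for key in ("workflow-a", "workflow-b", "workflow-c"):
        print(key, "->", choose_cluster(key, policy))
```

Hashing the key rather than picking at random keeps each workflow on a single stack for a given policy, which makes it easier to raise the cloud share gradually without changing anything on the user's side.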
How does Uber plan to manage data replication during the migration?
Uber will use HiveSync, a permissions-aware, bi-directional data replication service, to ensure that data lakes in both regions remain synchronized. This includes both bulk migration and ongoing incremental updates until the cloud-based stack becomes the primary data source.
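The two replication phases described above, a one-time bulk copy followed by ongoing incremental updates, can be sketched with a toy in-memory model. This is not HiveSync's implementation: the `Lake` type, the version numbers, and both functions are assumptions made for illustration, and the sketch ignores permissions and bi-directional sync.

```python
from typing import Dict, List, Tuple

# Hypothetical model of a data lake: path -> (version, payload).
Lake = Dict[str, Tuple[int, bytes]]


def bulk_copy(source: Lake, target: Lake) -> int:
    """Copy every object from source to target; return how many were copied."""
    target.update(source)
    return len(source)


def incremental_sync(source: Lake, target: Lake) -> List[str]:
    """Copy only objects that are missing from target or newer in source."""
    changed = []
    for path, (version, payload) in source.items():
        current = target.get(path)
        if current is None or current[0] < version:
            target[path] = (version, payload)
            changed.append(path)
    return sorted(changed)


if __name__ == "__main__":
    on_prem: Lake = {"/warehouse/trips/part-0": (1, b"v1")}
    cloud: Lake = {}
    print("bulk copied:", bulk_copy(on_prem, cloud))
    on_prem["/warehouse/trips/part-0"] = (2, b"v2")   # updated object
    on_prem["/warehouse/trips/part-1"] = (1, b"new")  # new object
    print("incrementally synced:", incremental_sync(on_prem, cloud))
```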
What challenges does Uber anticipate during the migration to GCP?
Uber anticipates challenges such as performance differences between the object store and HDFS, governance of cloud usage costs, and the need to migrate non-analytics usage of HDFS to other storage solutions. Uber plans to address these through proactive management and by leveraging cloud elasticity.
Key Statistics & Figures
Data hosted in the Hadoop ecosystem: more than 1 exabyte
This data is distributed across tens of thousands of servers in Uber's two regions.
Technologies & Tools
- Google Cloud Platform (Cloud Service): Used for migrating Uber's batch data analytics and ML training stack.
- Apache Parquet (Data Format): Utilized for data storage to ensure compatibility.
- Apache Hudi (Data Format): Used for managing data lakes and ensuring data consistency.
- Apache Spark (Data Processing): Employed for data analytics and processing tasks.
- Presto (Query Engine): Used for federating query traffic during the migration.
Key Actionable Insights
1. Implementing a cloud storage connector that supports HDFS compatibility can significantly ease the migration process. This approach allows teams to continue using familiar tools and workflows, minimizing disruption during the transition to GCP.
2. Utilizing open standards like Apache Parquet and Apache Hudi can enhance data compatibility across platforms. This ensures that data formats remain consistent and accessible, facilitating smoother integrations and analytics.
3. Establishing clear IAM policies during the migration will help manage access and governance effectively. This proactive measure can prevent potential data governance issues that often arise when transitioning to cloud services.
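To make the first insight more concrete, the sketch below shows one way existing HDFS paths could be translated into Cloud Storage URIs so that jobs keep referring to familiar paths. The `PREFIX_TO_BUCKET` mapping, the bucket names, and `to_cloud_uri` are hypothetical; only the `gs://` scheme reflects how Cloud Storage locations are conventionally addressed.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from HDFS path prefixes to Cloud Storage buckets.
PREFIX_TO_BUCKET = {
    "/warehouse": "example-warehouse-bucket",
    "/user": "example-user-bucket",
}


def to_cloud_uri(hdfs_uri: str) -> str:
    """Translate an hdfs:// URI into a gs:// URI using the prefix mapping."""
    parts = urlsplit(hdfs_uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri}")
    for prefix, bucket in PREFIX_TO_BUCKET.items():
        # Match whole path components only, so '/warehouse2' is not
        # mistaken for '/warehouse'.
        if parts.path == prefix or parts.path.startswith(prefix + "/"):
            relative = parts.path[len(prefix):]
            return f"gs://{bucket}{relative}"
    raise KeyError(f"no bucket mapped for path: {parts.path}")


if __name__ == "__main__":
    print(to_cloud_uri("hdfs://namenode:8020/warehouse/trips/part-0.parquet"))
```

Keeping the mapping in one place means the translation can change as data moves, while the paths that users write in their jobs stay the same.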
Common Pitfalls
1. Failing to account for performance differences between the object store and HDFS can lead to unexpected slowdowns. To avoid this, teams should leverage existing Hadoop connectors and optimize them for cloud environments.
2. Not managing cloud usage costs effectively can result in inflated expenses. Implementing fine-grained cost tracking and leveraging cloud elasticity can help mitigate this risk.
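Fine-grained cost tracking can be as simple as attributing each usage record to an owner and comparing totals against budgets. The sketch below assumes a hypothetical `UsageRecord` shape, made-up team names, and invented budget figures; it is not tied to any specific billing export format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical usage record attributed to an owning team."""
    team: str
    resource: str
    cost_usd: float


def cost_by_team(records: Iterable[UsageRecord]) -> Dict[str, float]:
    """Sum cost per team across all usage records."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.team] += record.cost_usd
    return dict(totals)


def over_budget(totals: Dict[str, float], budgets: Dict[str, float]) -> List[str]:
    """Return teams whose total exceeds their budget (a missing budget counts as 0)."""
    return sorted(team for team, total in totals.items()
                  if total > budgets.get(team, 0.0))


if __name__ == "__main__":
    records = [
        UsageRecord("analytics", "storage", 10.0),
        UsageRecord("analytics", "compute", 5.0),
        UsageRecord("ml-training", "compute", 40.0),
    ]
    totals = cost_by_team(records)
    print(totals)
    print(over_budget(totals, {"analytics": 20.0, "ml-training": 30.0}))
```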