Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform

Abhi Khune, Arun Mahadeva Iyer, Sahana Bhat, Matt Mathew

Uber

7 min read • advanced

Apache, Apache Spark, Google Cloud, Google Cloud Storage, SQL

Overview

Uber is modernizing its batch data infrastructure by migrating to Google Cloud Platform (GCP) to enhance data analytics and machine learning capabilities. This transition aims to improve user productivity, engineering velocity, cost efficiency, and data governance.

What You'll Learn

1. How to leverage cloud IaaS for seamless migration of batch data stacks
2. Why using open standards like Apache Parquet and Apache Hudi enhances data compatibility
3. How to implement data access proxies for federating query traffic during migration

Prerequisites & Requirements

Understanding of cloud computing concepts and data management
Familiarity with GCP services and the Hadoop ecosystem (optional)

Key Questions Answered

What are the core principles for migrating Uber's data infrastructure to GCP?

The core principles include avoiding painful migrations for data users by maintaining existing workflows, enhancing data access proxies to manage traffic across on-prem and cloud, leveraging existing cloud-agnostic infrastructure, and forecasting potential data governance issues from cloud services.

How does Uber plan to manage data replication during the migration?

Uber will use HiveSync, a permissions-aware, bi-directional data replication service, to ensure that data lakes in both regions remain synchronized. This includes both bulk migration and ongoing incremental updates until the cloud-based stack becomes the primary data source.
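The bulk-then-incremental pattern described above can be illustrated with a small sketch. This is not HiveSync's implementation: the `Lake` class, the per-file version counter, and the checkpoint dictionary are all hypothetical, and the sketch copies in one direction only and ignores permissions.

```python
from dataclasses import dataclass, field


@dataclass
class Lake:
    """Toy stand-in for a data lake: maps a file path to (version, payload)."""
    files: dict = field(default_factory=dict)

    def write(self, path, payload):
        # Each write bumps the file's version so changes can be detected later.
        version = self.files.get(path, (0, None))[0] + 1
        self.files[path] = (version, payload)


def bulk_copy(source, target):
    """Copy every file once and return a checkpoint of the versions copied."""
    checkpoint = {}
    for path, (version, payload) in source.files.items():
        target.files[path] = (version, payload)
        checkpoint[path] = version
    return checkpoint


def incremental_sync(source, target, checkpoint):
    """Copy only files that are new or changed since the checkpoint."""
    copied = []
    for path, (version, payload) in source.files.items():
        if checkpoint.get(path, 0) < version:
            target.files[path] = (version, payload)
            checkpoint[path] = version
            copied.append(path)
    return sorted(copied)
```

After one `bulk_copy`, repeated calls to `incremental_sync` move only what changed, which is what keeps the ongoing cost of staying in sync proportional to the write rate rather than to the size of the lake.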

What challenges does Uber anticipate during the migration to GCP?

Uber anticipates challenges such as performance differences between object storage and HDFS, governance of cloud usage costs, and the need to migrate non-analytics uses of HDFS to other storage solutions. It plans to address these through proactive management and by leveraging cloud elasticity.

Key Statistics & Figures

Data hosted in Hadoop ecosystem

more than 1 exabyte

This data is distributed across tens of thousands of servers in Uber's two regions.

Technologies & Tools

Cloud Service

Google Cloud Platform

Used for migrating Uber's batch data analytics and ML training stack.

Data Format

Apache Parquet

Utilized for data storage to ensure compatibility.

Data Format

Apache Hudi

Used for managing data lakes and ensuring data consistency.

Data Processing

Apache Spark

Employed for data analytics and processing tasks.

Query Engine

Presto

Used for federating query traffic during the migration.
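To make the idea of federating query traffic concrete, here is a minimal sketch of a proxy-side routing decision. It is an assumption-laden illustration, not Uber's proxy or a Presto API: the `QueryRouter` class, the `cloud_percent` setting, and the cluster names are hypothetical. The sketch hashes the user name so that a given user is routed consistently while the cloud share is raised step by step.

```python
import hashlib


class QueryRouter:
    """Hypothetical proxy that sends a fixed share of query traffic to the cloud."""

    def __init__(self, cloud_percent):
        if not 0 <= cloud_percent <= 100:
            raise ValueError("cloud_percent must be between 0 and 100")
        self.cloud_percent = cloud_percent

    def route(self, user):
        # Hash the user name so the same user always maps to the same bucket.
        digest = hashlib.sha256(user.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return "cloud" if bucket < self.cloud_percent else "on_prem"
```

Because the bucket for a user never changes, raising `cloud_percent` only moves users from on-prem to cloud and never back, so a rollout can be widened gradually without users flipping between clusters.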

Key Actionable Insights

1. Implementing a cloud storage connector that supports HDFS compatibility can significantly ease the migration process. This approach allows teams to continue using familiar tools and workflows, minimizing disruption during the transition to GCP.
2. Utilizing open standards like Apache Parquet and Apache Hudi can enhance data compatibility across platforms. This ensures that data formats remain consistent and accessible, facilitating smoother integrations and analytics.
3. Establishing clear IAM policies during the migration will help manage access and governance effectively. This proactive measure can prevent potential data governance issues that often arise when transitioning to cloud services.
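One way to picture the first insight, keeping existing HDFS paths usable once the data lives in object storage, is a path translation step. The sketch below is purely illustrative and does not describe how any real connector works internally: the `translate` function, the prefix-to-bucket mapping, and the bucket names are hypothetical; only the `hdfs://` and `gs://` URI schemes are real.

```python
from urllib.parse import urlparse

# Hypothetical mapping from HDFS path prefixes to object-store buckets.
PREFIX_TO_BUCKET = {
    "/warehouse": "example-warehouse-bucket",
    "/user": "example-user-bucket",
}


def translate(hdfs_uri, mapping=PREFIX_TO_BUCKET):
    """Rewrite an hdfs:// URI as a gs:// URI using the longest matching prefix."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {hdfs_uri}")
    path = parsed.path
    # Try longer prefixes first so nested mappings take precedence.
    for prefix in sorted(mapping, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return f"gs://{mapping[prefix]}{path[len(prefix):]}"
    raise KeyError(f"no bucket mapped for {path}")
```

Keeping the mapping in one place means jobs can go on referring to their existing paths while the storage behind them changes.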

Common Pitfalls

1. Failing to account for performance differences between Object Store and HDFS can lead to unexpected slowdowns. To avoid this, teams should leverage existing Hadoop connectors and optimize them for cloud environments.
2. Not managing cloud usage costs effectively can result in inflated expenses. Implementing fine-grained cost tracking and leveraging cloud elasticity can help mitigate this risk.
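Fine-grained cost tracking usually comes down to attributing each unit of spend to an owner. The following is a minimal sketch under stated assumptions: the record layout (`labels`, `cost_usd`), the `team` label, and the budget figures are hypothetical and do not correspond to any particular billing export format.

```python
from collections import defaultdict


def cost_by_label(usage_records, label):
    """Sum cost per value of `label`; records without it go under 'unlabeled'."""
    totals = defaultdict(float)
    for record in usage_records:
        owner = record.get("labels", {}).get(label, "unlabeled")
        totals[owner] += record["cost_usd"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return owners whose spend exceeds their budget, sorted by name."""
    # Owners with no budget entry are treated as having no limit.
    return sorted(
        owner for owner, spent in totals.items()
        if spent > budgets.get(owner, float("inf"))
    )
```

Surfacing an explicit "unlabeled" bucket is a deliberate choice here: untagged spend is often where cost governance breaks down, so it helps to make it visible rather than drop it.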