DataMesh: How Uber laid the foundations for the data lake cloud migration

Arun Mahadeva Iyer, Abhi Khune, Sahana Bhat
11 min read | Intermediate

Overview

The article discusses Uber's migration of its batch data platform to the cloud, focusing on the implementation of DataMesh principles. It highlights the challenges faced during the transition, the strategies employed for effective data management, and the benefits achieved through this migration.

What You'll Learn

1. How to effectively map HDFS files to cloud storage buckets during migration
2. Why decentralized data ownership enhances data governance and access control
3. How to automate cloud infrastructure setup for data analytics use cases

Prerequisites & Requirements

  • Understanding of cloud storage concepts and data governance
  • Familiarity with data management practices in cloud environments (optional)

Key Questions Answered

What challenges did Uber face during its data migration to the cloud?
Uber faced several challenges during its data migration, including managing cloud provider limits, ensuring optimal data mapping, and maintaining access controls. The migration also required addressing ownership changes and automating the infrastructure setup to streamline the process.
How does Uber's DataMesh service improve data management?
The DataMesh service organizes data resources in a hierarchical manner based on ownership, automates resource management, and simplifies access control. This allows teams to manage their data more efficiently while ensuring compliance with governance policies.
What are the key principles of data mesh applied in Uber's migration?
Uber applied several key data mesh principles, including decentralized data ownership, optimal data mapping to cloud storage, and improved data governance. These principles facilitate better management of data assets and enhance collaboration across teams.
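
The ownership-based hierarchy described above can be pictured as a small tree. The following is only a minimal sketch, not Uber's DataMesh implementation: the `OwnerNode` class, the `find_owner_path` helper, and the names `company`, `mobility`, `pricing`, and `fares_daily` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class OwnerNode:
    """A node in a hypothetical ownership hierarchy (for example org -> team)."""
    name: str
    children: List["OwnerNode"] = field(default_factory=list)
    datasets: List[str] = field(default_factory=list)

    def add_child(self, child: "OwnerNode") -> "OwnerNode":
        """Attach a child owner and return it so calls can be chained."""
        self.children.append(child)
        return child


def find_owner_path(
    root: OwnerNode, dataset: str, prefix: Tuple[str, ...] = ()
) -> Optional[Tuple[str, ...]]:
    """Return the chain of owners from the root down to the node that
    holds `dataset`, or None if no node in the tree owns it."""
    path = prefix + (root.name,)
    if dataset in root.datasets:
        return path
    for child in root.children:
        found = find_owner_path(child, dataset, path)
        if found is not None:
            return found
    return None


# Hypothetical hierarchy: company -> mobility -> pricing
root = OwnerNode("company")
mobility = root.add_child(OwnerNode("mobility"))
mobility.add_child(OwnerNode("pricing", datasets=["fares_daily"]))

print(find_owner_path(root, "fares_daily"))  # ('company', 'mobility', 'pricing')
```

Resolving a dataset to its chain of owners is the kind of lookup that access control and resource provisioning could hang off, since each level of the chain is a natural place to attach a policy.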

Key Statistics & Figures

Active internal users
10,000
Uber's batch data platform is used by over 10,000 active internal users, including data scientists and engineers.
HDFS storage capacity
1.5 exabytes
The platform hosts around 1.5 exabytes of Apache Hadoop Distributed File System (HDFS) data.
Daily Presto queries
500,000
The system serves over 500,000 Presto queries daily.
Daily Apache Spark applications
370,000
There are over 370,000 Apache Spark applications running daily on the platform.

Technologies & Tools

Data Processing
Apache Hadoop
Used for distributed storage and processing of large data sets.
Cloud Infrastructure
Google Cloud Platform (GCP)
The cloud provider used for migrating Uber's batch data platform.
Data Query Engine
Presto
Used for running interactive queries on large data sets.
Data Processing
Apache Spark
Used for large-scale data processing and analytics.

Key Actionable Insights

1. Implement a path translation service to automate the migration of hard-coded paths in user workflows.
   This approach minimizes disruption during migration by allowing existing code to function without requiring manual updates, thus speeding up the transition process.
2. Consolidate security groups to streamline access control and reduce complexity.
   By consolidating security groups, organizations can simplify the management of user permissions, ensuring that access controls are both efficient and effective.
3. Utilize cloud-native features to optimize data placement and performance.
   Leveraging cloud features can help avoid hitting storage limits and improve the overall performance of data queries, which is crucial for maintaining operational efficiency.
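
To make the first insight above more concrete, here is a minimal sketch of path translation using a longest-prefix lookup. The summary does not describe how Uber's translation service works internally, so the approach, the `PREFIX_MAP` table, the bucket names, and the `translate_path` function are all assumptions made for illustration. Only the `gs://` scheme for Google Cloud Storage URIs is standard.

```python
from typing import Dict

# Hypothetical mapping from HDFS path prefixes to cloud bucket prefixes.
PREFIX_MAP: Dict[str, str] = {
    "/warehouse/team_a": "gs://example-team-a-bucket/warehouse",
    "/warehouse": "gs://example-shared-bucket/warehouse",
}


def translate_path(hdfs_path: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> str:
    """Translate an HDFS path to a cloud URI using longest-prefix match.

    Paths that match no configured prefix are returned unchanged, so
    untranslated workloads keep their original behavior.
    """
    path = hdfs_path
    # Drop an optional scheme and authority such as hdfs://namenode:8020
    if path.startswith("hdfs://"):
        rest = path[len("hdfs://"):]
        slash = rest.find("/")
        path = rest[slash:] if slash != -1 else "/"

    # Try longer prefixes first so the most specific mapping wins.
    for prefix in sorted(prefix_map, key=len, reverse=True):
        # Match whole path components only: /warehouse/team_a must not
        # capture /warehouse/team_ab.
        if path == prefix or path.startswith(prefix + "/"):
            return prefix_map[prefix] + path[len(prefix):]
    return hdfs_path


print(translate_path("hdfs://nn:8020/warehouse/team_a/t1/part-0"))
# gs://example-team-a-bucket/warehouse/t1/part-0
```

Because the lookup sits between user code and storage, workflows with hard-coded HDFS paths can keep running while the mapping table is updated centrally.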

Common Pitfalls

1. Failing to account for cloud provider limits can lead to performance bottlenecks.
   Many organizations overlook the specific quotas and limits imposed by cloud providers, which can result in degraded performance and increased costs during migration.
2. Not consolidating security groups may lead to overly complex access controls.
   If security groups proliferate without consolidation, managing user access becomes cumbersome, increasing the risk of misconfigured permissions.
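
One simple way to spot consolidation candidates is to look for groups whose membership is identical. The sketch below assumes that criterion purely for illustration; the summary does not say how Uber decided which groups to merge, and the `consolidate_groups` function and the group and user names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def consolidate_groups(groups: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Merge groups that have exactly the same members.

    Returns a new mapping in which each merged group is named after the
    sorted, '+'-joined names of the groups it replaces.
    """
    by_members = defaultdict(list)
    for name, members in groups.items():
        # frozenset ignores member order and duplicates
        by_members[frozenset(members)].append(name)

    return {
        "+".join(sorted(names)): sorted(members)
        for members, names in by_members.items()
    }


# Hypothetical groups: two of them contain the same users.
groups = {
    "a-read": ["x", "y"],
    "b-read": ["y", "x"],
    "c-read": ["z"],
}
merged = consolidate_groups(groups)
print(merged)
```

In practice a merge would also need to reconcile the permissions attached to each group, which this sketch leaves out.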

Related Concepts

Data Governance
Cloud Migration Strategies
Data Mesh Principles
Decentralized Data Ownership