I/O Observability for Uber’s Massive Petabyte-Scale Data Lake

Arnav Balyan, Kartik Bommepally, Amruth Sampath, Jing Zhao, Akshayaprakash Sharma
10 min read · advanced

Overview

The article discusses Uber's implementation of I/O observability for its massive petabyte-scale data lake, focusing on the challenges and solutions in monitoring data access patterns across its hybrid cloud architecture. It highlights the creation of a system that provides insights into data usage without requiring changes to application code, enhancing operational efficiency and cost management.

What You'll Learn

  1. How to implement real-time I/O observability for data workloads
  2. Why monitoring data access patterns is critical for cloud migration strategies
  3. How to use HiCam for aggregating high-cardinality metrics

Prerequisites & Requirements

  • Understanding of data lake architectures and cloud services
  • Familiarity with Apache Spark and Presto (optional)

Key Questions Answered

What insights does Uber's I/O observability system provide?
Uber's I/O observability system provides insights into cloud provider network egress attribution, cross-zone traffic monitoring, dataset placement for CloudLake, and a storage-level heatmap. This helps in understanding data access patterns and optimizing resource allocation.
How does Uber handle high cardinality time-series metrics?
Uber uses HiCam, a lightweight metrics aggregator, to manage high cardinality time-series metrics. HiCam aggregates metrics in memory and emits consolidated events, reducing write amplification and avoiding duplicate metrics storage.
What challenges did Uber face in achieving observability?
Uber faced challenges such as the lack of a unified mechanism for monitoring data access patterns and the need for real-time, engine-agnostic observability at the dataset partition level. This necessitated the development of a custom solution to fill these gaps.
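
One way to picture the cross-zone traffic attribution mentioned above is as a grouping over per-read I/O events. The sketch below is purely illustrative: the `IOEvent` record, its field names, and the zone labels are assumptions made for this example, not the schema or logic Uber's system uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IOEvent:
    """Hypothetical record of one read; field names are assumed for illustration."""
    dataset: str
    client_zone: str   # zone where the reading task ran
    storage_zone: str  # zone where the data was served from
    bytes_read: int


def cross_zone_bytes(events):
    """Sum bytes per dataset for reads whose client and storage zones differ."""
    totals = defaultdict(int)
    for event in events:
        if event.client_zone != event.storage_zone:
            totals[event.dataset] += event.bytes_read
    return dict(totals)


# Made-up sample events, not real traffic figures.
events = [
    IOEvent("trips", "zone-a", "zone-a", 100),   # same zone, not counted
    IOEvent("trips", "zone-a", "zone-b", 250),
    IOEvent("riders", "zone-b", "zone-a", 40),
    IOEvent("trips", "zone-c", "zone-b", 50),
]
report = cross_zone_bytes(events)
print(report)
```

Grouping by dataset rather than by application is a choice made here to mirror the dataset-level focus described above; a real system would likely attribute along several dimensions at once.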

Key Statistics & Figures

  • Daily YARN containers: 6.7 million. This metric reflects the scale at which Uber's data infrastructure operates.
  • Daily Spark applications: 400,000. This indicates the volume of data processing tasks handled by Uber's infrastructure.
  • Daily Presto queries: 2 million. This showcases the demand for data querying within Uber's data lake.

Key Actionable Insights

  1. Implementing real-time I/O observability can significantly enhance operational efficiency during cloud migrations.
     By understanding data access patterns, organizations can make informed decisions about resource allocation and workload placement, ultimately reducing costs and improving performance.
  2. Utilizing HiCam for metrics aggregation can streamline data handling and reduce storage costs.
     HiCam's ability to consolidate metrics before they reach storage helps manage high volumes of data without overwhelming the system, making it a crucial tool for organizations with large-scale data operations.
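
The idea of consolidating metrics in memory before they reach storage can be sketched as follows. This is a minimal illustration of the general pre-aggregation technique, not HiCam's actual interface: the class name `InMemoryAggregator`, its methods, and the label set (dataset, partition, engine) are all assumptions made for the example.

```python
from collections import defaultdict


class InMemoryAggregator:
    """Hypothetical pre-aggregator: sums byte counts per label tuple in memory."""

    def __init__(self):
        self._totals = defaultdict(int)

    def record(self, dataset, partition, engine, nbytes):
        # Many raw observations with the same labels collapse into one counter.
        self._totals[(dataset, partition, engine)] += nbytes

    def flush(self):
        """Emit one consolidated event per label tuple, then reset the counters."""
        consolidated = [
            {"dataset": d, "partition": p, "engine": e, "bytes": total}
            for (d, p, e), total in sorted(self._totals.items())
        ]
        self._totals.clear()
        return consolidated


# Made-up sample observations.
agg = InMemoryAggregator()
agg.record("trips", "2024-01-01", "spark", 100)
agg.record("trips", "2024-01-01", "spark", 150)   # same labels, merged with the line above
agg.record("trips", "2024-01-02", "presto", 30)

flushed = agg.flush()
for event in flushed:
    print(event)
```

Three raw observations become two emitted events here, which is the write-amplification saving in miniature. In practice `flush` would run on a timer and hand events to a downstream sink; both are left out to keep the sketch self-contained.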

Common Pitfalls

  1. Failing to account for the scale of metrics can lead to overwhelming storage requirements.
     Organizations must implement aggregation strategies like HiCam to manage high volumes of metrics efficiently, preventing storage overload and ensuring manageable data flows.

Related Concepts

Data Lake Architectures
Cloud Migration Strategies
Real-time Data Processing