I/O Observability for Uber’s Massive Petabyte-Scale Data Lake

Arnav Balyan, Kartik Bommepally, Amruth Sampath, Jing Zhao, Akshayaprakash Sharma

Uber

10 min read, advanced

Tags: Apache, Apache Spark, Google Cloud, Google Cloud Storage, Grafana, Java, MySQL, Oracle, SQL

Overview

The article discusses Uber's implementation of I/O observability for its massive petabyte-scale data lake, focusing on the challenges and solutions in monitoring data access patterns across its hybrid cloud architecture. It highlights the creation of a system that provides insights into data usage without requiring changes to application code, enhancing operational efficiency and cost management.

What You'll Learn

1. How to implement real-time I/O observability for data workloads
2. Why monitoring data access patterns is critical for cloud migration strategies
3. How to use HiCam for aggregating high-cardinality metrics

Prerequisites & Requirements

Understanding of data lake architectures and cloud services
Familiarity with Apache Spark and Presto (optional)

Key Questions Answered

What insights does Uber's I/O observability system provide?

Uber's I/O observability system provides insights into cloud provider network egress attribution, cross-zone traffic monitoring, dataset placement for CloudLake, and a storage-level heatmap. This helps in understanding data access patterns and optimizing resource allocation.
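As a rough illustration of cross-zone traffic attribution, the sketch below sums the bytes read per dataset whenever the reader and the storage sit in different zones. The event fields (`dataset`, `client_zone`, `storage_zone`, `bytes_read`) and the sample values are hypothetical; the source does not describe Uber's actual event schema.

```python
from collections import Counter
from typing import Iterable, Mapping


def cross_zone_bytes(events: Iterable[Mapping]) -> Counter:
    """Sum bytes read per dataset for reads that crossed a zone boundary."""
    totals: Counter = Counter()
    for ev in events:
        # A read is cross-zone when the client and the storage zones differ.
        if ev["client_zone"] != ev["storage_zone"]:
            totals[ev["dataset"]] += ev["bytes_read"]
    return totals


# Hypothetical sample events, for illustration only.
sample = [
    {"dataset": "trips", "client_zone": "zone-a", "storage_zone": "zone-b", "bytes_read": 500},
    {"dataset": "trips", "client_zone": "zone-a", "storage_zone": "zone-a", "bytes_read": 300},
    {"dataset": "orders", "client_zone": "zone-b", "storage_zone": "zone-a", "bytes_read": 200},
    {"dataset": "trips", "client_zone": "zone-b", "storage_zone": "zone-a", "bytes_read": 100},
]
result = cross_zone_bytes(sample)
```

Ranking the result, for example with `result.most_common()`, would surface the datasets that contribute most to cross-zone traffic and are therefore candidates for relocation.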

How does Uber handle high-cardinality time-series metrics?

Uber uses HiCam, a lightweight metrics aggregator, to manage high-cardinality time-series metrics. HiCam aggregates metrics in memory and emits consolidated events, reducing write amplification and avoiding duplicate metrics storage.
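The aggregate-then-emit idea can be sketched as follows. This is a minimal illustration of in-memory aggregation in general, not HiCam's actual implementation: the class name, the label set (`dataset`, `partition`, `engine`), and the counters are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical label set; the real HiCam schema is not described in the source.
Key = Tuple[str, str, str]  # (dataset, partition, engine)


@dataclass
class Totals:
    bytes_read: int = 0
    read_ops: int = 0


class InMemoryAggregator:
    """Accumulates counters per key and emits one consolidated event per key."""

    def __init__(self) -> None:
        self._totals: Dict[Key, Totals] = defaultdict(Totals)

    def record_read(self, dataset: str, partition: str, engine: str, nbytes: int) -> None:
        # Update the running totals in memory instead of writing one event per read.
        totals = self._totals[(dataset, partition, engine)]
        totals.bytes_read += nbytes
        totals.read_ops += 1

    def flush(self) -> List[dict]:
        # Emit a single event per key, then reset state for the next window.
        events = [
            {"dataset": d, "partition": p, "engine": e,
             "bytes_read": t.bytes_read, "read_ops": t.read_ops}
            for (d, p, e), t in sorted(self._totals.items(), key=lambda kv: kv[0])
        ]
        self._totals.clear()
        return events


agg = InMemoryAggregator()
agg.record_read("trips", "2024-01-01", "spark", 4096)
agg.record_read("trips", "2024-01-01", "spark", 8192)
agg.record_read("trips", "2024-01-02", "presto", 1024)
events = agg.flush()
```

Here three recorded reads collapse into two emitted events, one per distinct key. In a real deployment the flush would presumably run on a timer and publish to a downstream pipeline; that part is omitted because the source does not describe it.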

What challenges did Uber face in achieving observability?

Uber faced challenges such as the lack of a unified mechanism for monitoring data access patterns and the need for real-time, engine-agnostic observability at the dataset partition level. This necessitated the development of a custom solution to fill these gaps.

Key Statistics & Figures

Daily YARN containers: 6.7 million
This metric reflects the scale at which Uber's data infrastructure operates.

Daily Spark applications: 400,000
This indicates the volume of data processing tasks handled by Uber's infrastructure.

Daily Presto queries: 2 million
This showcases the demand for data querying within Uber's data lake.

Technologies & Tools

Apache Spark (Backend)
Used for processing large-scale data workloads.

Presto (Backend)
Utilized for querying data across various data sources.

Apache Flink (Backend)
Used for real-time data processing and metrics aggregation.

Apache Pinot (Database)
Employed for real-time analytics on the aggregated metrics.

Grafana (Frontend)
Used for visualizing real-time metrics and insights.

Key Actionable Insights

1. Implementing real-time I/O observability can significantly enhance operational efficiency during cloud migrations.
By understanding data access patterns, organizations can make informed decisions about resource allocation and workload placement, ultimately reducing costs and improving performance.

2. Utilizing HiCam for metrics aggregation can streamline data handling and reduce storage costs.
HiCam's ability to consolidate metrics before they reach storage helps manage high volumes of data without overwhelming the system, making it a crucial tool for organizations with large-scale data operations.

Common Pitfalls

1. Failing to account for the scale of metrics can lead to overwhelming storage requirements.
Organizations must implement aggregation strategies like HiCam to manage high volumes of metrics efficiently, preventing storage overload and ensuring manageable data flows.
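To see why metric scale is easy to underestimate, note that the number of distinct time series is bounded above by the product of the distinct values of each label. The short calculation below uses made-up label counts purely for illustration; they are not figures from Uber.

```python
import math
from typing import Mapping


def series_upper_bound(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on distinct time series: product of per-label cardinalities."""
    return math.prod(label_cardinalities.values())


# Hypothetical label cardinalities, chosen only to show how quickly the bound grows.
labels = {"dataset": 1000, "partition": 365, "engine": 3}
bound = series_upper_bound(labels)  # 1000 * 365 * 3
```

With these assumed values the bound is 1,095,000 series for a single metric, and adding one more label multiplies it again, which is why aggregating before storage matters.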

Related Concepts

Data Lake Architectures

Cloud Migration Strategies

Real-time Data Processing