Arnav Balyan, Kartik Bommepally, Amruth Sampath, Jing Zhao, Akshayaprakash Sharma • 10 min read • advanced
Overview
The article discusses Uber's implementation of I/O observability for its petabyte-scale data lake, focusing on the challenges of monitoring data access patterns across its hybrid cloud architecture and the solutions Uber built. It highlights a system that provides insight into data usage without requiring changes to application code, which improves operational efficiency and cost management.
What You'll Learn
1. How to implement real-time I/O observability for data workloads
2. Why monitoring data access patterns is critical for cloud migration strategies
3. How to use HiCam for aggregating high-cardinality metrics
Prerequisites & Requirements
- Understanding of data lake architectures and cloud services
- Familiarity with Apache Spark and Presto (optional)
Key Questions Answered
What insights does Uber's I/O observability system provide?
Uber's I/O observability system provides insights into cloud provider network egress attribution, cross-zone traffic monitoring, dataset placement for CloudLake, and a storage-level heatmap. This helps in understanding data access patterns and optimizing resource allocation.
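As a rough illustration of what cross-zone traffic attribution involves, the sketch below sums the bytes read per dataset whenever the reader and the storage sit in different zones. The record shape, field names, and zone labels are hypothetical and chosen only for this example; the summary does not describe how Uber's system represents or computes this.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ReadRecord(NamedTuple):
    """One observed read (all field names are assumptions for this sketch)."""
    dataset: str
    client_zone: str
    storage_zone: str
    bytes_read: int


def cross_zone_bytes(records: Iterable[ReadRecord]) -> Counter:
    """Total bytes per dataset for reads whose client and storage zones differ."""
    totals: Counter = Counter()
    for record in records:
        if record.client_zone != record.storage_zone:
            totals[record.dataset] += record.bytes_read
    return totals
```

Grouping by dataset is one possible choice; the same loop could key on the application or engine instead if attribution by workload is what matters.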
How does Uber handle high cardinality time-series metrics?
Uber uses HiCam, a lightweight metrics aggregator, to manage high cardinality time-series metrics. HiCam aggregates metrics in memory and emits consolidated events, reducing write amplification and avoiding duplicate metrics storage.
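The idea of aggregating in memory and emitting consolidated events can be sketched as follows. This is a minimal illustration of the general technique, not HiCam's actual implementation: the class, the event fields, and the choice of aggregation key are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IOEvent:
    """One raw I/O observation (field names are hypothetical)."""
    dataset: str
    partition: str
    engine: str
    bytes_read: int


class InMemoryAggregator:
    """Sums raw events per key and emits one consolidated record per key on flush."""

    def __init__(self) -> None:
        # key -> [total_bytes, event_count]
        self._totals: Dict[Tuple[str, str, str], List[int]] = defaultdict(lambda: [0, 0])

    def record(self, event: IOEvent) -> None:
        key = (event.dataset, event.partition, event.engine)
        bucket = self._totals[key]
        bucket[0] += event.bytes_read
        bucket[1] += 1

    def flush(self) -> List[dict]:
        """Return one consolidated record per key, then reset the in-memory state."""
        consolidated = [
            {
                "dataset": dataset,
                "partition": partition,
                "engine": engine,
                "bytes_read": total_bytes,
                "events": count,
            }
            for (dataset, partition, engine), (total_bytes, count) in sorted(self._totals.items())
        ]
        self._totals.clear()
        return consolidated
```

Calling `flush()` on a timer would turn many raw events into a single write per key per interval, which is the sense in which this kind of aggregation reduces write amplification.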
What challenges did Uber face in achieving observability?
Uber faced challenges such as the lack of a unified mechanism for monitoring data access patterns and the need for real-time, engine-agnostic observability at the dataset partition level. This necessitated the development of a custom solution to fill these gaps.
Key Statistics & Figures
Daily YARN containers
6.7 million
This metric reflects the scale at which Uber's data infrastructure operates.
Daily Spark applications
400,000
This indicates the volume of data processing tasks handled by Uber's infrastructure.
Daily Presto queries
2 million
This showcases the demand for data querying within Uber's data lake.
Technologies & Tools
Backend
Apache Spark
Used for processing large-scale data workloads.
Backend
Presto
Utilized for querying data across various data sources.
Backend
Apache Flink
Used for real-time data processing and metrics aggregation.
Database
Apache Pinot
Employed for real-time analytics on the aggregated metrics.
Frontend
Grafana
Used for visualizing real-time metrics and insights.
Key Actionable Insights
1. Implementing real-time I/O observability can significantly enhance operational efficiency during cloud migrations. By understanding data access patterns, organizations can make informed decisions about resource allocation and workload placement, ultimately reducing costs and improving performance.
2. Utilizing HiCam for metrics aggregation can streamline data handling and reduce storage costs. HiCam's ability to consolidate metrics before they reach storage helps manage high volumes of data without overwhelming the system, making it a crucial tool for organizations with large-scale data operations.
Common Pitfalls
1. Failing to account for the scale of metrics can lead to overwhelming storage requirements.
Organizations must implement aggregation strategies like HiCam to manage high volumes of metrics efficiently, preventing storage overload and ensuring manageable data flows.
Related Concepts
Data Lake Architectures
Cloud Migration Strategies
Real-time Data Processing