Hadoop statistics collection and applications

Pinterest Engineering
6 min readadvanced
--
View Original

Overview

The article discusses the design and implementation of a new statistics collection engine for Hadoop jobs at Pinterest, aimed at improving data management and analysis. It highlights the challenges of using native Hadoop counters and presents solutions for real-time alerts and data analysis applications.

What You'll Learn

1

How to design a lightweight statistics collection engine for Hadoop jobs

2

Why preserving counter history is crucial for data integrity

3

How to implement real-time alerts based on collected statistics

Prerequisites & Requirements

  • Understanding of Hadoop job architecture and data processing
  • Familiarity with AWS S3 for data storage(optional)

Key Questions Answered

What are the main challenges of using native Hadoop counters?
Native Hadoop counters are expensive due to the need for synchronization with the Hadoop job tracker, and they do not preserve counter history, making it difficult to trace trends or identify data corruption. This can lead to issues in data integrity and debugging.
How does the new statistics collection engine improve data management?
The new statistics collection engine collects job statistics at runtime, merges them with native Hadoop counters, and stores them in S3. This allows for real-time alerts, better data analysis, and the ability to trace statistics trends, enhancing overall data integrity.
What types of alerts are implemented based on collected statistics?
The article describes two types of alerts: one based on absolute values, which triggers when metrics fall below a threshold, and another based on temporal estimation, which predicts current values using recent data. This helps in identifying anomalies in job performance.

Key Statistics & Figures

Daily data output from indexing workflows
Dozens of terabytes
This volume is generated through daily Hadoop jobs managed by Pinball.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend
Hadoop
Used for processing large volumes of data through jobs managed by Pinball.
Storage
AWS S3
Used for storing job output and statistics data.
Database
Mysql
Serves applications by storing merged statistics data.

Key Actionable Insights

1
Implement a statistics collection engine at the per shard layer to reduce communication overhead with the Hadoop job tracker.
This approach minimizes the cost of data collection and enhances performance by allowing each reducer to collect statistics locally without needing to communicate with the job tracker.
2
Utilize real-time alerts to monitor job performance and data integrity effectively.
By setting up alerts based on both absolute values and temporal estimations, teams can quickly respond to potential issues, reducing the risk of data corruption and improving overall job reliability.
3
Aggregate statistics data post-job completion to enhance data analysis capabilities.
Storing both intermediate and final statistics allows for comprehensive data analysis, enabling teams to derive insights from job performance and make informed decisions based on historical trends.

Common Pitfalls

1
Relying solely on native Hadoop counters can lead to missed trends and data integrity issues.
Since native counters do not preserve history, teams may struggle to identify the root causes of anomalies, leading to costly debugging efforts.

Related Concepts

Data Integrity In Distributed Systems
Real-time Data Processing
Alerting Mechanisms In Data Workflows