Hadoop statistics collection and applications

Pinterest Engineering

•

Pinterest Engineering

•6 min read•advanced•

--

•View Original

AWSAWS S3MySQL

Overview

The article discusses the design and implementation of a new statistics collection engine for Hadoop jobs at Pinterest, aimed at improving data management and analysis. It highlights the challenges of using native Hadoop counters and presents solutions for real-time alerts and data analysis applications.

What You'll Learn

1

How to design a lightweight statistics collection engine for Hadoop jobs

2

Why preserving counter history is crucial for data integrity

3

How to implement real-time alerts based on collected statistics

Prerequisites & Requirements

Understanding of Hadoop job architecture and data processing
Familiarity with AWS S3 for data storage(optional)

Key Questions Answered

What are the main challenges of using native Hadoop counters?

Native Hadoop counters are expensive due to the need for synchronization with the Hadoop job tracker, and they do not preserve counter history, making it difficult to trace trends or identify data corruption. This can lead to issues in data integrity and debugging.

How does the new statistics collection engine improve data management?

The new statistics collection engine collects job statistics at runtime, merges them with native Hadoop counters, and stores them in S3. This allows for real-time alerts, better data analysis, and the ability to trace statistics trends, enhancing overall data integrity.

What types of alerts are implemented based on collected statistics?

The article describes two types of alerts: one based on absolute values, which triggers when metrics fall below a threshold, and another based on temporal estimation, which predicts current values using recent data. This helps in identifying anomalies in job performance.

Key Statistics & Figures

Daily data output from indexing workflows

Dozens of terabytes

This volume is generated through daily Hadoop jobs managed by Pinball.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend

Hadoop

Used for processing large volumes of data through jobs managed by Pinball.

Storage

AWS S3

Used for storing job output and statistics data.

Database

Mysql

Serves applications by storing merged statistics data.

Key Actionable Insights

1
Implement a statistics collection engine at the per shard layer to reduce communication overhead with the Hadoop job tracker.
This approach minimizes the cost of data collection and enhances performance by allowing each reducer to collect statistics locally without needing to communicate with the job tracker.

2
Utilize real-time alerts to monitor job performance and data integrity effectively.
By setting up alerts based on both absolute values and temporal estimations, teams can quickly respond to potential issues, reducing the risk of data corruption and improving overall job reliability.

3
Aggregate statistics data post-job completion to enhance data analysis capabilities.
Storing both intermediate and final statistics allows for comprehensive data analysis, enabling teams to derive insights from job performance and make informed decisions based on historical trends.

Common Pitfalls

1

Relying solely on native Hadoop counters can lead to missed trends and data integrity issues.

Since native counters do not preserve history, teams may struggle to identify the root causes of anomalies, leading to costly debugging efforts.

Related Concepts

Data Integrity In Distributed Systems

Real-time Data Processing

Alerting Mechanisms In Data Workflows

Apache Airflow is a tool for describing, executing, and monitoring workflows. At Slack, we use Airflow to orchestrate and manage our data warehouse workflows, which includes product and business metrics and also is used for different engineering use-cases (e.g. search and offline indexing). For two years we’ve been running Airflow 1.8, and it was time for…

AWSMySQLAWS S3

11 min read

Has Summary

--

These articles from Pinterest and other leading engineering teams share similar topics with "Hadoop statistics collection and applications". Explore more engineering insights on Kubernetes, FastAPI, JSON.