Behind the Pins: Building Analytics

Pinterest Engineering

•

Pinterest Engineering

•7 min read•intermediate•

--

•View Original

AWSAWS S3

Overview

The article discusses the revamped analytics tool developed by Pinterest to enhance business insights into audience engagement and content performance. It details the architecture, data processing pipeline, and user interaction features that support businesses in understanding their analytics data.

What You'll Learn

1

How to implement a scalable data processing pipeline using Hadoop MapReduce

2

Why user interaction metrics are crucial for analytics in business applications

3

How to optimize data storage and retrieval using HBase

4

When to use pre-aggregation for analytics data

Prerequisites & Requirements

Understanding of data processing concepts and analytics
Familiarity with Hadoop and HBase(optional)

Key Questions Answered

What are the main components of Pinterest's analytics architecture?

The analytics architecture consists of a scalable Hadoop MapReduce batch data processing pipeline, robust HBase data storage, API endpoints for data fetching, and a web application with an interactive UI for business users. Each component plays a crucial role in ensuring reliable data processing and accessibility.

How does Pinterest process analytics data daily?

Pinterest processes tens of terabytes of data each day using a Hadoop MapReduce pipeline that is triggered by the availability of data on AWS S3. The pipeline is designed to handle failures by allowing independent jobs to continue processing, ensuring reliability.

What types of analytics data does Pinterest provide?

Pinterest provides three datasets: domain analytics, which tracks usage data linked to a business domain; profile analytics, which tracks data related to business user Pins; and audience analytics, which focuses on user interactions with the content. This segmentation helps businesses understand their performance.

What optimizations are suggested for the data processing pipeline?

To optimize the data processing pipeline, Pinterest suggests minimizing data size, avoiding unnecessary sorting, optimizing joins, and leveraging User Defined Functions (UDFs). These strategies help improve performance and efficiency in handling large datasets.

Key Statistics & Figures

Data processed daily

Tens of terabytes

This volume of data processing is essential for providing accurate and timely analytics to businesses.

ETA for data processing pipeline

19 hours

This indicates the time required to process and make data available, highlighting the importance of optimizing the pipeline.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Data Processing

Hadoop

Used for building the scalable data processing pipeline.

Data Storage

Hbase

Serves as the backend storage for analytics data, providing low-latency access.

Data Streaming

Kafka

Used for logging event-based data.

Cloud Storage

AWS S3

Stores raw event-based data for processing.

Key Actionable Insights

1
Implement a robust data processing pipeline using Hadoop MapReduce to handle large datasets efficiently.
This is critical for businesses that require real-time analytics and insights from vast amounts of data. By ensuring scalability and reliability, companies can make informed decisions based on accurate data.

2
Utilize HBase for data storage to achieve low-latency access to analytics data.
HBase's key/value store capabilities allow for efficient data retrieval, which is essential for applications needing quick access to user interaction metrics.

3
Pre-aggregate analytics data to reduce the need for real-time processing.
By storing aggregated data at various intervals, businesses can significantly improve performance and reduce the load on their data processing systems.

Common Pitfalls

1

Failing to optimize data size can lead to bottlenecks in processing.

Large datasets can slow down the entire pipeline, making it crucial to minimize data size by only keeping necessary fields.

2

Overusing sorting operations can degrade performance.

Sorting large datasets is resource-intensive; instead, using clustering techniques can yield better performance without unnecessary overhead.