Building Pin stats

Pinterest Engineering

Overview

This article describes how Pinterest built Pin stats, a tool that gives businesses near-real-time analytics on their Pins. It covers the challenges of logging, processing, and aggregating tens of billions of events efficiently enough to deliver insights within two hours.

What You'll Learn

1. How to implement near-real-time analytics for content performance
2. Why efficient event logging is crucial for real-time data processing
3. How to aggregate data from multiple instances for comprehensive insights

Prerequisites & Requirements

  • Understanding of event logging and data processing concepts
  • Familiarity with Apache Kafka for event streaming (optional)

Key Questions Answered

What are the main challenges in building Pin stats?
The main challenges are delivering near-real-time insights by processing tens of billions of events quickly, and canonicalizing data so that businesses get a complete view of their Pins' performance. This meant cutting analytics delivery time by a factor of 18 and aggregating multiple instances of a Pin into a single stat.
How does Pinterest log events for Pin stats?
Pinterest logs events on Pins created by businesses in real time, using Apache Kafka as the log transport layer. Events are filtered so that only relevant ones are recorded, which is crucial for keeping latency low and performance high.
What storage solution does Pinterest use for Pin stats?
Pinterest uses Terrapin, an in-house low-latency serving system for large datasets. It is designed to be elastic and fault-tolerant, and it can ingest data directly from Amazon S3.
What lessons were learned during the development of Pin stats?
Key lessons include the importance of building a reliable, scalable data pipeline that can handle high event volumes, the difficulty of filtering events without slowing down logging, and the trade-off between real-time and near-real-time analytics.
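
The aggregation of multiple Pin instances into a single stat, described in the answers above, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Pinterest's implementation: the `CANONICAL_OF` mapping, the shape of the event tuples, and the function name `aggregate_by_canonical` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from each Pin instance to its canonical Pin.
# In a real system this would come from a lookup table, not a literal.
CANONICAL_OF = {
    "pin_a": "pin_a",
    "pin_a_copy1": "pin_a",
    "pin_a_copy2": "pin_a",
    "pin_b": "pin_b",
}


def aggregate_by_canonical(events, canonical_of):
    """Roll per-instance event counts up to the canonical Pin."""
    stats = defaultdict(Counter)
    for instance_id, event_type in events:
        canonical_id = canonical_of.get(instance_id)
        if canonical_id is None:
            continue  # unknown instance: skip rather than guess
        stats[canonical_id][event_type] += 1
    return {pin: dict(counts) for pin, counts in stats.items()}


# Assumed event shape: (pin_instance_id, event_type)
events = [
    ("pin_a", "impression"),
    ("pin_a_copy1", "impression"),
    ("pin_a_copy2", "click"),
    ("pin_b", "impression"),
    ("pin_unknown", "click"),
]

result = aggregate_by_canonical(events, CANONICAL_OF)
print(result)
```

Keying the counts by canonical Pin means a business sees one combined figure per Pin, rather than separate figures for each instance.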

Key Statistics & Figures

Analytics delivery time reduction
18x decrease
This was achieved by processing tens of billions of events to provide insights within two hours.
Events processed per hour
approximately four billion events/hour
This volume is handled by the hourly data ingestion pipeline to ensure timely analytics.
Daily events processed
approximately 100 billion events/day
This is the total volume handled by the daily pipeline, whose output is verified more thoroughly than the hourly pipeline's.
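
As a quick sanity check, the hourly and daily figures above can be compared with simple arithmetic. The sketch below uses only the two approximate numbers quoted in this section; the variable names are illustrative, and the per-second rate is an average derived from the hourly figure, not a number reported by the source.

```python
# Approximate figures quoted in this section.
HOURLY_EVENTS = 4_000_000_000      # "approximately four billion events/hour"
DAILY_EVENTS = 100_000_000_000     # "approximately 100 billion events/day"

# Daily volume implied by the hourly rate, and the average per-second rate.
implied_daily = HOURLY_EVENTS * 24
events_per_second = HOURLY_EVENTS / 3600

# Relative difference between the implied and the stated daily figure.
relative_gap = abs(DAILY_EVENTS - implied_daily) / DAILY_EVENTS

print(f"{implied_daily:,} events/day implied by the hourly rate")
print(f"{events_per_second:,.0f} events/second on average")
print(f"{relative_gap:.0%} gap versus the stated daily figure")
```

The hourly rate implies about 96 billion events per day, within 4% of the stated daily total, so the two rounded figures are consistent with each other.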

Technologies & Tools

Backend
Apache Kafka
Used as the log transport layer for real-time logging of events.
Backend
Terrapin
An in-house low-latency serving system for handling large datasets.

Key Actionable Insights

1. Implementing a real-time logging system can significantly enhance data analytics capabilities.
By using a system like Apache Kafka for event logging, businesses can ensure that they are capturing relevant data efficiently, which is essential for timely insights.
2. Aggregating multiple instances of data can provide a more comprehensive view of performance metrics.
This approach allows businesses to understand the full impact of their content, rather than just isolated instances, leading to better decision-making.
3. Optimizing event filtering can reduce latency in data processing.
By implementing heuristics to filter events before logging, organizations can minimize unnecessary network calls and improve the overall efficiency of their data pipeline.
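
The third insight, filtering events with a heuristic before they are logged, might look something like the sketch below. The source does not say which heuristics Pinterest uses, so everything here is an assumption for illustration: the `owner_is_business` flag, the `RELEVANT_TYPES` set, and the plain list standing in for a log producer are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    pin_id: str
    event_type: str
    owner_is_business: bool  # hypothetical flag assumed to travel with the event


# Hypothetical set of event types worth logging for analytics.
RELEVANT_TYPES = frozenset({"impression", "click", "save"})


def should_log(event):
    """Cheap local check that needs no network call or database lookup."""
    return event.owner_is_business and event.event_type in RELEVANT_TYPES


def log_events(events, sink):
    """Append only events that pass the filter; return how many were dropped."""
    dropped = 0
    for event in events:
        if should_log(event):
            sink.append(event)  # stand-in for handing the event to a log producer
        else:
            dropped += 1
    return dropped


events = [
    Event("p1", "impression", True),
    Event("p2", "impression", False),  # not a business Pin: dropped
    Event("p1", "hover", True),        # irrelevant event type: dropped
    Event("p3", "click", True),
]

sink = []
dropped = log_events(events, sink)
print(len(sink), "logged,", dropped, "dropped")
```

The point of the design is that the decision uses only data already on the event, so filtering adds no extra round trip to the logging path.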

Common Pitfalls

1. Failing to implement efficient event filtering can lead to performance bottlenecks.
Without proper filtering, the data pipeline can become overwhelmed with irrelevant events, slowing down the entire logging process and delaying insights.
2. Neglecting the need for reliable data consistency across multiple pipelines can result in inaccurate analytics.
It's crucial to ensure that divergent data pipelines maintain consistency to provide accurate and trustworthy insights to businesses.
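
One way to guard against the second pitfall is to reconcile the totals produced by two pipelines and flag Pins whose counts diverge. The sketch below is a generic illustration, not the check Pinterest runs: the function name `find_divergent_pins`, the 1% tolerance, and the sample totals are all assumptions.

```python
def find_divergent_pins(hourly_totals, daily_totals, tolerance=0.01):
    """Return Pins whose summed hourly count and daily count differ
    by more than `tolerance`, relative to the daily count."""
    divergent = {}
    for pin_id in sorted(set(hourly_totals) | set(daily_totals)):
        hourly = hourly_totals.get(pin_id, 0)
        daily = daily_totals.get(pin_id, 0)
        baseline = max(daily, 1)  # avoid dividing by zero when daily is missing
        if abs(hourly - daily) / baseline > tolerance:
            divergent[pin_id] = (hourly, daily)
    return divergent


# Hypothetical per-Pin totals: hourly counts summed over a day vs. the daily run.
hourly_totals = {"pin_a": 1000, "pin_b": 480, "pin_c": 75}
daily_totals = {"pin_a": 1004, "pin_b": 500, "pin_d": 20}

divergent = find_divergent_pins(hourly_totals, daily_totals)
print(divergent)
```

Here the daily count serves as the baseline because this summary describes the daily pipeline as the more thoroughly verified one; Pins present in only one pipeline are flagged as well.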