Real-time experiment analytics at Pinterest using Apache Flink

Pinterest Engineering

11 min read • intermediate


Overview

The article discusses Pinterest's implementation of a near real-time experimentation platform using Apache Flink to analyze thousands of experiments daily. It highlights the challenges of traditional daily metrics and presents a solution that allows for immediate feedback on experiment performance, thereby improving user experience and operational efficiency.

What You'll Learn

1. How to implement a real-time experimentation platform using Apache Flink

2. Why real-time metrics are crucial for immediate feedback on experiments

3. How to apply statistical tests for valid experiment comparisons

Prerequisites & Requirements

Understanding of real-time data processing concepts
Familiarity with Apache Flink

Key Questions Answered

How does Pinterest use Apache Flink for real-time experiment analytics?

Pinterest utilizes Apache Flink to create a real-time experimentation platform that processes user actions and experiment activations to provide immediate metrics. This allows for quick identification of issues such as significant drops in impressions or increases in user engagement, enabling timely adjustments to experiments.

What challenges does the real-time experimentation platform address?

The platform addresses the delays and inefficiencies of traditional daily metrics, which can take over 10 hours to run. By providing near real-time metrics, it helps catch bugs and performance issues quickly, thereby improving user experience and reducing potential negative impacts on top-line metrics.

What statistical methods are used for experiment validation?

The article mentions several statistical methods for validating experiment results: the t-test with Bonferroni correction, Gambler’s Ruin, Bayesian A/B testing, and the alpha-spending method. These methods help keep the false positive rate low when control and treatment groups are compared.
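To make the first of these methods concrete, here is a minimal sketch of a Bonferroni-corrected comparison. It is an illustration only, not the article's implementation: the function names are hypothetical, and because the Python standard library has no t-distribution CDF, the p-value uses a normal approximation (reasonable for large samples) in place of an exact t-test.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_p_value(control, treatment):
    """Two-sided p-value for a difference in means.

    Assumption: uses a normal (z) approximation rather than an exact
    t-distribution, and requires at least one sample to have nonzero variance.
    """
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def bonferroni_significant(p_values, alpha=0.05):
    """Flag a comparison as significant only if p < alpha / m,
    where m is the number of comparisons made."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With three comparisons and `alpha=0.05`, each p-value is checked against 0.05 / 3, so a p-value of 0.02 that would pass a single test no longer counts as significant.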

What is the role of the Numerator Computer in the pipeline?

The Numerator Computer is a KeyedProcessFunction that maintains rolling 15-minute buckets for event counts and unique user counts. It updates these counts based on incoming events and ensures accurate metrics for the experiments being analyzed.
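The bucketing idea can be sketched in plain Python, independent of Flink. This is a hedged model of the state such a function might keep per key, not the article's code: the class name, the `max_buckets` retention limit, and the use of epoch seconds for timestamps are all assumptions made for illustration.

```python
BUCKET_SECONDS = 15 * 60  # 15-minute buckets, as described above


class RollingBuckets:
    """Per-key rolling buckets of event counts and unique-user counts.

    Hypothetical stand-in for the keyed state a KeyedProcessFunction
    could maintain; it does not use any Flink API.
    """

    def __init__(self, max_buckets=4):  # retention limit is an assumed value
        self.max_buckets = max_buckets
        self.buckets = {}  # bucket start (epoch seconds) -> counts and user set

    def add(self, user_id, event_ts):
        # Align the event timestamp to the start of its 15-minute bucket.
        start = event_ts - event_ts % BUCKET_SECONDS
        bucket = self.buckets.setdefault(start, {"events": 0, "users": set()})
        bucket["events"] += 1
        bucket["users"].add(user_id)
        # Evict the oldest buckets once the retention limit is exceeded.
        while len(self.buckets) > self.max_buckets:
            del self.buckets[min(self.buckets)]

    def snapshot(self):
        """Return {bucket_start: (event_count, unique_user_count)}."""
        return {s: (b["events"], len(b["users"])) for s, b in sorted(self.buckets.items())}
```

Keeping a set of user IDs per bucket makes the unique-user count exact at the cost of memory; a production pipeline at this scale might prefer an approximate structure, which the source summary does not specify.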

Key Statistics & Figures

Average input topic volume

100G

This reflects the daily data processed by the real-time experimentation pipeline.

Number of experiment groups handled

200-300

Indicates the scale of experiments being analyzed concurrently.

Parallelism level for computation

256

This is the number of parallel subtasks the Flink job uses to process the real-time data streams.

Technologies & Tools

Stream Processing

Apache Flink

Used to build the real-time experimentation platform for processing user actions and experiment metrics.

Message Broker

Kafka

Utilized for managing streams of filtered events and experiment activations.

Database

MySQL

Serves as the backend for querying experiment metadata.

Storage

HDFS

Used for storing incremental checkpoints in the Flink application.

Key Actionable Insights

1. Implement a real-time analytics pipeline to enhance user experience by catching issues early.
By transitioning from daily metrics to real-time analytics, teams can quickly identify and rectify problems in experiments, thus minimizing negative impacts on user engagement.

2. Utilize Flink's incremental checkpointing to ensure fault tolerance in streaming applications.
This approach helps maintain data integrity and allows for recovery from failures without significant downtime, which is critical for maintaining operational efficiency.

3. Conduct regular validations of streaming results against a daily batch process.
This practice ensures that the streaming pipeline delivers accurate metrics and helps identify discrepancies that could indicate underlying issues in the data processing logic.
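The validation step in the third insight could look roughly like the sketch below. It assumes both pipelines produce a mapping from a metric key to a numeric value; the function name, the key format, and the 1% relative tolerance are hypothetical choices, not details from the article.

```python
def find_discrepancies(streaming, batch, rel_tolerance=0.01):
    """Return metrics whose streaming value differs from the batch value
    by more than rel_tolerance, plus keys present on only one side.

    Assumption: the batch result is treated as the source of truth.
    """
    mismatches = {}
    for key in sorted(set(streaming) | set(batch)):
        s = streaming.get(key)
        b = batch.get(key)
        if s is None or b is None:
            # Present in one pipeline only: always report it.
            mismatches[key] = (s, b)
            continue
        # Guard against division by zero for empty batch counts.
        denom = max(abs(b), 1)
        if abs(s - b) / denom > rel_tolerance:
            mismatches[key] = (s, b)
    return mismatches
```

Running a check like this on a schedule and alerting on a non-empty result is one way to surface drift between the two pipelines before it affects reported metrics.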

Common Pitfalls

1. Failing to manage checkpointing effectively can lead to timeout issues.

When the checkpointing process takes too long, it can cause the entire pipeline to stall. This often happens due to high back-pressure from certain subtasks, which can be mitigated by monitoring task performance and implementing capping rules.

2. Not validating streaming results against batch computations can lead to inaccuracies.

Without regular validation, discrepancies can go unnoticed, leading to incorrect metrics being reported. Establishing a robust validation workflow is essential for maintaining data integrity.

Related Concepts

Real-time Data Processing

Experimentation Platforms

Statistical Testing Methods

Fault Tolerance In Streaming Applications