Unified Flink Source at Pinterest: Streaming Data Processing

Pinterest Engineering

Overview

Pinterest uses Flink as the stream processing engine behind Xenon, its reliable and scalable stream processing platform. This article discusses the architecture, features, and benefits of the Unified Flink Source, which integrates real-time and historical data processing.

What You'll Learn

1. How to implement a unified source for both real-time and historical data in Flink
2. Why using Merced improves data retention and access in stream processing
3. How to manage traffic control in Flink applications to prevent backpressure

Prerequisites & Requirements

  • Understanding of stream processing concepts and Flink architecture
  • Familiarity with Kafka and S3 (optional)

Key Questions Answered

What is the purpose of the Unified Flink Source at Pinterest?
The Unified Flink Source at Pinterest integrates real-time and historical data processing, allowing seamless access to both data types through a single API. This design enhances the efficiency and reliability of data-driven applications by providing a consistent view of data from different sources.
How does Merced enhance data processing capabilities?
Merced allows Pinterest to maintain historical data alongside real-time data, enabling users to seek any offset or timestamp without worrying about the underlying storage system. This capability significantly reduces operational costs and improves data accessibility for various use cases.
What challenges does traffic control address in Flink applications?
Traffic control in Flink applications addresses issues related to bandwidth consumption and backpressure by regulating the flow of data from different sources. This ensures that critical data streams, like published Pins, can progress without being hindered by larger, less critical streams.
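
The idea of one API over two storage tiers can be illustrated with a small model. This is only a sketch under stated assumptions, not Pinterest's UnifiedSource or any Flink API: `Record`, `UnifiedLog`, and both method names are hypothetical, the `historical` list stands in for Merced, the `realtime` list stands in for Kafka, and records are assumed to be sorted by offset with non-decreasing timestamps.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Record:
    offset: int
    timestamp_ms: int
    value: str


class UnifiedLog:
    """Hypothetical model of one logical log backed by two tiers.

    `historical` stands in for long-term storage (Merced in the article),
    `realtime` for the streaming tier (Kafka). Both are sorted by offset
    and may overlap near the boundary.
    """

    def __init__(self, historical: List[Record], realtime: List[Record]):
        self.historical = historical
        self.realtime = realtime

    def read_from_offset(self, start_offset: int) -> Iterator[Record]:
        # Serve older offsets from the historical tier first.
        last = start_offset - 1
        for rec in self.historical:
            if rec.offset >= start_offset:
                last = rec.offset
                yield rec
        # Continue from the real-time tier, skipping offsets already served.
        for rec in self.realtime:
            if rec.offset > last:
                last = rec.offset
                yield rec

    def read_from_timestamp(self, timestamp_ms: int) -> Iterator[Record]:
        # Linear scan for brevity; a real system would use an index.
        for rec in self.read_from_offset(0):
            if rec.timestamp_ms >= timestamp_ms:
                yield rec
```

The caller never names a tier: it asks for an offset or a timestamp, and the switch from historical to real-time data happens inside `read_from_offset`.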

Key Statistics & Figures

Infrastructure cost increase when replaying historical data via Kafka: 20x
This statistic highlights the operational challenges and costs associated with using Kafka for historical data access.
Throughput improvement when reading from Merced compared to Kafka: 6x
This demonstrates the efficiency gains achieved by using Merced for data retrieval in Flink applications.

Key Actionable Insights

1. Implementing UnifiedSource can streamline your data processing architecture by allowing access to both real-time and historical data through a single API.
   This integration simplifies the development process and reduces the need for multiple data sources, making it easier for engineers to build and maintain Flink applications.
2. Utilize Merced for historical data retention to enhance your Flink applications' capabilities.
   By leveraging Merced, you can provide a consistent view of data that combines both real-time and historical elements, improving the overall efficiency of your data processing workflows.
3. Implement traffic control mechanisms to manage data flow and prevent backpressure in your Flink jobs.
   This is crucial when dealing with varying data sizes from different sources, as it helps maintain the stability and performance of your streaming applications.
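
One possible reading of traffic control across sources is event-time synchronization: always emit the record with the smallest event time, so that no stream runs far ahead of the others. The sketch below is an assumption about how this could work, not the article's implementation. The function name `synchronized_read` and the `(event_time_ms, payload)` event shape are hypothetical, and each source is assumed to be ordered by event time.

```python
import heapq
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (event_time_ms, payload), a hypothetical shape


def synchronized_read(
    sources: Dict[str, Iterable[Event]],
) -> Iterator[Tuple[str, int, str]]:
    """Yield (source_name, event_time_ms, payload) in event-time order.

    Only one pending record per source is held at a time, so a source whose
    next record is far in the future simply waits until the others catch up.
    """
    iters = {name: iter(src) for name, src in sources.items()}
    heap = []
    for name, it in iters.items():
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], name, first[1]))
    while heap:
        ts, name, payload = heapq.heappop(heap)
        yield name, ts, payload
        # Refill from the source that was just consumed.
        nxt = next(iters[name], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], name, nxt[1]))
```

Because ordering is driven by event time rather than by arrival, a large stream cannot crowd out a small one: each source advances only when its next record is the oldest one pending.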

Common Pitfalls

1. Failing to implement adequate traffic control can lead to backpressure and checkpointing failures in Flink applications.
   This often occurs when larger data streams consume excessive bandwidth, slowing down the processing of critical data. Implementing rate limiting and synchronization can help mitigate these issues.
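
Rate limiting of the kind mentioned above is commonly done with a token bucket. The following is a minimal sketch of that general technique, assuming a byte budget per source. The source does not say which algorithm Pinterest uses, and the `TokenBucket` class and its parameters are hypothetical. The clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Byte-based token bucket with an explicit clock (seconds)."""

    def __init__(self, rate_bytes_per_sec: float, capacity_bytes: float, now: float = 0.0):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes  # start with a full bucket
        self.last = now

    def try_consume(self, size_bytes: int, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        # Not enough budget: the caller should hold the record and retry later.
        return False
```

Giving a large, less critical stream its own bucket with a modest rate bounds the bandwidth it can take, which leaves room for smaller critical streams to make progress.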

Related Concepts

Stream Processing
Data Retention Strategies
Real-time Data Processing
Flink Architecture