Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot

Jacob Tsafatinos, Yuriy Bondaruk, Yupeng Fu, James Kwon
12 min readadvanced
--
View Original

Overview

This article discusses Uber's implementation of a real-time exactly-once ad event processing system using open-source technologies such as Apache Flink, Kafka, and Pinot. It highlights the challenges faced in ad event processing and how the architecture was designed to ensure speed, reliability, and accuracy.

What You'll Learn

1

How to implement a real-time event processing system using Apache Flink

2

Why exactly-once semantics are crucial in distributed systems

3

How to use Apache Pinot for real-time analytics

4

When to apply upsert functionality in data processing

Prerequisites & Requirements

  • Understanding of distributed systems and stream processing concepts
  • Familiarity with Apache Flink, Kafka, and Pinot(optional)

Key Questions Answered

How does Uber ensure exactly-once processing in their ad event system?
Uber achieves exactly-once processing by utilizing the transactional capabilities of Apache Flink and Kafka. This involves configuring Flink to use a two-phase commit protocol, ensuring that only committed messages are processed, and generating unique identifiers for records to facilitate idempotency and deduplication.
What technologies are used in Uber's ad event processing architecture?
The architecture relies on Apache Flink for stream processing, Apache Kafka for message queuing, Apache Pinot for real-time analytics, and Apache Hive for data warehousing. These technologies work together to ensure speed, reliability, and accuracy in processing ad events.
What are the main challenges in processing ad events at Uber?
The main challenges include ensuring speed, reliability, and accuracy of event processing. This involves managing the flow of events, cleansing and aggregating data, and accurately attributing events to orders while maintaining system performance.
How does the Aggregation Job work in the event processing system?
The Aggregation Job processes raw ad events from Kafka, cleanses the data, and aggregates it into minute-based buckets. It generates unique identifiers for each aggregated result to ensure idempotency and prevent duplicates in downstream systems.

Key Statistics & Figures

Ad events processed weekly
hundreds of millions
This demonstrates the scale at which Uber's ad event processing system operates, highlighting its capacity to handle large volumes of data.
Checkpoint interval
2 minutes
This interval is set to ensure timely processing of events while adhering to exactly-once semantics.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Implementing exactly-once semantics is crucial for maintaining data integrity in distributed systems. By leveraging technologies like Apache Flink and Kafka, teams can ensure that each event is processed only once, preventing data loss and inaccuracies.
This is particularly important in financial applications where data integrity directly impacts revenue and customer trust.
2
Utilizing upsert functionality in data processing can significantly enhance data management capabilities. This allows for efficient updates to existing records without duplicating data, which is essential for real-time analytics.
In scenarios where data accuracy is paramount, such as in advertising metrics, upsert can streamline the data handling process.
3
Regularly reviewing and optimizing the Flink checkpoint interval can improve system performance. A shorter checkpoint interval can reduce latency in event processing, though it may increase the load on the system.
Finding the right balance between performance and resource utilization is key to maintaining an efficient event processing system.

Common Pitfalls

1
Failing to implement idempotency can lead to data duplication and inaccuracies in reporting. Without unique identifiers for each event, the system may process the same event multiple times.
This is particularly problematic in financial applications where accurate reporting is critical. Implementing idempotency ensures that each event is counted only once, maintaining data integrity.

Related Concepts

Distributed Systems
Stream Processing
Real-time Analytics
Data Integrity