Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot

Jacob Tsafatinos, Yuriy Bondaruk, Yupeng Fu, James Kwon

Uber

•

Jacob Tsafatinos, Yuriy Bondaruk, Yupeng Fu, James Kwon

•12 min read•advanced•

--

•View Original

ApacheApache KafkaSQL

Overview

This article discusses Uber's implementation of a real-time exactly-once ad event processing system using open-source technologies such as Apache Flink, Kafka, and Pinot. It highlights the challenges faced in ad event processing and how the architecture was designed to ensure speed, reliability, and accuracy.

What You'll Learn

1

How to implement a real-time event processing system using Apache Flink

2

Why exactly-once semantics are crucial in distributed systems

3

How to use Apache Pinot for real-time analytics

4

When to apply upsert functionality in data processing

Prerequisites & Requirements

Understanding of distributed systems and stream processing concepts
Familiarity with Apache Flink, Kafka, and Pinot(optional)

Key Questions Answered

How does Uber ensure exactly-once processing in their ad event system?

Uber achieves exactly-once processing by utilizing the transactional capabilities of Apache Flink and Kafka. This involves configuring Flink to use a two-phase commit protocol, ensuring that only committed messages are processed, and generating unique identifiers for records to facilitate idempotency and deduplication.

What technologies are used in Uber's ad event processing architecture?

The architecture relies on Apache Flink for stream processing, Apache Kafka for message queuing, Apache Pinot for real-time analytics, and Apache Hive for data warehousing. These technologies work together to ensure speed, reliability, and accuracy in processing ad events.

What are the main challenges in processing ad events at Uber?

The main challenges include ensuring speed, reliability, and accuracy of event processing. This involves managing the flow of events, cleansing and aggregating data, and accurately attributing events to orders while maintaining system performance.

How does the Aggregation Job work in the event processing system?

The Aggregation Job processes raw ad events from Kafka, cleanses the data, and aggregates it into minute-based buckets. It generates unique identifiers for each aggregated result to ensure idempotency and prevent duplicates in downstream systems.

Key Statistics & Figures

Ad events processed weekly

hundreds of millions

This demonstrates the scale at which Uber's ad event processing system operates, highlighting its capacity to handle large volumes of data.

Checkpoint interval

2 minutes

This interval is set to ensure timely processing of events while adhering to exactly-once semantics.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Stream Processing

Apache Flink

Used for processing unbounded data in near real-time with exactly-once guarantees.

Message Queuing

Apache Kafka

Serves as the messaging backbone for event streaming and ensures reliable message delivery.

Real-time Analytics

Apache Pinot

Provides low-latency delivery of analytical queries and supports near-real-time data ingestion.

Data Warehousing

Apache Hive

Facilitates reading, writing, and managing large datasets for reporting and analysis.

Key Actionable Insights

1
Implementing exactly-once semantics is crucial for maintaining data integrity in distributed systems. By leveraging technologies like Apache Flink and Kafka, teams can ensure that each event is processed only once, preventing data loss and inaccuracies.
This is particularly important in financial applications where data integrity directly impacts revenue and customer trust.

2
Utilizing upsert functionality in data processing can significantly enhance data management capabilities. This allows for efficient updates to existing records without duplicating data, which is essential for real-time analytics.
In scenarios where data accuracy is paramount, such as in advertising metrics, upsert can streamline the data handling process.

3
Regularly reviewing and optimizing the Flink checkpoint interval can improve system performance. A shorter checkpoint interval can reduce latency in event processing, though it may increase the load on the system.
Finding the right balance between performance and resource utilization is key to maintaining an efficient event processing system.

Common Pitfalls

1

Failing to implement idempotency can lead to data duplication and inaccuracies in reporting. Without unique identifiers for each event, the system may process the same event multiple times.

This is particularly problematic in financial applications where accurate reporting is critical. Implementing idempotency ensures that each event is counted only once, maintaining data integrity.

Related Concepts

Distributed Systems

Stream Processing

Real-time Analytics

Data Integrity