From Batch to Streaming: Accelerating Data Freshness in Uber’s Data Lake

Xinli Shang, Peter Huang, Jing Li, Jing Zhao, Jack Song

Uber

•

Xinli Shang, Peter Huang, Jing Li, Jing Zhao, Jack Song

•6 min read•advanced•

--

•View Original

ApacheApache KafkaApache SparkMachine Learning

Overview

This article discusses Uber's transition from batch to streaming data ingestion using Apache Flink, which significantly enhances data freshness and operational efficiency. It outlines the architecture of the new ingestion system, key challenges faced during implementation, and the positive impact on data analytics across the company.

What You'll Learn

1

How to implement streaming ingestion using Apache Flink

2

Why transitioning from batch to streaming improves data freshness

3

How to address small file generation issues in streaming data

4

When to apply operational tuning for Kafka consumption

Prerequisites & Requirements

Understanding of data ingestion processes and streaming architectures
Familiarity with Apache Flink and Apache Kafka(optional)

Key Questions Answered

How does Uber achieve data freshness with streaming ingestion?

Uber's transition to streaming ingestion using Apache Flink has reduced data freshness from hours to minutes, enabling real-time analytics and faster decision-making. This shift allows various departments to utilize up-to-date data for experimentation and model development, significantly enhancing operational efficiency.

What challenges does streaming ingestion present?

Streaming ingestion can generate many small files, which degrade query performance and increase storage overhead. Additionally, partition skew can lead to inefficient data processing, while checkpoint and commit synchronization issues can cause data duplication or loss during failures.

What solutions did Uber implement to address small file issues?

Uber introduced row-group-level merging for Apache Parquet files, which operates directly on the columnar structure to avoid costly recompression. This method accelerates compaction by more than ten times, significantly improving query performance and reducing storage overhead.

How does Uber ensure the reliability of its ingestion system?

Uber's ingestion system includes a control plane that automates job management, health verification, and regional failover strategies. This design ensures consistent operation across thousands of datasets, maintaining availability and preventing data loss during outages.

Key Statistics & Figures

Data freshness improvement

Reduced from hours to minutes

This improvement allows for real-time analytics and faster decision-making across various business units.

Compute usage reduction

25% relative to batch ingestion

This reduction demonstrates the efficiency gains achieved through the new streaming ingestion system.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Stream Processing

Apache Flink

Used for real-time data ingestion and processing.

Messaging

Apache Kafka

Serves as the event source for the streaming ingestion system.

Data Storage

Apache Hudi

Provides transactional commits and time travel capabilities for data in the lake.

Data Format

Apache Parquet

Used for storing data in a columnar format to optimize query performance.

Key Actionable Insights

1
Implement streaming ingestion to enhance data freshness and operational efficiency.
By transitioning from batch to streaming ingestion, organizations can significantly reduce data latency, allowing for real-time analytics and faster decision-making processes.

2
Utilize row-group-level merging to manage small file generation effectively.
This approach minimizes the performance degradation associated with small files, ensuring that query performance remains optimal even as data volume increases.

3
Adopt operational tuning techniques to balance Kafka consumption across Flink subtasks.
Aligning parallelism with partitions and implementing round-robin polling can help mitigate issues caused by partition skew, leading to more efficient data processing.

Common Pitfalls

1

Failing to address small file generation can lead to degraded query performance.

This issue arises when streaming ingestion creates numerous small files, which can overwhelm metadata management systems and slow down data retrieval processes.

2

Neglecting checkpoint and commit synchronization can result in data duplication or loss.

When Flink checkpoints and Hudi commits become misaligned, it can cause inconsistencies in the data, leading to potential data integrity issues.

Related Concepts

Streaming Data Ingestion

Real-time Analytics

Data Lake Architecture

Operational Efficiency In Data Processing