From Batch to Streaming: Accelerating Data Freshness in Uber’s Data Lake

Xinli Shang, Peter Huang, Jing Li, Jing Zhao, Jack Song
6 min readadvanced
--
View Original

Overview

This article discusses Uber's transition from batch to streaming data ingestion using Apache Flink, which significantly enhances data freshness and operational efficiency. It outlines the architecture of the new ingestion system, key challenges faced during implementation, and the positive impact on data analytics across the company.

What You'll Learn

1

How to implement streaming ingestion using Apache Flink

2

Why transitioning from batch to streaming improves data freshness

3

How to address small file generation issues in streaming data

4

When to apply operational tuning for Kafka consumption

Prerequisites & Requirements

  • Understanding of data ingestion processes and streaming architectures
  • Familiarity with Apache Flink and Apache Kafka(optional)

Key Questions Answered

How does Uber achieve data freshness with streaming ingestion?
Uber's transition to streaming ingestion using Apache Flink has reduced data freshness from hours to minutes, enabling real-time analytics and faster decision-making. This shift allows various departments to utilize up-to-date data for experimentation and model development, significantly enhancing operational efficiency.
What challenges does streaming ingestion present?
Streaming ingestion can generate many small files, which degrade query performance and increase storage overhead. Additionally, partition skew can lead to inefficient data processing, while checkpoint and commit synchronization issues can cause data duplication or loss during failures.
What solutions did Uber implement to address small file issues?
Uber introduced row-group-level merging for Apache Parquet files, which operates directly on the columnar structure to avoid costly recompression. This method accelerates compaction by more than ten times, significantly improving query performance and reducing storage overhead.
How does Uber ensure the reliability of its ingestion system?
Uber's ingestion system includes a control plane that automates job management, health verification, and regional failover strategies. This design ensures consistent operation across thousands of datasets, maintaining availability and preventing data loss during outages.

Key Statistics & Figures

Data freshness improvement
Reduced from hours to minutes
This improvement allows for real-time analytics and faster decision-making across various business units.
Compute usage reduction
25% relative to batch ingestion
This reduction demonstrates the efficiency gains achieved through the new streaming ingestion system.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Implement streaming ingestion to enhance data freshness and operational efficiency.
By transitioning from batch to streaming ingestion, organizations can significantly reduce data latency, allowing for real-time analytics and faster decision-making processes.
2
Utilize row-group-level merging to manage small file generation effectively.
This approach minimizes the performance degradation associated with small files, ensuring that query performance remains optimal even as data volume increases.
3
Adopt operational tuning techniques to balance Kafka consumption across Flink subtasks.
Aligning parallelism with partitions and implementing round-robin polling can help mitigate issues caused by partition skew, leading to more efficient data processing.

Common Pitfalls

1
Failing to address small file generation can lead to degraded query performance.
This issue arises when streaming ingestion creates numerous small files, which can overwhelm metadata management systems and slow down data retrieval processes.
2
Neglecting checkpoint and commit synchronization can result in data duplication or loss.
When Flink checkpoints and Hudi commits become misaligned, it can cause inconsistencies in the data, leading to potential data integrity issues.

Related Concepts

Streaming Data Ingestion
Real-time Analytics
Data Lake Architecture
Operational Efficiency In Data Processing