DBEvents: A Standardized Framework for Efficiently Ingesting Data into Uber’s Apache Hadoop Data Lake

Nishith Agarwal, Ovais Tariq

Uber

Apache, Apache Kafka, Apache Spark, Cassandra, MySQL

Overview

DBEvents is a change data capture system that Uber built to ingest data efficiently into its Apache Hadoop data lake. It addresses three challenges (data freshness, data quality, and ingestion efficiency) so that Uber services can work with timely data.

What You'll Learn

1. How to implement change data capture using DBEvents

2. Why data freshness is critical for real-time applications

3. How to ensure data quality through schema management

4. When to use incremental ingestion versus full snapshots

Prerequisites & Requirements

Understanding of change data capture concepts
Familiarity with Apache Hadoop and its ecosystem (optional)

Key Questions Answered

What is DBEvents and how does it work?

DBEvents is a change data capture system designed to ingest data into Uber's Apache Hadoop data lake. It captures incremental changes from various sources like MySQL and Apache Cassandra, ensuring data freshness and quality while optimizing resource usage.

How does DBEvents ensure data freshness?

DBEvents minimizes freshness lag by incrementally applying changes to datasets in small batches. This approach allows Uber to provide timely updates, crucial for applications like fraud detection, where delays can impact user experience.

What are the key requirements for DBEvents?

DBEvents was designed with three main requirements: freshness, quality, and efficiency. Freshness ensures timely data updates, quality involves maintaining a consistent schema, and efficiency focuses on optimizing resource usage during data ingestion.

What challenges does DBEvents address in data ingestion?

DBEvents addresses challenges such as the inefficiency of full data snapshots, the pressure on upstream databases during ingestion, and the need for high-quality, well-defined schemas to avoid data swamp issues.
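
The idea of applying changes incrementally in small batches, rather than re-reading a full snapshot, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `ChangeEvent` record, its field names, and the in-memory dictionary standing in for a dataset are hypothetical and are not the DBEvents or Apache Hudi API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical changelog record; field names are illustrative only."""
    key: str                       # primary key of the affected row
    op: str                        # "upsert" or "delete"
    row: Optional[Dict[str, Any]]  # full row image for upserts, None for deletes
    ts: int                        # assumed ordering value, larger means newer


def apply_batch(table: Dict[str, Dict[str, Any]],
                batch: Iterable[ChangeEvent]) -> Dict[str, Dict[str, Any]]:
    """Apply one small batch of changes to a table keyed by primary key."""
    # Replay events in change order so the newest event for a key wins.
    for event in sorted(batch, key=lambda e: e.ts):
        if event.op == "upsert":
            table[event.key] = dict(event.row or {})
        elif event.op == "delete":
            table.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op!r}")
    return table


# Only the three changed rows are touched; the rest of the table is never re-read.
table = {"trip-1": {"status": "requested"}}
batch = [
    ChangeEvent("trip-2", "upsert", {"status": "requested"}, ts=3),
    ChangeEvent("trip-1", "upsert", {"status": "completed"}, ts=2),
    ChangeEvent("trip-2", "delete", None, ts=4),
]
apply_batch(table, batch)
```

The work done per batch scales with the number of changes rather than with the size of the table, which is the property that lets incremental ingestion reduce both freshness lag and load on the source database.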

Technologies & Tools

Backend

Apache Hadoop

Used as the data lake for storing large volumes of data ingested through DBEvents.

Backend

Apache Kafka

Serves as the message bus for transporting changelog events from various data sources.

Backend

Apache Hudi

Facilitates efficient data ingestion and management of raw datasets in HDFS.

Database

MySQL

One of the sources from which DBEvents captures and ingests data.

Database

Apache Cassandra

Another source integrated into the DBEvents framework for data ingestion.

Key Actionable Insights

1. Implement incremental ingestion to reduce resource usage and improve data freshness.
By using incremental ingestion, organizations can avoid the overhead of full table snapshots, leading to more efficient data processing and timely updates.

2. Utilize schema management services to maintain data quality.
Implementing a schema management service ensures that all ingested data adheres to defined structures, preventing data swamp scenarios and enhancing usability.

3. Monitor latency and completeness for data ingestion processes.
Establishing monitoring mechanisms helps identify delays in data availability, allowing for timely adjustments to ingestion strategies.
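
Latency and completeness monitoring can be reduced to two simple measurements, sketched below. The function names, the thresholds (a 15-minute lag budget and a 99.9% completeness floor), and the use of row counts as the completeness signal are all assumptions chosen for illustration; the source does not specify how Uber computes these metrics.

```python
from typing import List


def freshness_lag_seconds(source_commit_ts: float, lake_visible_ts: float) -> float:
    """Seconds between a change committing upstream and becoming visible in the lake."""
    return max(0.0, lake_visible_ts - source_commit_ts)


def completeness_ratio(source_rows: int, ingested_rows: int) -> float:
    """Fraction of upstream rows that have arrived; an empty source counts as complete."""
    if source_rows == 0:
        return 1.0
    return ingested_rows / source_rows


def check_ingestion(lag_s: float, ratio: float,
                    max_lag_s: float = 900.0,      # hypothetical 15-minute budget
                    min_ratio: float = 0.999) -> List[str]:
    """Return a human-readable alert for each threshold that is violated."""
    alerts = []
    if lag_s > max_lag_s:
        alerts.append(f"freshness lag {lag_s:.0f}s exceeds {max_lag_s:.0f}s")
    if ratio < min_ratio:
        alerts.append(f"completeness {ratio:.4f} below {min_ratio:.4f}")
    return alerts


lag = freshness_lag_seconds(source_commit_ts=100.0, lake_visible_ts=160.0)
ratio = completeness_ratio(source_rows=1000, ingested_rows=990)
alerts = check_ingestion(lag, ratio)
```

In this example the lag is within budget but ten rows are missing, so only the completeness alert fires.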

Common Pitfalls

1. Over-reliance on full table snapshots can lead to inefficiencies and increased load on source databases.

This occurs because full snapshots require reading entire tables, which can slow down real-time applications and lead to resource exhaustion.

2. Neglecting schema management can result in a data swamp.

Without proper schema enforcement, data can become unmanageable and unusable, leading to difficulties in data retrieval and analysis.
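
Schema enforcement at ingestion time can be sketched as a check of every record against a declared schema before it is written. This is a minimal illustration under stated assumptions: the schema is modelled as a plain mapping from field name to Python type, and `validate_record` and the field names are hypothetical. A production schema service would typically rely on a serialization format with richer types and evolution rules.

```python
from typing import Any, Dict, List

Schema = Dict[str, type]


def validate_record(record: Dict[str, Any], schema: Schema) -> List[str]:
    """Return a list of schema violations; an empty list means the record conforms."""
    errors = []
    # Every declared field must be present with the declared type.
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    # Fields the schema does not declare are rejected rather than silently stored.
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


trip_schema: Schema = {"trip_id": str, "fare_cents": int}
good = validate_record({"trip_id": "t1", "fare_cents": 1250}, trip_schema)
bad = validate_record({"trip_id": "t1", "fare_cents": "12.50", "note": "x"}, trip_schema)
```

Rejecting or quarantining records that return a non-empty error list keeps undeclared or mistyped fields out of the lake, which is what prevents the drift toward a data swamp.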

Related Concepts

Change Data Capture

Data Lake Architecture

Real-time Data Processing

Schema Management