From Archival to Access: Config-Driven Data Pipelines

Abhishek Dobliyal, Aakash Bhardwaj

Uber

Apache, Apache Spark, AWS, MySQL, Oracle, YAML

Overview

The article discusses Uber's implementation of a configuration-driven archival and retrieval framework designed to manage vast amounts of regulatory data efficiently. It highlights the challenges faced, the architecture of the solution, and the benefits of automating data workflows to ensure compliance and optimize storage costs.

What You'll Learn

1. How to implement an archival and retrieval mechanism for regulatory data
2. Why using self-descriptive formats like Parquet can prevent schema conflicts
3. When to apply lazy loading and partition pruning for efficient data access

Prerequisites & Requirements

Understanding of data lifecycle management and regulatory compliance
Familiarity with Apache Hadoop, Apache Hive, and Apache Spark (optional)

Key Questions Answered

What challenges did Uber face in managing regulatory data?

Uber faced challenges such as schema evolution, data ingestion during backfills, and the need for efficient data retrieval without overloading hot storage. These issues were addressed by implementing a robust archival and retrieval mechanism that ensures compliance and optimizes storage usage.

How does Uber's archival framework optimize data storage?

The archival framework moves infrequently accessed data out of hot storage and into cheaper cold storage, which reduces spending on hot storage. Archived regulatory data remains retrievable when it is needed.
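As a minimal sketch of this idea, assume daily partitions identified by date and a per-dataset setting called `hot_retention_days`. Both the function and the setting name are hypothetical and are not taken from Uber's framework. Partitions older than the cutoff are the ones a scheduled job would move to cold storage:

```python
from datetime import date, timedelta


def partitions_to_archive(partitions, today, hot_retention_days):
    """Return the daily partitions older than the hot-retention window.

    `hot_retention_days` is a hypothetical config field used for illustration.
    """
    cutoff = today - timedelta(days=hot_retention_days)
    return sorted(p for p in partitions if p < cutoff)


# Example: keep the last 30 days hot, archive everything older.
hot = [date(2024, 1, 1), date(2024, 3, 1), date(2024, 3, 20)]
to_move = partitions_to_archive(hot, today=date(2024, 3, 31), hot_retention_days=30)
```

Here only the 2024-01-01 partition falls before the cutoff of 2024-03-01, so it is the only one selected for archival.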

What technologies are used in Uber's archival and retrieval framework?

Uber's framework utilizes technologies such as HDFS for hot storage, Terrablob as an abstraction over Amazon S3 for cold storage, and Piper for orchestrating workflows. These technologies work together to automate data management processes effectively.

How does the retrieval workflow handle multiple requests?

The retrieval workflow can process multiple requests at the same time. Each request restores a range of partitions from cold storage to hot storage. The workflow is trigger-based, so it runs in response to a request.
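One way to picture a single retrieval request is as a planning step that turns a date range into the list of partitions to restore. The sketch below is illustrative only: `plan_retrieval` and its arguments are hypothetical names, and it assumes daily partitions. It applies partition pruning (only dates in the requested range are considered) and lazy loading (partitions already in hot storage are skipped).

```python
from datetime import date, timedelta


def plan_retrieval(start, end, archived, already_hot):
    """List the partitions a request for [start, end] must restore.

    Pruning: only dates inside the requested range are considered.
    Lazy loading: partitions already present in hot storage are skipped.
    """
    days = (end - start).days + 1
    requested = [start + timedelta(days=i) for i in range(days)]
    return [d for d in requested if d in archived and d not in already_hot]


archived = {date(2023, 1, d) for d in range(1, 11)}  # ten archived daily partitions
already_hot = {date(2023, 1, 4)}                     # restored by an earlier request
plan = plan_retrieval(date(2023, 1, 3), date(2023, 1, 5), archived, already_hot)
```

With these inputs the plan contains 2023-01-03 and 2023-01-05; the partition for 2023-01-04 is skipped because it is already in hot storage.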

Key Statistics & Figures

Regulatory reports managed by CDS in 2021

65

This number surged to over 500 reports by Q2 2024, indicating a significant increase in data management needs.

Reduction in retrieval time

90%

The streamlined retrieval process has drastically cut down the time required for data requests compared to previous manual methods.

Technologies & Tools

Backend

Apache Hadoop

Used for hot storage of compliance data requiring high-throughput access.

Backend

Apache Hive

Facilitates data warehousing and querying within the archival framework.

Backend

Apache Spark

Used for processing large datasets efficiently.

Backend

Terrablob

An abstraction over Amazon S3 used for cold storage.

Orchestration

Piper

Manages scheduled and trigger-based workflows for archival and retrieval processes.

Database

MySQL

Stores metadata related to archival job details.

Key Actionable Insights

1. Implementing a configuration-driven approach can streamline data workflows and reduce manual errors.
By automating processes and allowing users to adjust configurations at the dataset level, teams can improve efficiency and compliance in data management.

2. Utilizing lazy loading and partition pruning can significantly enhance data retrieval performance.
These techniques ensure that only necessary data is restored from cold storage, optimizing resource usage and maintaining system performance.

3. Adopting self-descriptive data formats like Parquet can mitigate schema evolution issues.
This approach allows for seamless retrieval of archived data without conflicts, which is crucial for regulatory compliance.
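To illustrate the first insight, a dataset-level configuration can be modeled as a set of defaults plus per-dataset overrides. This is a sketch under stated assumptions: the field names, default values, and the `resolve_config` helper are all hypothetical and do not reflect the actual configuration schema of Uber's framework.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ArchivalConfig:
    # All field names and defaults here are hypothetical.
    hot_retention_days: int = 90
    file_format: str = "parquet"
    enabled: bool = True


DEFAULTS = ArchivalConfig()


def resolve_config(overrides):
    """Apply dataset-level overrides on top of the defaults, rejecting unknown keys."""
    known = {f.name for f in fields(ArchivalConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return replace(DEFAULTS, **overrides)


# A dataset that only changes the retention window inherits everything else.
trip_reports = resolve_config({"hot_retention_days": 30})
```

Rejecting unknown keys up front is one way a config-driven design can catch typos before a workflow runs, which is the kind of manual error the insight refers to.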

Common Pitfalls

1. Failing to manage schema evolution can lead to conflicts when retrieving archived data.

As schemas change, older data may not align with the latest schema, causing retrieval issues. Using self-descriptive formats and schema mapping can help mitigate these conflicts.

2. Inconsistent data ingestion during active backfills can result in missing or duplicate records.

Running backfill processes while archiving can cause inconsistencies. To avoid this, it's essential to use separate tables for archiving that do not interfere with backfill operations.
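For the schema evolution pitfall, the schema mapping step can be sketched in plain Python. The example assumes rows have already been read from an archived file using the schema stored with that file, and that the latest schema is available as a list of column names. The helper `align_to_latest` and the column names are hypothetical.

```python
def align_to_latest(rows, latest_columns):
    """Project archived rows onto the latest column list.

    Columns added after the data was archived are filled with None;
    columns that no longer exist in the latest schema are dropped.
    """
    return [{col: row.get(col) for col in latest_columns} for row in rows]


# Rows as they might be read from an old file whose embedded schema
# has `legacy_code` but not the newer `region` column.
old_rows = [{"trip_id": 1, "fare": 12.5, "legacy_code": "A"}]
aligned = align_to_latest(old_rows, ["trip_id", "fare", "region"])
```

This sketch handles only added and removed columns. Renamed columns or changed types would need an explicit mapping, which is not shown here.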

Related Concepts

Data Lifecycle Management

Regulatory Compliance

Data Archiving Strategies

Data Retrieval Techniques