Migrating a Trillion Entries of Uber’s Ledger Data from DynamoDB to LedgerStore

Raghav Gautam, Erik Seaberg, Abhishek Kanhar

Uber


Apache, Apache Spark, AWS, AWS S3, DynamoDB, Java, Scala

Overview

This article details Uber's migration of over a trillion entries of ledger data from DynamoDB to LedgerStore, focusing on the challenges, strategies, and outcomes of the process. It emphasizes the importance of immutability, cost savings, and the need for a seamless transition without service disruption.

What You'll Learn

1. How to migrate large datasets without downtime
2. Why immutability is crucial for ledger-style databases
3. How to implement effective data validation strategies during migration

Prerequisites & Requirements

Understanding of database migration principles
Familiarity with Apache Spark for data processing (optional)

Key Questions Answered

What were the main reasons for migrating from DynamoDB to LedgerStore?

The migration was driven by three goals: lower cost, a store better suited to immutable ledger data, and a simpler storage architecture that consolidates data management into a single system. The change was also intended to improve performance and reduce operational complexity.

How did Uber ensure data integrity during the migration process?

Uber employed shadow validation and offline validation techniques to ensure data integrity. Shadow validation compared responses from the old and new systems during migration, while offline validation involved comparing complete datasets to identify and backfill any missing records.
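
As a rough illustration of shadow validation, the sketch below serves every read from the old system, issues the same read against the new system, and records any key whose responses differ. The `ShadowReader` class, the in-memory dictionaries standing in for the two stores, and the sample keys are all hypothetical; this is not Uber's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Record = Optional[dict]


@dataclass
class ShadowReader:
    """Hypothetical wrapper: answers from the old store, compares with the new one."""
    read_old: Callable[[str], Record]
    read_new: Callable[[str], Record]
    mismatches: List[str] = field(default_factory=list)

    def get(self, key: str) -> Record:
        primary = self.read_old(key)  # the caller always receives the old system's answer
        try:
            shadow = self.read_new(key)
            if shadow != primary:
                self.mismatches.append(key)
        except Exception:
            # A failure in the new system must never affect the caller.
            self.mismatches.append(key)
        return primary


# In-memory stand-ins for the two stores (assumption for illustration only).
old_store = {"txn-1": {"amount": 500}, "txn-2": {"amount": 725}}
new_store = {"txn-1": {"amount": 500}}  # txn-2 has not been backfilled yet

reader = ShadowReader(old_store.get, new_store.get)
results = [reader.get(k) for k in ("txn-1", "txn-2")]
print(reader.mismatches)
```

Because the caller only ever sees the old system's response, discrepancies can be collected and fixed without any user-facing impact.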

What challenges did Uber face during the backfill process?

Challenges included managing scalability, ensuring fault tolerance, and handling data quality issues. The team had to implement incremental backfills to avoid overwhelming the system and to ensure that data was accurately written without causing service disruptions.
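
One way to picture an incremental, fault-tolerant backfill is a loop that copies a bounded batch, records a checkpoint, and can resume from that checkpoint after a failure. The sketch below assumes records are keyed by sortable strings and uses plain dictionaries in place of the real stores; `backfill_batch` and `run_backfill` are hypothetical names, and the article's actual jobs ran on Apache Spark.

```python
from typing import Dict, Optional


def backfill_batch(source: Dict[str, int], target: Dict[str, int],
                   checkpoint: Optional[str], batch_size: int) -> Optional[str]:
    """Copy up to batch_size records whose keys sort after the checkpoint."""
    keys = sorted(k for k in source if checkpoint is None or k > checkpoint)[:batch_size]
    for key in keys:
        # Writes are idempotent: re-copying an immutable record is harmless,
        # so a batch can safely be retried after a crash.
        target[key] = source[key]
    return keys[-1] if keys else checkpoint


def run_backfill(source: Dict[str, int], target: Dict[str, int],
                 batch_size: int, checkpoint: Optional[str] = None) -> Optional[str]:
    """Run batches until no records remain; return the final checkpoint."""
    while True:
        new_checkpoint = backfill_batch(source, target, checkpoint, batch_size)
        if new_checkpoint == checkpoint:  # nothing left to copy
            return checkpoint
        checkpoint = new_checkpoint


source = {f"entry-{i:03d}": i for i in range(10)}
target: Dict[str, int] = {}

first_checkpoint = backfill_batch(source, target, None, batch_size=4)
# Simulate a restart: resume from the saved checkpoint instead of starting over.
final_checkpoint = run_backfill(source, target, batch_size=4, checkpoint=first_checkpoint)
```

Keeping each batch small bounds the load any single step places on the target system, and the checkpoint bounds how much work is repeated after a failure.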

What strategies were used for rate control during the backfill?

Uber implemented rate control mechanisms to manage the backfill job's speed, allowing adjustments based on current system load. This included using Guava's RateLimiter to ensure consistent performance and prevent overwhelming the system during high traffic periods.
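
The article names Guava's RateLimiter, which is a Java library. The Python sketch below is not Guava; it is a simplified stand-in that spaces permits evenly and lets an operator change the rate while the job is running. `AdjustableRateLimiter`, `set_rate`, and `SimulatedClock` are hypothetical names, and the simulated clock exists only so the example runs instantly.

```python
import time


class AdjustableRateLimiter:
    """Simplified stand-in for a rate limiter: evenly spaced permits, adjustable rate."""

    def __init__(self, permits_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / permits_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_free = clock()  # earliest time the next permit may be granted

    def set_rate(self, permits_per_second):
        # Applies to permits scheduled after the next one.
        self._interval = 1.0 / permits_per_second

    def acquire(self):
        """Block until a permit is available; return the seconds waited."""
        now = self._clock()
        wait = max(0.0, self._next_free - now)
        if wait > 0:
            self._sleep(wait)
        self._next_free = max(now, self._next_free) + self._interval
        return wait


class SimulatedClock:
    """Fake clock so the demo does not actually sleep (illustration only)."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = SimulatedClock()
limiter = AdjustableRateLimiter(10, clock=clock.time, sleep=clock.sleep)
fast_waits = [limiter.acquire() for _ in range(3)]  # 10 writes per second

limiter.set_rate(5)  # production traffic is high: slow the backfill down
slow_waits = [limiter.acquire() for _ in range(2)]
```

In a real job, each write to the target store would be preceded by `acquire()`, and `set_rate` would be driven by whatever load signal the operators trust.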

Key Statistics & Figures

Total entries migrated

over a trillion

This figure represents the scale of the data migration effort undertaken by Uber.

Compressed size of immutable records

1.2 PB

This statistic highlights the volume of data being handled during the migration.

Uncompressed size of secondary indexes

0.5 PB

This indicates the additional storage requirements for indexing the migrated data.

Technologies & Tools

Database

DynamoDB

Initially used for storing Uber's ledger data before migration.

Database

LedgerStore

The target database for the migration, designed for ledger-style data.

Data Processing

Apache Spark

Used for handling the backfill and validation processes during migration.

Key Actionable Insights

1. Implement shadow validation during data migrations to ensure accuracy and completeness.
Shadow validation allows for real-time comparison between old and new data sources, helping to identify discrepancies early in the migration process.

2. Utilize offline validation to address issues with rarely accessed historical data.
This method ensures that all records, especially those not frequently accessed, are validated and backfilled correctly, preventing potential data integrity issues.

3. Adopt a phased rollout strategy for new systems to mitigate risk.
Gradually introducing the new system allows for monitoring and adjustments based on real-time feedback, reducing the likelihood of major disruptions.
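
A phased rollout is often implemented by routing a configurable percentage of traffic to the new system. The sketch below shows one common approach, deterministic hash bucketing, under the assumption that requests carry a stable string key; the article does not say how Uber selected traffic, and `bucket` and `use_new_system` are hypothetical names.

```python
import hashlib


def bucket(key: str) -> int:
    """Map a key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_system(key: str, rollout_percent: int) -> bool:
    """Route a key to the new system once its bucket falls under the rollout percentage."""
    return bucket(key) < rollout_percent


keys = [f"account-{i}" for i in range(1000)]
at_10 = {k for k in keys if use_new_system(k, 10)}
at_50 = {k for k in keys if use_new_system(k, 50)}
```

Because the bucket depends only on the key, raising the percentage adds new keys without moving any that were already routed to the new system, which keeps behavior stable for those callers as the rollout widens.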

Common Pitfalls

1. Failing to implement effective rate control can lead to system overload during backfills.
Without proper rate limiting, backfill jobs can generate excessive load, potentially causing service disruptions. It's crucial to monitor system performance and adjust the rate of data processing accordingly.

2. Neglecting to validate historical data can result in undetected data integrity issues.
If validation focuses only on recent data, older records may contain errors that go unnoticed, leading to long-term data quality problems. Comprehensive validation strategies must include all data, regardless of access frequency.
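
To make the idea of validating the complete dataset concrete, the sketch below compares full exports from both systems, lists records that are missing or different in the new one, and backfills the missing ones. Plain dictionaries stand in for the exports, and `diff_datasets` and `backfill_missing` are hypothetical names; at the scale the article describes, this comparison was run as an Apache Spark job rather than in memory.

```python
from typing import Dict, List, Tuple


def diff_datasets(old: Dict[str, dict],
                  new: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Return (keys missing from new, keys whose records differ)."""
    missing = sorted(k for k in old if k not in new)
    mismatched = sorted(k for k in old if k in new and new[k] != old[k])
    return missing, mismatched


def backfill_missing(old: Dict[str, dict], new: Dict[str, dict],
                     missing: List[str]) -> None:
    """Copy records that never reached the new system."""
    for key in missing:
        new[key] = old[key]


old_export = {"e1": {"amount": 100}, "e2": {"amount": 250}, "e3": {"amount": 75}}
new_export = {"e1": {"amount": 100}, "e3": {"amount": 70}}

missing, mismatched = diff_datasets(old_export, new_export)
backfill_missing(old_export, new_export, missing)

# Re-run the comparison: missing records are fixed, mismatches remain for investigation.
remaining_missing, remaining_mismatched = diff_datasets(old_export, new_export)
```

Unlike shadow validation, this check covers every record regardless of whether anyone has read it recently. In this sketch, mismatched records are deliberately left untouched so they can be investigated rather than silently overwritten.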

Related Concepts

Database Migration Strategies

Data Validation Techniques

Cost Management In Cloud Databases