Building Uber’s Data Lake: Batch Data Replication Using HiveSync

Radhika Patwari, Trivedhi Talakola, Rajan Jaiswal, Chayanika Bhandary, Mukesh Verma, Sanjay Sundaresan
14 min read, advanced

Overview

This article discusses the architecture and implementation of Uber's HiveSync, a critical service for data replication across its massive data lake. It highlights how HiveSync ensures data consistency and availability across regions, addressing challenges in disaster recovery and operational efficiency.

What You'll Learn

1. How to implement a bi-directional data replication system using HiveSync
2. Why cross-region data consistency is critical for disaster recovery
3. How to optimize data replication processes to meet strict SLAs

Prerequisites & Requirements

  • Understanding of data replication concepts
  • Familiarity with Apache Hadoop and Hive (optional)

Key Questions Answered

How does HiveSync ensure data consistency across regions?
HiveSync uses a bi-directional, permissions-aware data replication mechanism to keep regions consistent. It monitors Hive events and replicates the corresponding changes as they occur, so that the primary and secondary data centers hold synchronized data, which is crucial for disaster recovery.

What are the performance metrics of HiveSync at Uber?
HiveSync operates at a scale of 800,000 Hive tables and approximately 300 PB of data. It handles over 5 million Hive DDL/DML events daily and replicates about 8 PB of data each day.

What is the role of the One-Time Replication Service (OTRS) in HiveSync?
The One-Time Replication Service (OTRS) bulk-copies large datasets when they are onboarded into HiveSync. It performs a one-off replication of historical data before regular incremental replication begins, so that the secondary region does not start out behind the primary.
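
The interplay between a one-off bulk copy and event-driven incremental replication can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not Uber's implementation: the `Region` and `ChangeEvent` classes, the `replicate` and `one_time_copy` functions, and the table names are all hypothetical, and real Hive events, permissions, and file copies are not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical stand-in for a Hive DDL/DML event."""
    table: str
    partition: str
    data: str


@dataclass
class Region:
    """Hypothetical region holding table data and a queue of local events."""
    name: str
    tables: dict = field(default_factory=dict)  # (table, partition) -> data
    events: list = field(default_factory=list)  # events not yet replicated

    def write(self, table, partition, data):
        """Local write through the metastore: store data and emit an event."""
        self.tables[(table, partition)] = data
        self.events.append(ChangeEvent(table, partition, data))

    def apply_replicated(self, event):
        """Apply a remote change without emitting an event, so the change
        is not replicated back to where it came from."""
        self.tables[(event.table, event.partition)] = event.data


def replicate(source, target):
    """Drain the source's pending events and apply them to the target."""
    applied = 0
    while source.events:
        target.apply_replicated(source.events.pop(0))
        applied += 1
    return applied


def one_time_copy(source, target):
    """Bulk-copy data the target lacks (an OTRS-like bootstrap step)."""
    missing = {k: v for k, v in source.tables.items() if k not in target.tables}
    target.tables.update(missing)
    return len(missing)


def demo():
    primary, secondary = Region("primary"), Region("secondary")
    # Historical data written before any events were being tracked.
    primary.tables[("trips", "2023-01-01")] = "historical"
    copied = one_time_copy(primary, secondary)
    # Incremental changes then flow in both directions.
    primary.write("trips", "2023-01-02", "from-primary")
    secondary.write("riders", "2023-01-02", "from-secondary")
    replicate(primary, secondary)
    replicate(secondary, primary)
    return copied, primary.tables == secondary.tables
```

In this model, the historical partition never produced an event, so event-driven replication alone would never carry it over; the bulk copy closes that gap before incremental replication takes over.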

Key Statistics & Figures

Total data managed by HiveSync: 300 PB
As of 2023, HiveSync replicates the entire data lake at Uber.

Daily Hive DDL/DML events: 5 million
This volume reflects the operational scale of HiveSync in managing data changes.

Daily data replicated: 8 PB
This statistic highlights the efficiency of HiveSync in handling large data volumes.
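
To put these figures in perspective, the following back-of-the-envelope sketch converts the daily totals into average rates. It assumes decimal petabytes (1 PB = 10^15 bytes) and a load spread evenly over 24 hours; real traffic is likely burstier, so these are averages rather than peak numbers. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_PB = 10**15  # assumption: decimal petabytes
BYTES_PER_GB = 10**9


def average_rates(events_per_day, replicated_pb_per_day, total_pb):
    """Convert daily totals into average per-second rates."""
    bytes_per_second = replicated_pb_per_day * BYTES_PER_PB / SECONDS_PER_DAY
    return {
        "events_per_second": events_per_day / SECONDS_PER_DAY,
        "gigabytes_per_second": bytes_per_second / BYTES_PER_GB,
        "daily_fraction_of_lake": replicated_pb_per_day / total_pb,
    }


rates = average_rates(
    events_per_day=5_000_000,
    replicated_pb_per_day=8,
    total_pb=300,
)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, the averages come to roughly 58 events per second and about 93 GB per second, with a little under 3% of the lake's volume replicated each day.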

Key Actionable Insights

1. A replication service such as HiveSync can significantly improve data availability and disaster recovery in multi-region architectures.
   By keeping data replicated across regions, organizations can maintain operational continuity and minimize downtime during regional outages.
2. The One-Time Replication Service (OTRS) can streamline the onboarding of large datasets into HiveSync.
   This approach prevents initial data lag in the secondary region, so that both regions are synchronized from the start.
3. Adopting strict SLAs for data replication can improve data reliability and user trust in analytics.
   Clear performance metrics help teams monitor and optimize their replication processes.
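
One way to turn a replication SLA into something a team can monitor is to track replication lag per event and report how many events met the target. The sketch below is a minimal example under stated assumptions: the lag samples and the 240-minute threshold are made-up illustration values, not figures from the article, and `percentile` and `sla_report` are hypothetical helper names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def sla_report(lags_minutes, sla_minutes, pct=99):
    """Summarize replication lag samples against an SLA threshold."""
    within = sum(1 for lag in lags_minutes if lag <= sla_minutes)
    return {
        "within_sla_ratio": within / len(lags_minutes),
        "percentile_lag": percentile(lags_minutes, pct),
        "breaches": len(lags_minutes) - within,
    }


# Made-up lag samples (minutes between a change and its replication).
sample_lags = [5, 12, 20, 30, 250]
report = sla_report(sample_lags, sla_minutes=240)
print(report)
```

Tracking a high percentile alongside the breach count is a common choice because an average can hide a small number of badly delayed tables.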

Common Pitfalls

1. Failing to maintain data consistency can lead to discrepancies between regions, especially if manual changes are made directly in HDFS.
   To avoid this, make all data modifications through Hive rather than directly in HDFS, so that they emit the Hive events HiveSync relies on to trigger replication.
2. Overlooking the importance of SLAs can result in delayed data availability, impacting analytics and decision-making.
   Establishing and adhering to strict SLAs keeps data fresh and accessible for users across regions.

Related Concepts

Data Replication Strategies
Disaster Recovery Planning
Multi-region Data Architecture