How Uber Scaled Data Replication to Move Petabytes Every Day

Abhay Yadav, Radhika Patwari, Sanjay Sundaresan

Uber

Overview

This article details how Uber optimized Apache Hadoop's Distcp (Distributed Copy) tool to scale their data replication infrastructure from handling 250 TB to petabytes of daily data movement. The key improvements include shifting resource-intensive tasks to the Application Master, parallelizing Copy Listing and Copy Committer operations, and implementing Uber jobs for small transfers, resulting in a 5x boost in incremental data replication capacity across on-premise and cloud data centers.

What You'll Learn

1

How to optimize Apache Hadoop Distcp for petabyte-scale data replication across distributed data centers

2

Why shifting resource-intensive preparation tasks from the client to the Application Master eliminates HDFS client contention and reduces job submission latency

3

How to parallelize Copy Listing and Copy Committer tasks to dramatically reduce job planning and completion times

4

When to use Hadoop's Uber jobs feature to eliminate unnecessary container launches for small data transfers

5

How to implement circuit breakers and observability metrics for large-scale data replication systems

Prerequisites & Requirements

Understanding of Apache Hadoop ecosystem including MapReduce, YARN, and HDFS
Familiarity with distributed systems concepts such as data replication, disaster recovery, and distributed file systems
Understanding of JVM concepts including threading, lock contention, and memory management
Experience with large-scale data infrastructure and data lake architectures (optional)

Key Questions Answered

How does Uber replicate petabytes of data across distributed data centers daily?

Uber uses the HiveSync service built on Airbnb's ReAir project, which leverages an optimized version of Apache Hadoop's Distcp for large-scale data replication. HiveSync supports both bulk and incremental replication across HDFS and cloud-based object storage, processing over 374,000 daily replication jobs. Key optimizations to Distcp include parallelized Copy Listing, parallelized Copy Committer, and Uber jobs for small transfers.

What is Distcp and how does its architecture work for distributed data copying?

Distcp (Distributed Copy) is an open-source framework that copies large datasets between locations using Hadoop's MapReduce framework. Its architecture includes the Distcp Tool (identifies files and creates Copy Listings), Hadoop Client (configures Input Splitting), YARN Resource Manager (schedules tasks), Application Master (monitors job lifecycle), Copy Mapper (performs actual data copying in containers), and Copy Committer (merges copied blocks into final files at the destination).

Why does HDFS client lock contention cause Distcp performance bottlenecks at scale?

As the number of Distcp executor threads increases, multiple components (Distcp worker, Distcp tool for Copy Listing, and Hadoop Client for Input Splitting) all share the same HDFS client for Remote Procedure Calls. JVM-level locks in the HDFS client create thread contention as parallelism increases, with threads getting stuck waiting for locks held by the HDFS client for RPC calls. Copy Listing alone was responsible for 90% of job submission latency due to this contention.

How do you reduce Distcp job submission latency by 90%?

By shifting the resource-intensive Copy Listing and Input Splitting tasks from the HiveSync server (client side) to the Application Master (AM). Each Distcp job then performs Copy Listing on its own AM container with its own HDFS client, which significantly reduces lock contention on the HiveSync Hadoop client. This architectural change distributes the HDFS client load across multiple containers instead of concentrating it on a single shared client.
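The effect of moving Copy Listing off the shared client can be illustrated with a toy latency model. This is only a sketch: it assumes the shared HDFS client fully serializes listing RPCs, and the `Job` type, the function names, and the listing durations are hypothetical rather than taken from HiveSync.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    listing_seconds: float  # hypothetical time spent on Copy Listing RPCs


def finish_times_shared_client(jobs):
    """Model one lock-guarded client: listings run back to back,
    so each job also waits for every job ahead of it."""
    finish, clock = {}, 0.0
    for job in jobs:
        clock += job.listing_seconds
        finish[job.name] = clock
    return finish


def finish_times_per_am_client(jobs):
    """Model one client per Application Master: no job waits on another."""
    return {job.name: job.listing_seconds for job in jobs}


jobs = [Job("a", 2.0), Job("b", 3.0), Job("c", 1.0)]
shared = finish_times_shared_client(jobs)
isolated = finish_times_per_am_client(jobs)
```

In this model the last job finishes its listing after 6 seconds on the shared client but after 1 second with its own client, which is the kind of queueing delay the architectural change removes.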

How does parallelizing Copy Listing improve Distcp performance?

Instead of sequentially calling the NameNode to create file splits, the parallelized approach assigns a separate thread to create splits for each file. Each thread handles block creation for one file and adds chunks to a blocking queue, while a separate writer thread sequentially writes blocks to the sequence file. Using 6 threads achieved a 60% reduction in p99 average listing latency and a 75% decrease in maximum latency across all HiveSync servers.
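The producer-consumer arrangement described above might look roughly like the following sketch. It is not Uber's implementation: the chunk size, the tuple layout, and the in-memory list standing in for the sequence file are assumptions. Each worker hands the writer one item containing all chunks for its file, which is one simple way to keep a file's chunks contiguous and ordered.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

CHUNK_BLOCKS = 4   # hypothetical number of blocks per chunk
_DONE = object()   # sentinel that tells the writer thread to stop


def split_file(path, num_blocks, out_queue):
    """Listing worker: build the ordered chunks for one file and enqueue them."""
    chunks = [(path, start, min(CHUNK_BLOCKS, num_blocks - start))
              for start in range(0, num_blocks, CHUNK_BLOCKS)]
    out_queue.put(chunks)  # blocks if the queue is full


def build_copy_listing(files, num_threads=6):
    """files maps path -> block count; returns entries in the order written."""
    out_queue = queue.Queue(maxsize=64)
    listing = []  # stands in for the sequence file

    def writer():
        # Single writer: appends are sequential, so no lock is needed.
        while True:
            item = out_queue.get()
            if item is _DONE:
                return
            listing.extend(item)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    try:
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            futures = [pool.submit(split_file, path, blocks, out_queue)
                       for path, blocks in files.items()]
            for future in futures:
                future.result()  # re-raise any worker failure
    finally:
        out_queue.put(_DONE)
        writer_thread.join()
    return listing


listing = build_copy_listing({"/a": 10, "/b": 3, "/c": 8})
```

Files may appear in any order in the result, but the chunks of each file stay in ascending offset order.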

What are Hadoop Uber jobs and when should you use them for Distcp?

Uber jobs are a Hadoop feature that eliminates the need to allocate separate containers for tasks by executing Copy Mapper tasks directly within the Application Master's JVM. They are ideal for small transfers—around 52% of Uber's Distcp jobs require only one mapper to copy less than 512 MB and fewer than 200 files. Enabling Uber jobs for these small jobs eliminated 268,000 single-core container launches per day, significantly improving YARN resource usage.
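The size thresholds quoted above can be expressed as a small eligibility check. This is an illustration of the rule only, not Hadoop's own decision logic (which is driven by the `mapreduce.job.ubertask.*` settings); the `CopyJob` type and the constant names are hypothetical.

```python
from dataclasses import dataclass

MAX_MAPPERS = 1                  # single-mapper jobs only
MAX_BYTES = 512 * 1024 * 1024    # "less than 512 MB"
MAX_FILES = 200                  # "fewer than 200 files"


@dataclass(frozen=True)
class CopyJob:
    mappers: int
    total_bytes: int
    file_count: int


def runs_as_uber_job(job):
    """True when the job is small enough to run inside the AM's JVM."""
    return (job.mappers <= MAX_MAPPERS
            and job.total_bytes < MAX_BYTES
            and job.file_count < MAX_FILES)


small = CopyJob(mappers=1, total_bytes=100 * 1024 * 1024, file_count=50)
large = CopyJob(mappers=8, total_bytes=40 * 1024 ** 3, file_count=12_000)
```

Here `runs_as_uber_job(small)` is true and `runs_as_uber_job(large)` is false, so only the first job would skip the separate container launch.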

How does parallelizing the Copy Committer task reduce file merge latency?

The open-source Distcp version merges file chunks sequentially after Copy Mapper tasks complete, which can take up to 30 minutes for directories with over 500,000 files. By parallelizing file concatenation with each thread responsible for merging one file at a time, using the Sequence File to identify block order, mean concatenation latency dropped by 97.29% using 10 threads. If any thread fails, the main thread halts and retries the entire Distcp job.
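A minimal sketch of the parallel merge, assuming each file's chunks carry an index that plays the role of the block order recorded in the Sequence File. The function names and the in-memory byte strings are hypothetical stand-ins for the real concat calls against the destination file system.

```python
from concurrent.futures import ThreadPoolExecutor


def concat_file(target, chunks):
    """Merge one file's chunks in index order; chunks is a list of (index, bytes)."""
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    if [index for index, _ in ordered] != list(range(len(ordered))):
        raise ValueError(f"missing or duplicate chunk for {target}")
    return target, b"".join(data for _, data in ordered)


def commit(chunks_by_file, num_threads=10):
    """Merge files in parallel, one file per task.

    Any failure propagates to the caller, which is expected to retry
    the whole job, mirroring the behaviour described above."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(concat_file, target, chunks)
                   for target, chunks in chunks_by_file.items()]
        return dict(future.result() for future in futures)


merged = commit({"/x": [(1, b"b"), (0, b"a")], "/y": [(0, b"z")]})
```

Parallelism is across files only; within a file the chunks are still joined strictly in order, so correctness does not depend on thread scheduling.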

What challenges arise when scaling Distcp for petabyte-level data replication?

Key challenges include OOM exceptions in the Application Master requiring rigorous stress testing for optimal memory configuration, YARN Queue Full errors from increased job submission rates requiring circuit breaker implementation, high network bandwidth usage from increased copy rates requiring careful tuning, and AM failures from long-running Copy Listing tasks blocking heartbeat signals. Each required specific solutions including metrics, circuit breakers, and architectural adjustments.
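The circuit breaker for queue-full errors could be sketched as below. The source does not describe Uber's implementation, so the class name, the failure threshold, the cooldown, and the `QueueFullError` exception are all assumptions chosen for illustration.

```python
import time


class QueueFullError(Exception):
    """Hypothetical error raised when the YARN queue rejects a submission."""


class SubmissionCircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow(self):
        """Closed breaker always allows; open breaker allows a probe after cooldown."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._cooldown

    def submit(self, submit_fn):
        """Run submit_fn unless submissions are paused; return True on success."""
        if not self.allow():
            return False  # paused: the job is not submitted at all
        try:
            submit_fn()
        except QueueFullError:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            return False
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return True
```

Passing the clock in as a parameter is a design choice that keeps the breaker testable without sleeping; in production the default monotonic clock would be used.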

Key Statistics & Figures

Data Lake total size

350+ PB

Uber's total Data Lake size exceeding this amount

Daily data replication volume increase

250 TB to 1 PB

Surge in daily replication volume in a single quarter (Q3 2022)

Daily replication jobs increase

10,000 to 374,000

Average daily replication jobs skyrocketed due to scaling demands

Dataset count growth

30,000 to 144,000

Number of datasets managed by HiveSync grew in one quarter as new datasets were onboarded

Distcp job submission latency reduction

90%

Achieved by shifting Copy Listing and Input Splitting from client to Application Master

Copy Listing p99 average latency reduction

60%

Achieved by parallelizing Copy Listing task using 6 threads

Copy Listing maximum latency reduction

75%

Achieved across all HiveSync servers by parallelizing with 6 threads

Mean concatenation latency reduction

97.29%

Achieved by parallelizing Copy Committer task using 10 threads

Single-mapper jobs percentage

52%

Percentage of Distcp jobs requiring only one mapper to copy less than 512 MB

Daily container launches eliminated

268,000

Single-core container launches eliminated per day by implementing Uber jobs

Incremental replication capacity increase

5x

Increase in incremental data-handling capacity across on-premise and cloud data centers in one year

Data migrated to cloud

306+ PB

Total data successfully migrated from on-premise to cloud via HiveSync

Primary data center generation share

90%

Primary on-premise data center handled 90% of data generation in the active-passive architecture

P100 replication lag SLA

4 hours

SLA target for maximum replication delay

P99.9 replication lag SLO

20 minutes

SLO target for 99.9th percentile replication delay

Technologies & Tools

Data Replication

Apache Hadoop Distcp

Core framework for distributed data copying between locations, optimized by Uber for petabyte-scale replication

Data Replication Service

HiveSync

Uber's data replication service built on Airbnb's ReAir that orchestrates bulk and incremental data replication using Distcp

Resource Management

Apache Hadoop YARN

Resource manager that schedules Distcp tasks, allocates containers, and manages Application Master lifecycle

Distributed File System

HDFS

Hadoop Distributed File System used as the on-premise data lake storage layer

Distributed Computing

Apache Hadoop MapReduce

Framework used by Distcp to parallelize and distribute copy tasks across multiple nodes via mappers

Data Warehouse

Apache Hive

Data warehouse system whose datasets are replicated across data centers via HiveSync

Open-source Replication

ReAir

Airbnb's open-source project that served as the foundation for HiveSync

Cloud Platform

GCP

Cloud platform for Uber's data lake migration from on-premise infrastructure

Key Actionable Insights

1
Move resource-intensive preparation tasks from the client to distributed worker nodes to eliminate shared-resource contention. When multiple threads compete for a single HDFS client's RPC lock, performance degrades dramatically at scale. By offloading Copy Listing and Input Splitting to individual Application Master containers, each job gets its own isolated HDFS client.
This approach reduced Distcp job submission latency by 90% at Uber. The principle applies broadly to any system where centralized clients become bottlenecks as parallelism increases.

2
Identify which operations can be safely parallelized by analyzing their ordering constraints. Copy Listing files can be processed in parallel because files are independent, but chunks within each file must remain ordered. Understanding these constraints allows targeted parallelization without breaking correctness.
Uber used a producer-consumer pattern with a blocking queue: multiple listing threads create file splits in parallel, while a single writer thread maintains ordering when writing to the sequence file. This achieved a 60% reduction in p99 listing latency with 6 threads.

3
Leverage in-process execution (the Uber jobs pattern) for small tasks that don't justify container allocation overhead. When 52% of your distributed jobs need only a single mapper to copy less than 512 MB, JVM startup and container allocation time dominate the actual work time. Running these tasks within the Application Master's existing JVM eliminates this overhead entirely.
This optimization eliminated 268,000 container launches daily at Uber. To route qualifying jobs automatically, set mapreduce.job.ubertask.enable=true and mapreduce.job.ubertask.maxmaps=1, and set mapreduce.job.ubertask.maxbytes to the byte equivalent of 512 MB.

4
Implement circuit breakers when reducing latency in distributed systems, because faster submission rates can overwhelm downstream resources. Lowering Distcp submission latency caused YARN Queue Full errors because more jobs were submitted per unit time. A circuit breaker that temporarily pauses new submissions until retries succeed prevents cascading failures.
This is a common pattern when optimizing one part of a pipeline—faster throughput upstream can create new bottlenecks downstream. Pair circuit breakers with real-time metrics to monitor and dynamically adjust queue capacity.

5
When moving long-running tasks between components in a distributed system, carefully consider heartbeat and liveness check mechanisms. Moving Copy Listing to the AM's setup phase caused timeout failures because the AM's heartbeat sender to the Resource Manager hadn't started yet, and Copy Listing could exceed the 10-minute timeout.
The fix was to move Copy Listing to the output committer's setup phase, which runs after the heartbeat sender has started. Always map task placement against the lifecycle of health-check mechanisms.

6
Add comprehensive observability metrics before and during performance optimization rollouts. Uber introduced metrics for job submission times, submission rates, Copy Listing and Copy Committer task performance, heap memory usage, and per-job copy rates. These metrics were crucial for both tuning optimizations and diagnosing production incidents.
Without detailed metrics, it would be impossible to identify that Copy Listing was responsible for 90% of job submission latency or to detect OOM issues in the Application Master. Invest in observability early to guide optimization decisions.
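As a rough illustration of the last insight, a small in-process recorder is enough to answer questions such as "what is the p99 listing latency?" or "what share of submission time is Copy Listing?". The class and metric names are hypothetical, and the percentile uses the simple nearest-rank method; a production system would normally export to a metrics backend instead.

```python
import math
from collections import defaultdict


class LatencyMetrics:
    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, name, seconds):
        """Store one latency sample under a metric name."""
        self._samples[name].append(seconds)

    def percentile(self, name, pct):
        """Nearest-rank percentile of the recorded samples."""
        data = sorted(self._samples[name])
        if not data:
            raise ValueError(f"no samples recorded for {name}")
        rank = max(1, math.ceil(pct * len(data) / 100))
        return data[rank - 1]

    def share_of_total(self, part, total):
        """Fraction of the total metric's summed time spent in the part metric."""
        return sum(self._samples[part]) / sum(self._samples[total])


metrics = LatencyMetrics()
for seconds in range(1, 101):
    metrics.record("copy_listing_seconds", seconds)
p99 = metrics.percentile("copy_listing_seconds", 99)
```

Recording both the end-to-end submission time and each phase separately is what makes a breakdown like `share_of_total` possible.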

Common Pitfalls

1

Running resource-intensive preparation tasks (Copy Listing, Input Splitting) on the client side with a shared HDFS client. As the number of parallel Distcp worker threads increases, JVM-level locks in the shared HDFS client create severe thread contention, with Copy Listing alone causing 90% of job submission latency.

The solution is to offload these tasks to individual Application Master containers so each job gets its own isolated HDFS client, eliminating the shared lock contention.

2

Placing long-running tasks in the Application Master's setup phase without considering heartbeat mechanisms. The Resource Manager expects regular heartbeat signals from the AM, but the heartbeat sender only starts after setup completes. If Copy Listing takes longer than 10 minutes during setup, the RM times out and kills the AM.

Move long-running tasks to the output committer's setup phase instead, which runs after the heartbeat sender has already started, preventing timeouts.

3

Reducing job submission latency without accounting for increased downstream pressure. Faster submission rates can overwhelm YARN queues, leading to 'YARN Queue Full' errors and cascading failures across the replication pipeline.

Implement circuit breakers that temporarily pause new submissions when YARN queues are saturated, and add metrics for real-time monitoring to dynamically adjust queue capacity.

4

Allocating full YARN containers for small data copy jobs that don't need them. About 52% of Distcp jobs at Uber required only a single mapper to copy less than 512 MB, but each still incurred container allocation and JVM startup overhead that dominated the actual copy time.

Enable Hadoop's Uber jobs feature for qualifying small jobs to execute Copy Mapper tasks directly within the Application Master's JVM, eliminating unnecessary container launches.

5

Running Copy Committer file merge operations sequentially, which becomes a severe bottleneck for large directories. For directories with over 500,000 files, sequential merge can take up to 30 minutes.

Parallelize the concatenation process with dedicated threads per file, using the Sequence File to maintain correct block ordering. Uber achieved a 97.29% reduction in mean concat latency with 10 threads.
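The heartbeat pitfall (item 2 above) comes down to simple timing: the AM is only safe if the work scheduled before its heartbeat sender starts fits inside the timeout. The sketch below models that check. The phase names and durations are hypothetical, and the model ignores everything about the real AM lifecycle except phase ordering.

```python
AM_TIMEOUT_SECONDS = 600  # the 10-minute timeout mentioned above


def seconds_before_first_heartbeat(phases, heartbeat_phase):
    """Sum the durations of all phases that run before the heartbeat sender starts."""
    elapsed = 0
    for name, seconds in phases:
        if name == heartbeat_phase:
            return elapsed
        elapsed += seconds
    raise ValueError(f"phase {heartbeat_phase!r} not found")


def am_survives(phases, heartbeat_phase, timeout=AM_TIMEOUT_SECONDS):
    """True when the first heartbeat can be sent before the timeout expires."""
    return seconds_before_first_heartbeat(phases, heartbeat_phase) < timeout


# Copy Listing (900 s here) placed in AM setup, before heartbeats begin.
listing_in_am_setup = [("am_setup_with_copy_listing", 900),
                       ("start_heartbeat_sender", 0),
                       ("output_committer_setup", 5)]

# Copy Listing moved to the output committer's setup, after heartbeats begin.
listing_in_committer_setup = [("am_setup", 5),
                              ("start_heartbeat_sender", 0),
                              ("output_committer_setup_with_copy_listing", 900)]
```

With these numbers the first layout exceeds the timeout before any heartbeat is sent, while the second sends its first heartbeat after 5 seconds regardless of how long the listing takes.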

Related Concepts

Distributed Data Replication

Apache Hadoop Ecosystem

MapReduce Framework

YARN Resource Management

HDFS Architecture

Data Lake Management

Active-passive Disaster Recovery

Thread Contention and JVM Locking

Producer-consumer Pattern With Blocking Queues

Circuit Breaker Pattern

Cloud Migration Strategies

Distributed Systems Observability

NameNode RPC Optimization

Container Orchestration Overhead

Data Pipeline Scalability