Robust Database Backup Recovery at Uber

Arjav Jain, Shivam Vijay, Debadarsini Nayak, Mohammed Khatib, Ramnik Jain

Uber

Overview

This article describes Uber's database backup and recovery system and why it matters for business continuity and disaster recovery. It covers the challenges Uber faced, the architectural improvements it made, and the Continuous Backup Continuous Recovery (CBCR) framework it implemented to improve data integrity and recovery speed.

What You'll Learn

1. How to implement a centralized backup scheduling system for databases

2. Why continuous validation of backup processes is crucial for disaster recovery

3. When to perform restore testing to ensure data integrity

Key Questions Answered

What challenges did Uber face in evolving its backup recovery system?

Uber faced several challenges, including rudimentary backup scheduling, an ad hoc recovery process, a lack of recovery drills, and outdated recovery objectives. These issues made backup workloads inefficient and recoveries slow, which called for a more robust system.

How does Uber's Continuous Backup Continuous Recovery framework work?

The Continuous Backup Continuous Recovery (CBCR) framework provides a unified experience for managing backup and restore operations. It includes centralized backup scheduling, periodic recovery testing, and technology-specific plugins for efficient snapshot management, which together help ensure data integrity and quick recovery.
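One way to picture a central framework that delegates snapshot work to technology-specific plugins is sketched below. It is a minimal illustration, not Uber's implementation: the `BackupPlugin` interface, the `Orchestrator` class, the `InMemoryPlugin` stand-in, and the cluster name `orders` are all hypothetical names introduced here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


class BackupPlugin(ABC):
    """Hypothetical per-technology plugin interface (an assumption, not Uber's API)."""

    @abstractmethod
    def take_snapshot(self, cluster: str) -> str:
        """Create a snapshot and return an identifier for it."""

    @abstractmethod
    def restore_snapshot(self, snapshot_id: str) -> bool:
        """Restore a snapshot to a scratch target; return True on success."""


class InMemoryPlugin(BackupPlugin):
    """Stand-in plugin that only records snapshot IDs, so the sketch can run."""

    def __init__(self, technology: str) -> None:
        self.technology = technology
        self.snapshots: List[str] = []

    def take_snapshot(self, cluster: str) -> str:
        snapshot_id = f"{self.technology}/{cluster}/{len(self.snapshots)}"
        self.snapshots.append(snapshot_id)
        return snapshot_id

    def restore_snapshot(self, snapshot_id: str) -> bool:
        # A real plugin would restore data; here we only check the ID is known.
        return snapshot_id in self.snapshots


@dataclass
class Orchestrator:
    """Single entry point that routes backup and restore-test requests to plugins."""

    plugins: Dict[str, BackupPlugin] = field(default_factory=dict)

    def register(self, technology: str, plugin: BackupPlugin) -> None:
        self.plugins[technology] = plugin

    def backup(self, technology: str, cluster: str) -> str:
        return self.plugins[technology].take_snapshot(cluster)

    def restore_test(self, technology: str, snapshot_id: str) -> bool:
        return self.plugins[technology].restore_snapshot(snapshot_id)


orchestrator = Orchestrator()
orchestrator.register("mysql", InMemoryPlugin("mysql"))
orchestrator.register("cassandra", InMemoryPlugin("cassandra"))

snapshot = orchestrator.backup("mysql", "orders")
restored = orchestrator.restore_test("mysql", snapshot)
```

Keeping the scheduler and restore-test logic in one place while isolating database-specific snapshot mechanics behind an interface is the design idea being illustrated; a real plugin would call the database's own snapshot tooling.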

What improvements were made to Uber's backup recovery objectives?

Uber reduced the Recovery Point Objective (RPO) from 7-21 days to 4-24 hours and improved the Recovery Time Objective (RTO), stated here as restore throughput, to 300 TB per hour. These optimizations were achieved through enhancements in scheduling and infrastructure.
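An RPO only holds if the newest backup of each database is younger than the RPO window. A minimal sketch of that check follows, assuming a hypothetical mapping from cluster name to last successful backup time; the function name `rpo_violations` and the cluster names are illustrative and do not come from the article.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def rpo_violations(
    last_backup: Dict[str, datetime],
    rpo: timedelta,
    now: datetime,
) -> List[str]:
    """Return the clusters whose newest backup is older than the RPO window."""
    return sorted(
        cluster
        for cluster, taken_at in last_backup.items()
        if now - taken_at > rpo
    )


# Hypothetical example data: one fresh backup, one stale backup.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_backup = {
    "orders": now - timedelta(hours=3),
    "payments": now - timedelta(hours=30),
}

# With a 24-hour RPO (the upper end of the 4-24 hour range), only the
# backup that is 30 hours old is out of compliance.
stale = rpo_violations(last_backup, rpo=timedelta(hours=24), now=now)
```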

Key Statistics & Figures

Data backed up

Close to 100 PB

This amount of data is backed up from databases at different intervals to support Uber's operations.

Recovery Time Objective (RTO)

300 TB per hour

This is the improved speed at which data can be restored following the optimizations made to the backup recovery system.
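To get a feel for the 300 TB per hour figure, restore time can be estimated as data size divided by throughput. The sketch below assumes restore time scales linearly with size and that a single restore can use the full throughput; neither assumption is stated in the article, and the helper `restore_hours` is a hypothetical name.

```python
def restore_hours(size_tb: float, throughput_tb_per_hour: float = 300.0) -> float:
    """Estimate restore duration in hours, assuming linear scaling with size."""
    if throughput_tb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return size_tb / throughput_tb_per_hour


# Illustrative dataset size (not from the article): 150 TB at 300 TB/hour.
hours_for_150_tb = restore_hours(150.0)
minutes_for_150_tb = hours_for_150_tb * 60
```

Under these assumptions a 150 TB restore would take about half an hour; real restores also depend on factors such as network contention and per-host limits.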

Technologies & Tools

Database

MySQL

Used as one of the primary databases for storing online data.

Database

Apache Cassandra

Utilized for handling large-scale data storage and retrieval.

Database

etcd

Employed for distributed key-value storage and configuration management.

Database

Apache ZooKeeper

Used for coordinating distributed applications and managing configuration.

Key Actionable Insights

1. Implementing a centralized backup scheduling system can significantly improve data recovery times and reliability.
By adapting backup workloads based on network and host resources, organizations can ensure that backups do not disrupt production traffic, leading to more efficient operations.

2. Regularly testing restore processes is essential for maintaining data integrity and ensuring operational resilience.
Conducting both dedicated and random restore tests helps validate the effectiveness of backup strategies, allowing teams to identify and rectify potential issues before they impact business continuity.
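The two insights above can be sketched together: a gate that defers a backup when the host or network is busy, and a selector that combines dedicated restore targets with a random sample of the rest. This is a rough illustration under stated assumptions. The thresholds (70% CPU, 60% network), the `HostLoad` type, and both function names are hypothetical, and the article does not describe how Uber measures load or chooses restore targets.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class HostLoad:
    cpu_utilization: float      # fraction of CPU in use, 0.0 to 1.0
    network_utilization: float  # fraction of link capacity in use, 0.0 to 1.0


def can_start_backup(
    load: HostLoad,
    cpu_limit: float = 0.7,      # assumed threshold, not from the article
    network_limit: float = 0.6,  # assumed threshold, not from the article
) -> bool:
    """Allow a backup only when both CPU and network have headroom."""
    return (
        load.cpu_utilization < cpu_limit
        and load.network_utilization < network_limit
    )


def pick_restore_targets(
    dedicated: Sequence[str],
    candidates: Sequence[str],
    random_count: int,
    rng: random.Random,
) -> List[str]:
    """Return all dedicated targets plus a random sample of the other candidates."""
    # Sort the pool so that a seeded generator gives a reproducible sample.
    pool = sorted(set(candidates) - set(dedicated))
    sampled = rng.sample(pool, min(random_count, len(pool)))
    return list(dedicated) + sampled


idle_host = HostLoad(cpu_utilization=0.3, network_utilization=0.2)
busy_host = HostLoad(cpu_utilization=0.9, network_utilization=0.2)

targets = pick_restore_targets(
    dedicated=["payments"],
    candidates=["payments", "orders", "trips", "maps"],
    random_count=2,
    rng=random.Random(0),
)
```

Passing an explicit `random.Random` instance keeps the random part of the selection reproducible in tests while still varying from run to run in production if it is seeded differently.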

Common Pitfalls

1. Failing to regularly test backup and recovery processes can lead to unexpected failures during actual recovery scenarios.

Without routine drills, teams may not be aware of potential issues in their recovery procedures, which can result in prolonged downtime and data loss during critical incidents.