Overview
The article discusses Uber's database backup and recovery system and its importance for business continuity and disaster recovery. It covers the challenges Uber faced, the architectural improvements it made, and the implementation of a Continuous Backup Continuous Recovery (CBCR) framework to improve data integrity and recovery speed.
What You'll Learn
1. How to implement a centralized backup scheduling system for databases
2. Why continuous validation of backup processes is crucial for disaster recovery
3. When to perform restore testing to ensure data integrity
Key Questions Answered
What challenges did Uber face in evolving its backup recovery system?
Uber faced several challenges, including rudimentary backup scheduling, an ad-hoc recovery process, lack of recovery drills, and outdated recovery objectives. These issues led to inefficiencies in backup workloads and recovery times, necessitating a more robust system.
How does Uber's Continuous Backup Continuous Recovery framework work?
The Continuous Backup Continuous Recovery framework provides a unified experience for managing backup and restore operations. It includes centralized scheduling for backups, periodic recovery testing, and integrates technology-specific plugins for efficient snapshot management, ensuring data integrity and quick recovery.
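To make the idea of centralized scheduling with technology-specific plugins more concrete, here is a minimal Python sketch. It is not Uber's implementation: the names `BackupPlugin`, `FakePlugin`, `CentralScheduler`, `tick`, and `restore_test` are hypothetical, the plugin only records snapshots in memory, and the backup interval is an assumed parameter.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    cluster: str
    technology: str
    taken_at: datetime


class BackupPlugin(ABC):
    """Hypothetical per-technology plugin interface (an assumption, not Uber's API)."""

    def __init__(self, technology: str) -> None:
        self.technology = technology

    @abstractmethod
    def take_snapshot(self, cluster: str, now: datetime) -> Snapshot:
        ...

    @abstractmethod
    def restore(self, snapshot: Snapshot) -> bool:
        ...


class FakePlugin(BackupPlugin):
    """Stand-in plugin: records snapshots instead of touching a real database."""

    def take_snapshot(self, cluster: str, now: datetime) -> Snapshot:
        return Snapshot(cluster, self.technology, now)

    def restore(self, snapshot: Snapshot) -> bool:
        # A real plugin would restore and validate data; this only checks the type.
        return snapshot.technology == self.technology


class CentralScheduler:
    """One scheduler decides when every registered cluster is backed up."""

    def __init__(self, interval: timedelta) -> None:
        self.interval = interval
        self.plugins: Dict[str, BackupPlugin] = {}
        self.clusters: List[Tuple[str, str]] = []  # (cluster, technology)
        self.latest: Dict[str, Snapshot] = {}

    def register(self, plugin: BackupPlugin) -> None:
        self.plugins[plugin.technology] = plugin

    def add_cluster(self, cluster: str, technology: str) -> None:
        if technology not in self.plugins:
            raise KeyError(f"no plugin registered for {technology}")
        self.clusters.append((cluster, technology))

    def tick(self, now: datetime) -> List[Snapshot]:
        """Snapshot every cluster whose last backup is older than the interval."""
        taken = []
        for cluster, technology in self.clusters:
            last = self.latest.get(cluster)
            if last is None or now - last.taken_at >= self.interval:
                snapshot = self.plugins[technology].take_snapshot(cluster, now)
                self.latest[cluster] = snapshot
                taken.append(snapshot)
        return taken

    def restore_test(self, cluster: str) -> bool:
        """Periodic recovery test: restore the latest snapshot via its plugin."""
        snapshot = self.latest[cluster]
        return self.plugins[snapshot.technology].restore(snapshot)
```

In this sketch, adding a new database technology means registering one more plugin, while the scheduling and restore-testing logic stays in a single place.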
What improvements were made to Uber's backup recovery objectives?
Uber reduced the Recovery Point Objective (RPO) from 7-21 days to 4-24 hours and improved the Recovery Time Objective (RTO), which the article expresses as restore throughput, to 300 TB per hour. These gains came from enhancements to backup scheduling and infrastructure.
Key Statistics & Figures
Data backed up
Close to 100 PB
This amount of data is backed up from databases at different intervals to support Uber's operations.
Recovery Time Objective (RTO)
300 TB per hour
This is the improved speed at which data can be restored following the optimizations made to the backup recovery system.
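As a rough worked example of what a throughput figure like this implies, the sketch below divides a dataset size by the restore rate. The function name and the 600 TB dataset are made up for illustration, and the calculation assumes the 300 TB per hour rate is sustained for the whole restore, which the source does not state.

```python
RESTORE_THROUGHPUT_TB_PER_HOUR = 300.0  # figure quoted in the summary above


def estimated_restore_hours(
    size_tb: float,
    throughput_tb_per_hour: float = RESTORE_THROUGHPUT_TB_PER_HOUR,
) -> float:
    """Estimate restore time as size / throughput (assumes a sustained rate)."""
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    if throughput_tb_per_hour <= 0:
        raise ValueError("throughput_tb_per_hour must be positive")
    return size_tb / throughput_tb_per_hour


if __name__ == "__main__":
    # 600 TB is a hypothetical dataset size, not a number from the article.
    print(estimated_restore_hours(600.0))  # prints 2.0
```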
Technologies & Tools
Database
MySQL
Used as one of the primary databases for storing online data.
Database
Apache Cassandra
Utilized for handling large-scale data storage and retrieval.
Database
etcd
Employed for distributed key-value storage and configuration management.
Database
Apache ZooKeeper
Used for coordinating distributed applications and managing configuration.
Key Actionable Insights
1. Implementing a centralized backup scheduling system can significantly improve data recovery times and reliability. By adapting backup workloads based on network and host resources, organizations can ensure that backups do not disrupt production traffic, leading to more efficient operations.
2. Regularly testing restore processes is essential for maintaining data integrity and ensuring operational resilience. Conducting both dedicated and random restore tests helps validate the effectiveness of backup strategies, allowing teams to identify and rectify potential issues before they impact business continuity.
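The two insights above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the utilization thresholds, and the sampling policy are assumptions, not details taken from the article.

```python
import random
from typing import List, Set


def should_start_backup(
    cpu_util: float,
    net_util: float,
    max_cpu: float = 0.7,  # hypothetical threshold
    max_net: float = 0.6,  # hypothetical threshold
) -> bool:
    """Start a backup only when the host and network have headroom."""
    return cpu_util < max_cpu and net_util < max_net


def pick_restore_targets(
    backups: List[str],
    dedicated: Set[str],
    sample_size: int,
    rng: random.Random,
) -> List[str]:
    """Combine dedicated restore tests with a random sample of other backups."""
    # Backups that must always be restore-tested.
    must_test = [b for b in backups if b in dedicated]
    # Everything else is eligible for random spot checks.
    pool = [b for b in backups if b not in dedicated]
    sampled = rng.sample(pool, min(sample_size, len(pool)))
    return must_test + sampled


if __name__ == "__main__":
    rng = random.Random()
    backups = ["orders", "payments", "trips", "users"]
    print(should_start_backup(cpu_util=0.3, net_util=0.4))
    print(pick_restore_targets(backups, {"payments"}, 2, rng))
```

Passing in the `random.Random` instance keeps the sampling reproducible in tests, while production use could rely on an unseeded generator so that every backup eventually gets exercised.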
Common Pitfalls
1. Failing to regularly test backup and recovery processes can lead to unexpected failures during actual recovery scenarios.
Without routine drills, teams may not be aware of potential issues in their recovery procedures, which can result in prolonged downtime and data loss during critical incidents.