How Shopify Manages Petabyte Scale MySQL Backup and Restore

At Shopify, we need a robust backup and restore solution for our MySQL servers. We used disk-based snapshots to reduced our RTO to under 30 minutes.

Akshay Suryawanshi
7 min readintermediate
--
View Original

Overview

This article discusses how Shopify manages MySQL backups and restores at a petabyte scale using Google Cloud Platform's Persistent Disk snapshots. It highlights the challenges faced with traditional backup methods and the improvements made to reduce Recovery Time Objective (RTO) to under 30 minutes.

What You'll Learn

1

How to leverage Google Cloud Platform's Persistent Disk snapshots for MySQL backups

2

Why using incremental snapshots can significantly reduce backup times

3

How to implement a backup retention policy using Kubernetes

4

When to use snapshots for disaster recovery and scaling replicas

Prerequisites & Requirements

  • Understanding of MySQL and cloud storage concepts
  • Familiarity with Google Cloud Platform and Kubernetes(optional)

Key Questions Answered

How does Shopify reduce its MySQL backup and restore times?
Shopify reduces its MySQL backup and restore times by using Google Cloud Platform's Persistent Disk snapshots, which allow for faster backups and restores. The initial snapshot of a multi-terabyte volume takes around 20 minutes, while incremental snapshots typically take less than 10 minutes. This approach has reduced their Recovery Time Objective (RTO) to under 30 minutes.
What are the challenges faced with traditional MySQL backup tools?
Traditional tools like Percona's Xtrabackup resulted in lengthy restore times exceeding six hours, which hindered disaster recovery and daily tasks such as rebuilding replicas. This inefficiency prompted the need for a more robust solution.
What is the backup verification process used by Shopify?
Shopify runs a backup verification process for daily backups, verifying two snapshots per shard, per region. This includes checking if a MySQL instance can start, if replication can initiate using Global Transaction ID, and if there is any InnoDB page-level corruption.
What are the pros and cons of using Persistent Disk snapshots for backups?
The pros include faster backups and restores, reduced backup times due to incremental snapshots, and lower performance impact on MySQL instances. The cons involve higher storage costs compared to traditional backups, lack of integrity checks during the snapshot process, and complexity in off-site storage.

Key Statistics & Figures

Initial snapshot time for multi-terabyte PD volume
20 minutes
This is the time taken to create the first snapshot of a Persistent Disk volume.
Incremental snapshot time
less than 10 minutes
This is the typical time taken for subsequent incremental snapshots.
Recovery Time Objective (RTO)
under 30 minutes
This is the time taken to restore data using the new snapshot-based approach.
Number of snapshots created daily
nearly 100 times a day
This reflects the frequency of snapshots taken across all MySQL instances.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Implementing a snapshot-based backup strategy can drastically reduce backup and restore times.
By leveraging GCP's Persistent Disk snapshots, organizations can achieve faster backups and significantly lower Recovery Time Objectives, which is crucial for maintaining service availability.
2
Establishing a robust retention policy is essential to manage storage costs effectively.
Creating a framework to delete unnecessary snapshots ensures that only critical backups are retained, preventing excessive storage expenses while meeting disaster recovery requirements.
3
Regularly verifying backups is vital for data integrity and reliability.
Implementing a verification process helps ensure that backups are usable and free from corruption, which is essential for maintaining trust in the backup system.

Common Pitfalls

1
Failing to implement a proper backup retention policy can lead to excessive storage costs.
Without a clear retention strategy, organizations may end up retaining unnecessary snapshots, resulting in inflated storage expenses.
2
Neglecting backup verification can compromise data integrity.
If backups are not regularly verified, there is a risk that corrupted or unusable backups could be relied upon during recovery, leading to potential data loss.

Related Concepts

Disaster Recovery Strategies
Incremental Backups
Cloud Storage Solutions
Database Scaling Techniques