How Stripe’s document databases supported 99.999% uptime with zero-downtime data migrations

In this blog post we’ll share an overview of Stripe’s database infrastructure and discuss the design and application of the Data Movement Platform.

Jimmy Morzaria

Overview

In 2023, Stripe processed $1 trillion in total payments volume while maintaining 99.999% uptime. Its in-house database-as-a-service, DocDB, is built on MongoDB. The article covers the architecture of DocDB and the Data Movement Platform, which enables zero-downtime data migrations across database shards and underpins that reliability and scalability.

What You'll Learn

1. How to implement zero-downtime data migrations in a database infrastructure
2. Why using a Data Movement Platform is crucial for scaling database services
3. How to optimize data insertion order to enhance write throughput

Prerequisites & Requirements

  • Understanding of database sharding and distributed systems concepts
  • Familiarity with MongoDB and its operational characteristics (optional)

Key Questions Answered

How does Stripe achieve 99.999% uptime with its database infrastructure?
Stripe runs its own database-as-a-service, DocDB, which is built on MongoDB and serves over five million queries per second. Its Data Movement Platform performs zero-downtime data migrations, so scaling and maintenance work does not reduce availability.
What is the purpose of the Data Movement Platform in Stripe's infrastructure?
The Data Movement Platform is designed to manage online data migrations across database shards without downtime. It allows for efficient scaling, merging underutilized shards, and upgrading the database engine while maintaining data consistency and availability.
What challenges does Stripe face during data migrations?
Stripe faces challenges such as ensuring data consistency and completeness, minimizing downtime during migrations, and maintaining performance on source shards. The Data Movement Platform addresses these issues by allowing for client-transparent migrations and managing the complexities of distributed systems.
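
One way to picture a client-transparent migration is a routing table that maps key ranges to shards. Clients always ask the table where a key lives, so moving a range only requires changing one entry once the target shard has caught up. The sketch below is a minimal illustration under that assumption; `ChunkRouter`, its methods, and the shard names are hypothetical and are not taken from DocDB.

```python
import bisect
from dataclasses import dataclass


@dataclass
class ChunkRouter:
    """Hypothetical routing table: chunk i covers keys in [bounds[i], bounds[i+1])."""

    bounds: list[str]  # sorted lower bounds; the first is "" so every key matches a chunk
    shards: list[str]  # shards[i] currently owns chunk i

    def shard_for(self, key: str) -> str:
        # Find the last chunk whose lower bound is <= key.
        i = bisect.bisect_right(self.bounds, key) - 1
        return self.shards[i]

    def switch_chunk(self, lower_bound: str, target_shard: str) -> None:
        # Repoint one chunk at its new owner; callers keep using shard_for unchanged.
        i = self.bounds.index(lower_bound)
        self.shards[i] = target_shard


router = ChunkRouter(bounds=["", "h", "p"], shards=["shard-a", "shard-a", "shard-b"])
before = router.shard_for("merchant_42")  # falls in the chunk starting at "h"
router.switch_chunk("h", "shard-c")       # simulated traffic switch after a migration
after = router.shard_for("merchant_42")
```

Because callers only ever go through `shard_for`, they need no changes when a chunk moves. A production system would also have to version the table and block or retry writes during the switch, which this sketch leaves out.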

Key Statistics & Figures

Total payments volume processed: $1 trillion
This was achieved in 2023 while maintaining 99.999% uptime.
Queries served per second: over five million
This reflects the performance capability of Stripe's DocDB.
Data migrated: 1.5 petabytes
This was achieved transparently to product applications using the Data Movement Platform.
Reduction in total number of DocDB shards: approximately three quarters
This reduction was part of the optimization process facilitated by the Data Movement Platform.

Key Actionable Insights

1. Implement a Data Movement Platform to facilitate zero-downtime migrations in your database infrastructure.
   This approach allows for seamless scaling and upgrades without affecting service availability, which is critical for businesses that rely on continuous operations.
2. Optimize your data insertion order to improve write throughput significantly.
   By arranging data based on common index attributes before insertion, you can enhance performance, as demonstrated by Stripe's 10x improvement in write throughput.
3. Utilize sharding to manage large datasets effectively and maintain performance.
   Sharding allows for distributing data across multiple database instances, which can help in handling high query volumes and improving overall system reliability.
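
The second insight above can be illustrated with a small sketch: sort documents by the fields of the most commonly used index, then insert them in batches, so that consecutive writes touch neighbouring index entries instead of scattered ones. The helper names and the `merchant_id` and `created` fields are assumptions for illustration only; a real bulk loader would hand each batch to the database driver.

```python
from itertools import islice
from typing import Any, Iterable, Iterator

Document = dict[str, Any]


def order_for_bulk_insert(docs: Iterable[Document], index_fields: list[str]) -> list[Document]:
    """Return the documents sorted by the given index fields, in field order."""
    return sorted(docs, key=lambda doc: tuple(doc[f] for f in index_fields))


def batches(docs: list[Document], size: int) -> Iterator[list[Document]]:
    """Yield consecutive batches of at most `size` documents."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch


docs = [
    {"merchant_id": "m2", "created": 20},
    {"merchant_id": "m1", "created": 30},
    {"merchant_id": "m1", "created": 10},
]

# Hypothetical index on (merchant_id, created): sort first, then insert batch by batch.
ordered = order_for_bulk_insert(docs, ["merchant_id", "created"])
insert_batches = list(batches(ordered, size=2))
```

Sorting costs extra work up front, so this trade-off tends to pay off for large bulk imports rather than for small, latency-sensitive writes.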

Common Pitfalls

1. Failing to ensure data consistency during migrations can lead to data integrity issues.
   This can happen if the migration process does not adequately synchronize data between source and target shards, which is critical in financial applications.
2. Not optimizing data insertion order can result in poor write performance.
   If data is inserted without considering the indexing strategy, it can lead to increased latency and decreased throughput, impacting overall system performance.
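
To make the first pitfall concrete, here is a minimal sketch of a completeness and correctness check that compares a source and a target copy of the same data before traffic is switched. It is an illustrative assumption, not Stripe's actual check: the function names, the `_id` key, and the in-memory lists standing in for shard snapshots are all hypothetical.

```python
import hashlib
import json
from typing import Any

Document = dict[str, Any]


def doc_digest(doc: Document) -> str:
    """Hash a document in a key-order-independent way."""
    payload = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_shards(source: list[Document], target: list[Document], key: str = "_id") -> dict[str, list]:
    """Report documents that are missing from, extra on, or different on the target."""
    src = {d[key]: doc_digest(d) for d in source}
    tgt = {d[key]: doc_digest(d) for d in target}
    return {
        "missing": sorted(src.keys() - tgt.keys()),
        "extra": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


source = [{"_id": 1, "amount": 100}, {"_id": 2, "amount": 200}, {"_id": 3, "amount": 300}]
target = [{"_id": 1, "amount": 100}, {"_id": 2, "amount": 999}, {"_id": 4, "amount": 400}]

report = diff_shards(source, target)
```

A migration would proceed only when all three lists are empty. At real scale the comparison would run over snapshots in chunks rather than loading everything into memory.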

Related Concepts

Database Sharding
Distributed Systems
Data Consistency And Availability
Online Data Migrations