Overview
This article discusses Notion's recent horizontal re-sharding of its PostgreSQL database to accommodate increased traffic without downtime. It details the challenges faced, the implementation process, and the outcomes of this significant infrastructure upgrade.
What You'll Learn
1
How to horizontally re-shard a PostgreSQL database for increased capacity
2
Why using logical replication is essential for data synchronization during migrations
3
How to effectively manage PgBouncer connection limits during database scaling
Prerequisites & Requirements
- Understanding of PostgreSQL architecture and sharding concepts
- Familiarity with PgBouncer and Terraform(optional)
Key Questions Answered
What challenges did Notion face with its previous database architecture?
Notion faced several challenges including exceeding 90% CPU utilization at peak traffic, nearing full disk bandwidth utilization, and connection limits with PgBouncer. These issues prompted the need for a more scalable solution to handle increased user demand.
How did Notion ensure zero downtime during the database migration?
Notion implemented a horizontal re-sharding strategy that involved provisioning new database shards and using logical replication to synchronize data. This approach allowed them to manage traffic without any observable downtime for users.
What was the outcome of the database re-sharding process?
The re-sharding process resulted in a tripling of the database instances from 32 to 96, significantly reducing CPU and IOPS utilization. This provided ample headroom for future growth while maintaining performance during peak traffic.
What role did PgBouncer play in Notion's database architecture?
PgBouncer served as a connection pooling layer between the application web servers and the databases. It helped manage database connections efficiently, especially during the scaling process when new shards were added.
Key Statistics & Figures
Number of database instances after scaling
96
Increased from 32 to 96 to accommodate growing traffic demands.
CPU and IOPS utilization after re-sharding
20%
New utilization levels during peak traffic, providing significant headroom for growth.
Time to synchronize data across new shards
12 hours
Reduced from 3 days by optimizing the data copy process.
Technologies & Tools
Some links below are affiliate links. We may earn a commission if you make a purchase.
Database
Postgresql
Used as the primary database system for Notion's application.
Connection Pooling
Pgbouncer
Serves as a connection pool manager to optimize database connections.
Infrastructure As Code
Terraform
Automated the provisioning of new database instances.
Key Actionable Insights
1Implement horizontal sharding to distribute database load effectively.This approach allows for better performance and scalability as user demand increases, ensuring that no single database instance becomes a bottleneck.
2Utilize logical replication for data synchronization during migrations.This technique ensures that data remains consistent across old and new database instances, minimizing the risk of data loss during transitions.
3Carefully manage connection limits in PgBouncer to avoid overwhelming database instances.By sharding the PgBouncer cluster, you can effectively distribute connection loads, allowing for smoother transitions during scaling operations.
Common Pitfalls
1
Failing to manage connection limits can overwhelm database instances during migration.
If too many connections are allowed, it can lead to performance degradation and potential downtime. Properly sharding the PgBouncer cluster mitigates this risk.
Related Concepts
Database Sharding
Postgresql Replication
Connection Pooling With Pgbouncer
Infrastructure Automation With Terraform