Building and scaling Notion’s data lake

XZ Tie, Nathan Louie, Thomas Chow, Darin Im, Abhishek Modi, Wendy Jiao
13 min readadvanced
--
View Original

Overview

The article discusses how Notion built and scaled its data lake to manage a tenfold increase in data over three years, driven by user and content growth. It details the architecture, design decisions, and technologies used to enhance data management and support product features like Notion AI.

What You'll Learn

1

How to implement a data lake architecture using Kafka and Apache Hudi

2

Why incremental data ingestion is preferred over snapshot dumps in data pipelines

3

How to optimize data processing for update-heavy workloads using Spark

4

When to use S3 as a data repository for large-scale data storage

Prerequisites & Requirements

  • Understanding of data lake concepts and architectures
  • Familiarity with Kafka and Apache Hudi(optional)
  • Experience with data processing frameworks like Spark(optional)

Key Questions Answered

How did Notion manage its data growth over the past three years?
Notion's data grew tenfold in three years, necessitating the development of a scalable data lake. They transitioned from a single PostgreSQL instance to a sharded architecture with 96 physical instances and 480 logical shards to manage the increasing data volume effectively.
What technologies did Notion use to build its data lake?
Notion's data lake is built on a combination of Kafka for data streaming, Apache Hudi for data processing and storage, and S3 for data storage. This architecture allows for efficient handling of their update-heavy workloads.
What challenges did Notion face while scaling its data infrastructure?
Notion encountered challenges related to operability, data freshness, and cost due to the overhead of managing numerous connectors and the update-heavy nature of their data. These challenges prompted the exploration of a dedicated data lake.
Why did Notion choose S3 as its data repository?
Notion selected S3 because it aligns with their AWS tech stack, supports large data storage, and integrates well with various data processing engines like Spark. This choice enhances scalability and cost-efficiency.

Key Statistics & Figures

Data growth rate
10x
Notion's data expanded tenfold over three years due to user and content growth.
Block rows in PostgreSQL
more than 200 billion
The number of block rows in PostgreSQL increased from over 20 billion at the start of 2021 to over 200 billion.
Logical shards maintained
480
Notion expanded its database infrastructure to maintain 480 logical shards for scalable data management.
Cost savings in 2022
over a million dollars
Notion achieved significant cost savings by migrating large datasets to their data lake.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Implementing a data lake can significantly reduce costs associated with data storage and processing.
Notion reported a net savings of over a million dollars in 2022 by moving large datasets to their data lake, demonstrating the financial benefits of this architecture.
2
Utilizing incremental ingestion methods can enhance data freshness and reduce costs.
By opting for incremental data ingestion over full snapshots, Notion achieved faster data availability and minimized operational costs, ensuring timely access to critical data.
3
Choosing the right processing engine is crucial for handling specific workloads effectively.
Notion selected Apache Spark for its ability to manage complex data processing tasks efficiently, particularly for their update-heavy workloads, which is essential for maintaining performance.

Common Pitfalls

1
Overloading live databases during data ingestion can lead to performance issues.
This often occurs when large volumes of data are ingested without proper management, causing slowdowns or outages. Implementing incremental ingestion strategies can help mitigate this risk.
2
Neglecting to optimize data processing for update-heavy workloads can lead to inefficiencies.
Many data processing frameworks are optimized for insert-heavy workloads, which can cause challenges when handling update-heavy data. Choosing the right tools and configurations is essential for maintaining performance.

Related Concepts

Data Lakes
Change Data Capture (cdc)
Data Processing Frameworks
Incremental Data Ingestion