Building and scaling Notion’s data lake

XZ Tie, Nathan Louie, Thomas Chow, Darin Im, Abhishek Modi, Wendy Jiao

Notion

•

XZ Tie, Nathan Louie, Thomas Chow, Darin Im, Abhishek Modi, Wendy Jiao

•13 min read•advanced•

--

•View Original

ApacheAWSAWS RDSEmbeddingPySparkScalaSQLVector Database

Overview

The article discusses how Notion built and scaled its data lake to manage a tenfold increase in data over three years, driven by user and content growth. It details the architecture, design decisions, and technologies used to enhance data management and support product features like Notion AI.

What You'll Learn

1

How to implement a data lake architecture using Kafka and Apache Hudi

2

Why incremental data ingestion is preferred over snapshot dumps in data pipelines

3

How to optimize data processing for update-heavy workloads using Spark

4

When to use S3 as a data repository for large-scale data storage

Prerequisites & Requirements

Understanding of data lake concepts and architectures
Familiarity with Kafka and Apache Hudi(optional)
Experience with data processing frameworks like Spark(optional)

Key Questions Answered

How did Notion manage its data growth over the past three years?

Notion's data grew tenfold in three years, necessitating the development of a scalable data lake. They transitioned from a single PostgreSQL instance to a sharded architecture with 96 physical instances and 480 logical shards to manage the increasing data volume effectively.

What technologies did Notion use to build its data lake?

Notion's data lake is built on a combination of Kafka for data streaming, Apache Hudi for data processing and storage, and S3 for data storage. This architecture allows for efficient handling of their update-heavy workloads.

What challenges did Notion face while scaling its data infrastructure?

Notion encountered challenges related to operability, data freshness, and cost due to the overhead of managing numerous connectors and the update-heavy nature of their data. These challenges prompted the exploration of a dedicated data lake.

Why did Notion choose S3 as its data repository?

Notion selected S3 because it aligns with their AWS tech stack, supports large data storage, and integrates well with various data processing engines like Spark. This choice enhances scalability and cost-efficiency.

Key Statistics & Figures

Data growth rate

10x

Notion's data expanded tenfold over three years due to user and content growth.

Block rows in PostgreSQL

more than 200 billion

The number of block rows in PostgreSQL increased from over 20 billion at the start of 2021 to over 200 billion.

Logical shards maintained

480

Notion expanded its database infrastructure to maintain 480 logical shards for scalable data management.

Cost savings in 2022

over a million dollars

Notion achieved significant cost savings by migrating large datasets to their data lake.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database

Postgresql

Used as the primary database for storing block data.

Streaming

Kafka

Utilized for streaming data changes from PostgreSQL to the data lake.

Data Processing

Apache Hudi

Employed for processing and storing data in the data lake.

Storage

S3

Serves as the storage layer for raw and processed data.

Data Processing

Spark

Used for data transformation and processing tasks.

Key Actionable Insights

1
Implementing a data lake can significantly reduce costs associated with data storage and processing.
Notion reported a net savings of over a million dollars in 2022 by moving large datasets to their data lake, demonstrating the financial benefits of this architecture.

2
Utilizing incremental ingestion methods can enhance data freshness and reduce costs.
By opting for incremental data ingestion over full snapshots, Notion achieved faster data availability and minimized operational costs, ensuring timely access to critical data.

3
Choosing the right processing engine is crucial for handling specific workloads effectively.
Notion selected Apache Spark for its ability to manage complex data processing tasks efficiently, particularly for their update-heavy workloads, which is essential for maintaining performance.

Common Pitfalls

1

Overloading live databases during data ingestion can lead to performance issues.

This often occurs when large volumes of data are ingested without proper management, causing slowdowns or outages. Implementing incremental ingestion strategies can help mitigate this risk.

2

Neglecting to optimize data processing for update-heavy workloads can lead to inefficiencies.

Many data processing frameworks are optimized for insert-heavy workloads, which can cause challenges when handling update-heavy data. Choosing the right tools and configurations is essential for maintaining performance.

Related Concepts

Data Lakes

Change Data Capture (cdc)

Data Processing Frameworks

Incremental Data Ingestion

Slack Data Engineering recently underwent data workload migration from AWS EMR 5 (Spark 2/Hive 2 processing engine) to EMR 6 (Spark 3 processing engine). In this blog, we will share our migration journey, challenges, and the performance gains we observed in the process. This blog aims to assist Data Engineers, Data Infrastructure Engineers, and Product…

AWSScalaSQL

12 min read

Includes Code

Has Summary

--

Uber

Intermediate

Horovod Adds Support for PySpark and Apache MXNet and Additional Features for Faster Training

AWSSQLAzure

7 min read

Has Summary

--

NVIDIA

Intermediate

Merging Telemetry and Logs from Microservices at Scale with Apache Spark

One of the most common challenges with big data is the ability to merge data from several sources with minimal cost and latency. It’s an even bigger challenge…

AWSSQLApache Kafka

11 min read

Includes Code

Has Summary

--

These articles from Slack and other leading engineering teams share similar topics with "Building and scaling Notion’s data lake". Explore more engineering insights on AWS, Scala, SQL.