A brief history of Notion’s data catalog

Wendy Jiao, Parul Baweja, Evelyn Wou
13 min read · intermediate

Overview

This article explores the evolution of Notion's data catalog, detailing the challenges faced and solutions implemented across three distinct phases. It highlights the transition from chaotic data management to a structured approach, emphasizing the integration of TypeScript and AI-driven processes for metadata generation.

What You'll Learn

  1. How to impose structure on unstructured JSON data using TypeScript
  2. Why integrating AI can enhance metadata description generation
  3. How to automate the propagation of metadata across data systems

Prerequisites & Requirements

  • Familiarity with TypeScript and JSON data formats
  • Experience with data catalog tools like Acryl DataHub (optional)

Key Questions Answered

What were the main challenges faced in Notion's early data management?
Notion's early data management had no formal guidelines, which led to inconsistent naming. Ownership was unclear, which caused governance problems, and data was hard to discover. As the organization grew, these issues made data less useful and hindered product decision-making.

How did Notion improve user engagement with its data catalog?
Notion first identified what was wrong with its existing data catalog: the data was unstructured and metadata descriptions were missing. It then defined data models in TypeScript and used AI to generate consistent metadata descriptions, which made the catalog easier to use and data easier to discover.

What design decisions were made to enhance the data catalog?
Key design decisions included selecting TypeScript as the Interface Definition Language (IDL) because it was compatible with the existing codebase, adopting JSON Schema for integration with the data catalog, and automating metadata description generation with AI, with human feedback to ensure accuracy.
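
To make the TypeScript-to-JSON-Schema idea concrete, here is a minimal sketch of deriving a JSON Schema document from a typed model description. It is not Notion's actual pipeline: `FieldSpec`, `ModelSpec`, `toJsonSchema`, and the `PageBlock` example are hypothetical names chosen for illustration, and only flat models with primitive fields are handled.

```typescript
// Hypothetical, simplified model description. Not Notion's actual types.
type FieldType = "string" | "number" | "boolean";

interface FieldSpec {
  type: FieldType;
  description?: string;
  optional?: boolean;
}

type ModelSpec = Record<string, FieldSpec>;

interface JsonSchemaObject {
  $schema: string;
  title: string;
  type: "object";
  properties: Record<string, { type: FieldType; description?: string }>;
  required: string[];
  additionalProperties: boolean;
}

// Convert a flat model description into a JSON Schema object.
function toJsonSchema(title: string, model: ModelSpec): JsonSchemaObject {
  const properties: JsonSchemaObject["properties"] = {};
  const required: string[] = [];
  for (const name of Object.keys(model)) {
    const spec = model[name];
    if (!spec) continue;
    // Only emit "description" when one was provided.
    properties[name] =
      spec.description === undefined
        ? { type: spec.type }
        : { type: spec.type, description: spec.description };
    // Fields are required unless explicitly marked optional.
    if (!spec.optional) required.push(name);
  }
  return {
    $schema: "https://json-schema.org/draft/2020-12/schema",
    title,
    type: "object",
    properties,
    required,
    additionalProperties: false,
  };
}

// Example model (assumed for illustration).
const pageBlock: ModelSpec = {
  id: { type: "string", description: "Unique block identifier" },
  title: { type: "string", optional: true },
  archived: { type: "boolean" },
};

const pageBlockSchema = toJsonSchema("PageBlock", pageBlock);
```

The resulting object can be serialized with `JSON.stringify(pageBlockSchema)` and handed to any tool that accepts JSON Schema. A production setup would more likely generate the schema from real TypeScript type declarations with a dedicated generator rather than from a hand-written descriptor like this one.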

Technologies & Tools

  • TypeScript (programming language): used as the Interface Definition Language (IDL) for defining data models.
  • JSON Schema (data format): adopted for compatibility with data catalog tools.
  • Acryl DataHub (data catalog tool): used for managing and displaying data schemas.
  • AI (technology): employed for generating metadata descriptions with human feedback.

Key Actionable Insights

  1. Integrating TypeScript as your IDL can streamline data management processes.
     By leveraging existing TypeScript types, teams can avoid redundancy and ensure type safety, which enhances the reliability of data models across applications.
  2. Automating metadata generation with AI can significantly reduce manual effort.
     This approach not only saves time but also ensures that descriptions remain consistent and up-to-date, which is crucial for effective data governance.
  3. Implementing a human review process for AI-generated descriptions is essential.
     This step mitigates the risks associated with inaccuracies in AI outputs, fostering trust in the data catalog and ensuring that users have reliable information.
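
One way to picture the human review step is as a gate between AI-drafted descriptions and what the catalog publishes. The sketch below is an assumption-laden illustration, not the workflow from the original article: `DescriptionDraft`, `draftDescriptions`, `review`, and `publishable` are hypothetical names, and the `describe` callback stands in for whatever model call produces the draft text.

```typescript
type ReviewStatus = "pending" | "approved" | "rejected";

interface DescriptionDraft {
  field: string;
  text: string;
  source: "ai" | "human";
  status: ReviewStatus;
}

// Stand-in for an AI call; any function from field name to text works here.
type Describe = (field: string) => string;

// Every AI-generated description starts out as "pending".
function draftDescriptions(fields: string[], describe: Describe): DescriptionDraft[] {
  return fields.map(
    (field): DescriptionDraft => ({
      field,
      text: describe(field),
      source: "ai",
      status: "pending",
    })
  );
}

// A reviewer approves, rejects, or approves with a corrected text.
function review(draft: DescriptionDraft, approved: boolean, correctedText?: string): DescriptionDraft {
  if (!approved) return { ...draft, status: "rejected" };
  if (correctedText !== undefined) {
    return { ...draft, text: correctedText, source: "human", status: "approved" };
  }
  return { ...draft, status: "approved" };
}

// Only approved descriptions are eligible for the catalog.
function publishable(drafts: DescriptionDraft[]): Record<string, string> {
  const out: Record<string, string> = {};
  for (const d of drafts) {
    if (d.status === "approved") out[d.field] = d.text;
  }
  return out;
}

// Example run with a fake generator (assumed for illustration).
const drafts = draftDescriptions(
  ["user_id", "created_at", "plan"],
  (field) => `Auto-generated description for ${field}`
);

const reviewed = drafts.map((d): DescriptionDraft => {
  if (d.field === "user_id") return review(d, true);
  if (d.field === "created_at") return review(d, true, "Row creation time in UTC");
  return d; // "plan" has not been reviewed yet and stays pending
});

const published = publishable(reviewed);
```

Here `published` contains entries for `user_id` and `created_at` only; the unreviewed `plan` description is held back until someone approves it.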

Common Pitfalls

  1. Failing to establish clear ownership and governance can lead to data quality issues.
     Without defined responsibilities, data can become inconsistent and unreliable, making it difficult for teams to trust the information they are using for decision-making.
  2. Neglecting the importance of metadata can hinder data discoverability.
     If metadata descriptions are missing or outdated, users may struggle to understand the data's context and usage, leading to underutilization of valuable data assets.
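
A simple guard against the second pitfall is a check that lists fields with no description, which could run in CI or on a schedule. This is a minimal sketch under assumed shapes: `CatalogField`, `CatalogDataset`, and `findUndocumented` are hypothetical and do not correspond to any particular catalog tool's API.

```typescript
interface CatalogField {
  name: string;
  description?: string;
}

interface CatalogDataset {
  name: string;
  fields: CatalogField[];
}

// Return "dataset.field" for every field whose description is missing or blank.
function findUndocumented(datasets: CatalogDataset[]): string[] {
  const missing: string[] = [];
  for (const ds of datasets) {
    for (const f of ds.fields) {
      if (f.description === undefined || f.description.trim() === "") {
        missing.push(`${ds.name}.${f.name}`);
      }
    }
  }
  return missing;
}

// Example input (assumed for illustration).
const undocumented = findUndocumented([
  {
    name: "events",
    fields: [
      { name: "event_id", description: "Unique event identifier" },
      { name: "payload" },
      { name: "source", description: "   " },
    ],
  },
  { name: "users", fields: [{ name: "user_id", description: "Unique user identifier" }] },
]);
```

Detecting outdated descriptions is harder than detecting missing ones; it would need extra signals, such as comparing when a description was last edited against when the schema last changed, which this sketch does not attempt.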