Next Generation DB Ingestion at Pinterest

Pinterest Engineering
10 min readintermediate
--
View Original

Overview

The article discusses Pinterest's transition to a next-generation database ingestion framework designed to address the limitations of legacy systems. It highlights the challenges faced with batch-oriented workflows and presents a unified Change Data Capture (CDC)-based architecture that improves data latency, efficiency, and compliance.

What You'll Learn

1

How to implement a unified database ingestion framework using Change Data Capture

2

Why transitioning from batch to real-time data ingestion is crucial for modern applications

3

How to optimize Upsert operations using SparkSQL and Iceberg

Prerequisites & Requirements

  • Understanding of Change Data Capture and data ingestion concepts
  • Familiarity with Apache Kafka, Spark, and Iceberg(optional)

Key Questions Answered

What are the main challenges of legacy database ingestion systems?
Legacy systems at Pinterest faced high data latency, inefficiencies due to full-table batch jobs, lack of support for row-level deletion, and operational complexity. These issues hindered real-time analytics and increased resource wastage.
How does the new ingestion framework improve data processing?
The new ingestion framework utilizes Change Data Capture, allowing for real-time access to online database changes in minutes, processing only changed records, and supporting row-level deletions, resulting in significant cost savings and improved compliance.
What optimizations were made to enhance Upsert operations?
Optimizations included partitioning the base table by primary key hash to improve parallel processing and implementing a Bucket Join strategy to reduce compute costs and latency during Upsert operations, achieving over a 40% reduction in compute costs.
How does the CDC table differ from the base table?
The CDC table serves as a time-series, append-only ledger recording every change event with latency under 5 minutes, while the base table mirrors the online table, preserving historical records with a latency of 15 minutes to an hour.

Key Statistics & Figures

Data latency for updates
Under 5 minutes
This is the latency achieved by the CDC table, significantly improving upon the previous system's 24-hour update latency.
Reduction in compute costs
40%+
This reduction was achieved by implementing Bucket Join strategies during Upsert operations.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Key Actionable Insights

1
Transitioning to a real-time ingestion framework can significantly enhance data availability for analytics and machine learning applications.
By moving away from batch processing, organizations can ensure that data is updated in near real-time, allowing for more accurate and timely insights.
2
Implementing Change Data Capture can streamline data processing and reduce resource wastage.
This approach focuses on processing only changed records, which can lead to substantial cost savings in infrastructure.
3
Utilizing partitioning strategies in Spark can drastically improve Upsert performance.
Partitioning by primary key hash allows for parallel processing, reducing the amount of data scanned and improving efficiency.

Common Pitfalls

1
Failing to optimize Upsert operations can lead to high compute costs and latency.
Without proper partitioning and efficient join strategies, Upsert operations can become prohibitively expensive, especially with large datasets.
2
Over-reliance on batch processing can hinder real-time analytics capabilities.
Organizations that do not transition to real-time ingestion frameworks may struggle to provide timely insights, impacting decision-making.

Related Concepts

Change Data Capture
Real-time Data Processing
Data Ingestion Frameworks
Apache Kafka
Apache Spark