Next Generation DB Ingestion at Pinterest

Pinterest Engineering

•

Pinterest Engineering

•10 min read•intermediate•

--

•View Original

ApacheApache KafkaApache SparkAWSMySQLOracle

Overview

The article discusses Pinterest's transition to a next-generation database ingestion framework designed to address the limitations of legacy systems. It highlights the challenges faced with batch-oriented workflows and presents a unified Change Data Capture (CDC)-based architecture that improves data latency, efficiency, and compliance.

What You'll Learn

1

How to implement a unified database ingestion framework using Change Data Capture

2

Why transitioning from batch to real-time data ingestion is crucial for modern applications

3

How to optimize Upsert operations using SparkSQL and Iceberg

Prerequisites & Requirements

Understanding of Change Data Capture and data ingestion concepts
Familiarity with Apache Kafka, Spark, and Iceberg(optional)

Key Questions Answered

What are the main challenges of legacy database ingestion systems?

Legacy systems at Pinterest faced high data latency, inefficiencies due to full-table batch jobs, lack of support for row-level deletion, and operational complexity. These issues hindered real-time analytics and increased resource wastage.

How does the new ingestion framework improve data processing?

The new ingestion framework utilizes Change Data Capture, allowing for real-time access to online database changes in minutes, processing only changed records, and supporting row-level deletions, resulting in significant cost savings and improved compliance.

What optimizations were made to enhance Upsert operations?

Optimizations included partitioning the base table by primary key hash to improve parallel processing and implementing a Bucket Join strategy to reduce compute costs and latency during Upsert operations, achieving over a 40% reduction in compute costs.

How does the CDC table differ from the base table?

The CDC table serves as a time-series, append-only ledger recording every change event with latency under 5 minutes, while the base table mirrors the online table, preserving historical records with a latency of 15 minutes to an hour.

Key Statistics & Figures

Data latency for updates

Under 5 minutes

This is the latency achieved by the CDC table, significantly improving upon the previous system's 24-hour update latency.

Reduction in compute costs

40%+

This reduction was achieved by implementing Bucket Join strategies during Upsert operations.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend

Change Data Capture

Used to capture database changes in real-time.

Backend

Apache Kafka

Facilitates the streaming of change events.

Backend

Apache Flink

Processes incoming CDC events in near real-time.

Backend

Apache Spark

Used for batch processing and Upsert operations.

Data Lake

Apache Iceberg

Stores data in a format that supports efficient querying and updates.

Key Actionable Insights

1
Transitioning to a real-time ingestion framework can significantly enhance data availability for analytics and machine learning applications.
By moving away from batch processing, organizations can ensure that data is updated in near real-time, allowing for more accurate and timely insights.

2
Implementing Change Data Capture can streamline data processing and reduce resource wastage.
This approach focuses on processing only changed records, which can lead to substantial cost savings in infrastructure.

3
Utilizing partitioning strategies in Spark can drastically improve Upsert performance.
Partitioning by primary key hash allows for parallel processing, reducing the amount of data scanned and improving efficiency.

Common Pitfalls

1

Failing to optimize Upsert operations can lead to high compute costs and latency.

Without proper partitioning and efficient join strategies, Upsert operations can become prohibitively expensive, especially with large datasets.

2

Over-reliance on batch processing can hinder real-time analytics capabilities.

Organizations that do not transition to real-time ingestion frameworks may struggle to provide timely insights, impacting decision-making.

Related Concepts

Change Data Capture

Real-time Data Processing

Data Ingestion Frameworks

Apache Kafka

Apache Spark