Tracker: Ingesting MySQL data at scale — Part 1

Pinterest Engineering

•

Pinterest Engineering

•4 min read•intermediate•

--

•View Original

ApacheMySQLPythonRedis

Overview

The article discusses Pinterest's approach to ingesting large volumes of MySQL data at scale through a system called Tracker. It outlines the challenges faced with the original ingestion framework and how Tracker was designed to improve data movement efficiency and reliability.

What You'll Learn

1

How to implement a two-stage data ingestion pipeline from MySQL to S3

2

Why handling failovers is critical in database ingestion processes

3

How to optimize Hadoop jobs for better performance during data transformation

Prerequisites & Requirements

Understanding of MySQL and Hadoop ecosystems
Familiarity with S3 for data storage(optional)

Key Questions Answered

What challenges did Pinterest face with their original MySQL ingestion framework?

Pinterest encountered several challenges, including performance issues during data dumps, failure to handle database failovers, and difficulties with schema changes. Large tables took hours to process, and the system struggled with binary data, leading to operational inefficiencies.

How does Tracker improve the data ingestion process from MySQL?

Tracker improves the ingestion process by breaking it into two stages: backing up data to S3 from the DB host and then using a Hadoop cluster to read and transform these backups. This method allows for better handling of failovers and increased parallel processing capabilities.

What is the significance of using multiple slaves in the Tracker system?

Using multiple slaves allows Tracker to parallelize the data ingestion process, significantly reducing the time required to move large tables. This approach addresses the limitations of the previous framework, which could only read from a single slave.

What performance optimizations were implemented in Tracker?

Tracker introduced more mappers and speculative execution in Hadoop jobs to improve performance. This allows the system to handle stuck S3 uploads more effectively and reduces turnaround time for data processing.

Key Statistics & Figures

Daily data collection from MySQL

more than 100 terabytes

This volume of data is collected daily to drive Pinterest's offline computation workflows.

Time taken to ingest Pins into Hadoop

more than 12 hours

This was a significant issue with the original ingestion framework, highlighting the need for improvements.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Database

Mysql

Primary data source for storing important objects in Pinterest.

Data Processing

Hadoop

Used for processing and transforming data ingested from MySQL.

Storage

S3

Destination for backing up MySQL data before processing.

Workflow Management

Pinball

Workflow manager used to orchestrate data movement.

Key Actionable Insights

1
Implement a two-stage data ingestion pipeline to enhance reliability and performance.
This approach allows for better management of data backups and transformations, ensuring that data is consistently available for processing without manual intervention.

2
Utilize multiple database slaves to parallelize data ingestion tasks.
By distributing the workload across multiple slaves, you can significantly decrease the time it takes to ingest large datasets, which is crucial for high-volume applications.

3
Incorporate failover handling in your data ingestion processes.
This ensures that data ingestion continues seamlessly even during database failovers, reducing downtime and maintaining data integrity.

Common Pitfalls

1

Failing to account for database schema changes can lead to ingestion failures.

If the extraction process relies on hardcoded schema definitions, any changes in the database schema without corresponding code updates will cause the ingestion process to break.

2

Not handling failovers properly can disrupt data ingestion.

Without a mechanism to manage failovers, the ingestion process may stop or lead to data inconsistencies, impacting the overall reliability of the data pipeline.