Powering big data at Pinterest

Pinterest Engineering
9 min read, beginner level

Overview

The article discusses how Pinterest manages its big data infrastructure, detailing the evolution from a single-cluster Hadoop setup to a self-serve platform that supports extensive data processing needs. It highlights the challenges Pinterest faced and the solutions it implemented to optimize data handling and improve the user experience.

What You'll Learn

1. How to implement a self-serve platform for Hadoop
2. Why decoupling compute and storage can enhance big data applications
3. How to effectively manage Hadoop dependencies using layered approaches

Prerequisites & Requirements

  • Understanding of Hadoop and big data concepts
  • Familiarity with AWS and S3 (optional)

Key Questions Answered

What challenges does Pinterest face with big data management?
Pinterest faces challenges in scaling its data infrastructure to handle over 30 billion Pins and log 20 terabytes of new data daily. Its personalized discovery engine depends on efficient data processing and experimentation capabilities.
How does Pinterest utilize Hadoop for big data processing?
Pinterest uses Hadoop to process vast amounts of data, enabling features like Related Pins and Guided Search. The platform supports thousands of daily metrics and rigorous experimentation, ensuring relevant content is presented to users.
What are the key features of Pinterest's self-serve Hadoop platform?
The self-serve Hadoop platform includes isolated multitenancy, elasticity for batch processing, multi-cluster support, and an access control layer. These features allow developers to customize jobs without impacting others and scale resources as needed.
Why did Pinterest migrate to Qubole for Hadoop services?
Pinterest migrated to Qubole due to its stability at scale, support for AWS/S3, and features like horizontal scalability, responsive support, and integration with Hive. This transition improved throughput by 30%-60% compared to previous systems.

Key Statistics & Figures

Daily data logged
20 terabytes
This volume of data is crucial for maintaining Pinterest's personalized discovery engine.
Total data in S3
10 petabytes
This large dataset supports various features and functionalities across the platform.
Jobs processed daily
2,000 jobs
Over 100 regular MapReduce users leverage the platform for their data processing needs.
Throughput improvement
30%-60%
This improvement was achieved after migrating to Qubole, enhancing overall data processing efficiency.

Technologies & Tools

Backend
Hadoop
Used for processing and storing large datasets.
Backend
Hive
Provides a SQL-like interface for querying data in Hadoop.
Hadoop As A Service
Qubole
Facilitates scalable and efficient data processing on AWS.
Storage
S3
Used for storing large volumes of data and supporting Hadoop operations.
Configuration Management
Puppet
Automates the configuration and management of Hadoop nodes.

Key Actionable Insights

1. Implement a multi-cluster Hadoop architecture to enhance data processing capabilities.
   This setup allows for better isolation and resource allocation, which is crucial for handling diverse workloads and maintaining performance across various applications.
2. Utilize a centralized Hive metastore for efficient data management.
   A centralized metastore simplifies data cataloging and improves performance by providing a consistent interface for accessing data, which is essential for large-scale data operations.
3. Consider using Qubole for Hadoop as a Service to reduce operational overhead.
   Qubole's integration with AWS and support for spot instances can lead to significant cost savings and improved resource management, making it a viable option for organizations looking to scale their big data operations.
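
The first two insights above (several isolated clusters, one shared Hive metastore) can be illustrated with a minimal sketch. This is not Pinterest's implementation: the cluster names, node limits, the `route_job` helper, and the metastore URI are all hypothetical, chosen only to show how workloads might map to separate clusters that share a single catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """A compute cluster; table metadata lives in the shared metastore."""
    name: str
    max_nodes: int
    metastore_uri: str


# Hypothetical URI: every cluster points at the same Hive metastore, so a
# table defined once is visible from all of them.
SHARED_METASTORE = "thrift://metastore.example.internal:9083"

# Hypothetical workload-to-cluster mapping; the names and sizes are
# illustrative assumptions, not figures from the article.
CLUSTERS = {
    "adhoc": Cluster("adhoc", 50, SHARED_METASTORE),
    "production": Cluster("production", 200, SHARED_METASTORE),
}


def route_job(workload: str) -> Cluster:
    """Return the cluster assigned to a workload type."""
    try:
        return CLUSTERS[workload]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload!r}") from None


if __name__ == "__main__":
    print(route_job("adhoc"))
```

Because the data sits in S3 and the metadata in one metastore, a design like this lets an ad hoc cluster be resized or replaced without affecting production jobs.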

Common Pitfalls

1. Failing to account for the limitations of traditional Hadoop setups can lead to inefficiencies.
   Many organizations underestimate the need for a self-serve platform and multi-cluster support, which can hinder scalability and performance.
2. Neglecting to implement proper access control can expose sensitive data.
   Without a robust access control layer, organizations risk unauthorized access to data, which can lead to compliance issues and data breaches.
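
To make the second pitfall concrete, here is a minimal sketch of a prefix-based read check. The summary does not describe how Pinterest's access control layer works, so the roles, bucket name, and `can_read` helper below are hypothetical assumptions. A real deployment would more likely rely on the cloud provider's own policy mechanisms.

```python
# Hypothetical role-to-prefix grants. Each prefix ends with "/" so that
# "logs/" does not accidentally match a sibling such as "logs-private/".
ACL = {
    "analyst": ("s3://example-bucket/public/",),
    "engineer": ("s3://example-bucket/public/", "s3://example-bucket/logs/"),
}


def can_read(role: str, path: str) -> bool:
    """Return True if the role is granted a prefix that covers the path."""
    return any(path.startswith(prefix) for prefix in ACL.get(role, ()))


if __name__ == "__main__":
    print(can_read("analyst", "s3://example-bucket/logs/part-0"))
```

Unknown roles receive no grants, so the check denies by default, which is the safer failure mode for sensitive data.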

Related Concepts

Big Data Management
Hadoop Architecture
Data Processing Frameworks
Cloud Storage Solutions