Presto at Pinterest

Pinterest Engineering
15 min readintermediate
--
View Original

Overview

The article discusses Pinterest's implementation of Presto, an open-source distributed SQL query engine, detailing the challenges faced and solutions developed to manage large-scale data analysis. It highlights the architecture, deployment, and operational strategies that enable Pinterest to efficiently process petabytes of data for various analytical needs.

What You'll Learn

1

How to effectively deploy Presto for large-scale data analysis

2

Why separating ad-hoc and scheduled queries improves resource management

3

How to detect and resolve slow worker issues in Presto clusters

4

When to implement graceful shutdown procedures for Presto clusters

Prerequisites & Requirements

  • Understanding of SQL and distributed systems
  • Familiarity with AWS services, particularly EC2 and S3(optional)

Key Questions Answered

What challenges did Pinterest face while deploying Presto?
Pinterest encountered several challenges while deploying Presto, including coordinator crashes, slow workers, and issues with unbalanced resources across clusters. These challenges required innovative solutions such as implementing a Presto Controller for health checks and resource management.
How does Pinterest manage its Presto clusters?
Pinterest manages its Presto clusters by using a combination of dedicated AWS EC2 instances and Kubernetes pods, allowing for efficient resource allocation and scaling. The deployment includes a Presto Controller for monitoring and a Presto Gateway for routing queries.
What is the role of the Presto Controller at Pinterest?
The Presto Controller is a critical in-house service that performs health checks, detects slow workers, manages heavy queries, and facilitates rolling restarts and scaling of Presto clusters. This enhances the reliability and performance of the Presto deployment.
What improvements have been made to handle large Thrift schemas?
To manage large and deeply nested Thrift schemas, Pinterest implemented a system where Thrift schema Java archives are loaded at service start time, significantly reducing the memory load on the Presto coordinator and stabilizing its performance.

Key Statistics & Figures

Total data processed
Hundreds of petabytes
This scale of data processing is crucial for supporting Pinterest's analytical needs across various teams.
Presto cluster size
450 r4.8xl EC2 instances with over 100 TBs of memory and 14K vcpu cores
This infrastructure supports around 1,000 monthly active users running approximately 400K queries each month.
P90 query latency
Less than five minutes
This performance metric is critical for ensuring timely data analysis and decision-making at Pinterest.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend
Presto
Used as the distributed SQL query engine for data analysis.
Cloud Infrastructure
AWS EC2
Provides the compute resources for running Presto clusters.
Cloud Storage
AWS S3
Serves as the data storage layer for Presto, allowing for separation of compute and storage.
Container Orchestration
Kubernetes
Facilitates dynamic scaling of Presto workers.
Stream Processing
Kafka
Used for logging queries submitted to Presto clusters.

Key Actionable Insights

1
Implement a separation between ad-hoc and scheduled queries to enhance performance and predictability.
By keeping these two types of queries in distinct clusters, Pinterest can provide better service level agreements (SLAs) for scheduled queries, which is crucial for maintaining operational efficiency.
2
Utilize a Presto Controller to automate health checks and resource management.
This proactive approach helps in identifying slow workers and heavy queries, allowing for timely interventions that can prevent larger issues within the Presto clusters.
3
Adopt Kubernetes for dynamic scaling of Presto workers to optimize resource usage.
Kubernetes allows for quick adjustments to worker counts based on demand, which is essential for maintaining performance during peak usage times.

Common Pitfalls

1
Failing to implement graceful shutdown procedures can lead to abrupt failures for clients.
Without a graceful shutdown, clients may experience sudden query failures, prompting them to retry and encounter the same issues. Implementing a controlled shutdown process can mitigate this risk.
2
Neglecting to monitor slow workers can cause performance degradation across the cluster.
If slow workers are not detected and addressed promptly, they can affect the performance of other queries, leading to a cascading effect of slowdowns throughout the Presto environment.

Related Concepts

Distributed SQL Query Engines
Data Warehousing
Cloud-based Data Processing
Resource Management In Cloud Environments