Interactive Querying with Apache Spark SQL at Pinterest

Pinterest Engineering


Apache, Apache Spark, AWS, AWS S3, JSON, Machine Learning, Python, REST API, SQL, Thrift

Overview

This article describes how Pinterest uses Apache Spark SQL for interactive querying: the architecture, the challenges encountered, and the solutions built to improve the user experience. It covers the transition from Hive to Spark SQL and why efficient ad-hoc querying matters for data-driven decision-making.

What You'll Learn

1. How to implement interactive querying with Apache Spark SQL
2. Why using Apache Livy enhances interactive querying experiences
3. How to optimize DDL queries for faster execution

Prerequisites & Requirements

Understanding of Apache Spark and SQL querying
Familiarity with Apache Livy (optional)

Key Questions Answered

What are the differences between scheduled and interactive querying at Pinterest?

Scheduled queries run on a predefined cadence under strict Service Level Objectives (SLOs), while interactive queries run on demand with no set schedule. Because users wait for an interactive query to finish, the platform's requirements for failure handling and performance differ from those for scheduled workloads.

How does Pinterest handle DDL queries to reduce latency?

Pinterest implemented a local session pool in Apache Livy to handle DDL queries, so they run without the overhead of allocating containers on the YARN cluster. This change reduced DDL query latency from 70 seconds to an average of 10 seconds.
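
As a rough illustration of the routing idea, the sketch below sends metadata-style statements to a local pool and everything else to the cluster. The keyword list, the function names, and the `"local"`/`"cluster"` labels are assumptions made for this example; the source does not describe how Pinterest actually classifies queries.

```python
import re

# Leading keywords treated as DDL/metadata statements in this sketch.
# This list is an assumption, not Pinterest's actual classification.
DDL_KEYWORDS = {"CREATE", "DROP", "ALTER", "SHOW", "DESCRIBE", "DESC", "USE", "MSCK"}


def first_keyword(sql: str) -> str:
    """Return the first SQL keyword, skipping blank lines and `--` comment lines."""
    for line in sql.splitlines():
        line = line.strip()
        if line and not line.startswith("--"):
            match = re.match(r"[A-Za-z]+", line)
            return match.group(0).upper() if match else ""
    return ""


def choose_pool(sql: str) -> str:
    """Pick 'local' for pure metadata statements, 'cluster' otherwise.

    Any statement containing SELECT (for example CREATE TABLE ... AS SELECT)
    is conservatively sent to the cluster because it may scan data.
    """
    is_ddl = first_keyword(sql) in DDL_KEYWORDS
    reads_data = re.search(r"\bSELECT\b", sql, re.IGNORECASE) is not None
    return "local" if is_ddl and not reads_data else "cluster"


if __name__ == "__main__":
    for query in ("SHOW TABLES", "SELECT * FROM pins", "CREATE TABLE a AS SELECT * FROM b"):
        print(query, "->", choose_pool(query))
```

Defaulting to the cluster for anything ambiguous keeps the local pool reserved for statements that only touch metadata.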

What challenges did Pinterest face with interactive querying and how were they addressed?

Challenges included seamless query submission, fast metadata queries, and error handling. The solutions were a generic, DB-API-compliant Python client called BigPy; a local session pool for DDL queries; and error messages enriched with automatic troubleshooting information.
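
The source names BigPy only as a generic, DB-API-compliant Python client and does not show its interface. The sketch below is a hypothetical illustration of what a DB-API-style cursor could look like on top of Livy's statements REST endpoints (`POST /sessions/{id}/statements`, then polling `GET /sessions/{id}/statements/{statementId}`). `LivyCursor`, the injected `transport` callable, and the use of `RuntimeError` are assumptions for illustration, and the response shape assumes a Livy session of kind `sql`.

```python
import time


class LivyCursor:
    """Minimal DB-API-style cursor (hypothetical; not BigPy's real interface).

    `transport(method, path, body)` must return the parsed JSON response.
    """

    def __init__(self, transport, session_id, poll_seconds=1.0):
        self._transport = transport
        self._session_id = session_id
        self._poll_seconds = poll_seconds
        self._rows = []
        self.description = None  # PEP 249: one 7-item tuple per column
        self.rowcount = -1

    def execute(self, sql):
        base = f"/sessions/{self._session_id}/statements"
        stmt = self._transport("POST", base, {"code": sql})
        # Poll until the statement leaves the in-progress states.
        while stmt["state"] in ("waiting", "running"):
            time.sleep(self._poll_seconds)
            stmt = self._transport("GET", f"{base}/{stmt['id']}", None)
        output = stmt.get("output") or {}
        if stmt["state"] != "available" or output.get("status") != "ok":
            # A real client would raise a PEP 249 exception class here.
            raise RuntimeError(output.get("evalue", f"statement ended in state {stmt['state']}"))
        payload = output["data"]["application/json"]
        self.description = [
            (f["name"], f["type"], None, None, None, None, f.get("nullable"))
            for f in payload["schema"]["fields"]
        ]
        self._rows = [tuple(row) for row in payload["data"]]
        self.rowcount = len(self._rows)
        return self

    def fetchone(self):
        return self._rows.pop(0) if self._rows else None

    def fetchall(self):
        rows, self._rows = self._rows, []
        return rows
```

A real transport would wrap an HTTP client and return the decoded JSON body; injecting it keeps the cursor testable without a running Livy server.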

Key Statistics & Figures

Reduction in DDL query latency

from 70 seconds to 10 seconds

This improvement was achieved through the implementation of a local session pool in Apache Livy.

Uptime SLO for Livy

99.5%

This uptime target matters because Livy handles approximately 1,500 ad-hoc Spark SQL queries daily.
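
To make the 99.5% figure concrete, the short calculation below converts an uptime SLO into an allowed-downtime budget. The 30-day measurement window is an assumption; the source does not state the window over which the SLO is measured.

```python
def downtime_budget_hours(slo: float, window_days: int = 30) -> float:
    """Hours of downtime permitted by an uptime SLO over the given window."""
    return (1.0 - slo) * window_days * 24


if __name__ == "__main__":
    # 0.5% of a 30-day window (720 hours) is about 3.6 hours.
    print(f"{downtime_budget_hours(0.995):.1f} hours")
```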

Technologies & Tools

Backend

Apache Spark SQL

Used for querying large datasets at Pinterest.

Backend

Apache Livy

Facilitates interaction with Spark clusters over a RESTful interface.

Backend

YARN

Resource management for running Spark applications.

Key Actionable Insights

1. Implement a local session pool for handling DDL queries to improve performance.
This approach significantly reduces latency by avoiding the overhead of container allocation, making it ideal for environments where quick metadata operations are essential.

2. Utilize Apache Livy for managing Spark sessions to enhance interactive querying.
Livy's ability to handle multiple sessions and provide failure isolation makes it a robust choice for interactive applications, improving user experience.

3. Integrate automatic error handling and troubleshooting information into your querying platform.
Providing users with immediate feedback on common errors can streamline the debugging process, making it easier for them to resolve issues quickly.
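
One way to attach troubleshooting information to failures is to match known error patterns and append a hint to the message shown to the user. The sketch below is a minimal illustration: the patterns are common Spark and Hadoop error strings, while the hint text, rule list, and function name are assumptions rather than Pinterest's actual rules.

```python
import re

# (pattern, hint) pairs; the hints are illustrative wording, not from the source.
TROUBLESHOOTING_RULES = [
    (re.compile(r"Table or view not found", re.IGNORECASE),
     "Check the database and table name, and that the table exists in the metastore."),
    (re.compile(r"OutOfMemoryError|exceeding memory limits", re.IGNORECASE),
     "Reduce the data scanned (filters, partitions) or request more executor memory."),
    (re.compile(r"Permission denied|AccessControlException", re.IGNORECASE),
     "Verify that your account has read access to the underlying data location."),
]


def annotate_error(message: str) -> str:
    """Append every matching troubleshooting hint to a raw error message."""
    hints = [hint for pattern, hint in TROUBLESHOOTING_RULES if pattern.search(message)]
    if not hints:
        return message
    return message + "\n\nTroubleshooting:\n" + "\n".join(f"- {hint}" for hint in hints)


if __name__ == "__main__":
    print(annotate_error("AnalysisException: Table or view not found: db.pins"))
```

Keeping the rules in a plain list makes it easy to add a new hint whenever a recurring failure is identified.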

Common Pitfalls

1. Relying solely on cluster mode for all query types can lead to significant latency.
Cluster mode is not optimized for DDL queries, which can be executed more efficiently in local mode. Avoiding this mistake can lead to faster query execution times.

2. Not providing adequate error handling and troubleshooting information can frustrate users.
Users may struggle to diagnose issues without clear feedback on query failures. Implementing automatic error reporting can greatly enhance user experience.

Related Concepts

Apache Spark

Apache Livy

Data Engineering

Big Data Query Platforms