Shopify's Path to a Faster Trino Query Execution: Custom Verification, Benchmarking, and Profiling Tooling

Shopify’s Data Reliability team built custom verification, benchmarking, and profiling tooling for testing and analyzing Trino. Our tooling is designed to minimize the risk of rolling out changes, such as configuration experiments and software updates, at scale.

Noam Hacker
13 min read · intermediate

Overview

Shopify's Data Reliability team developed custom verification, benchmarking, and profiling tooling to enhance Trino query execution speed and reliability. The article details the challenges faced in managing a large Trino cluster and the solutions implemented to ensure fast query results while minimizing risks associated with system changes.

What You'll Learn

  1. How to implement a verification framework for Trino using PyTest
  2. Why benchmarking is essential for evaluating performance changes in Trino
  3. How to profile Trino queries to optimize performance at scale

Prerequisites & Requirements

  • Understanding of distributed SQL query engines like Trino
  • Familiarity with Python and testing frameworks like PyTest

Key Questions Answered

How does Shopify ensure fast query results with Trino?
Shopify ensures fast query results by using custom verification, benchmarking, and profiling tooling developed by its Data Reliability team. This tooling reduces the risk of system changes and allows Trino configurations to be tested efficiently, in support of a target of p95 query results in five seconds or less.
What are the main concerns when managing a large Trino cluster?
The main concerns when managing a large Trino cluster include optimizing configurations through experiments and keeping up with software updates for new features and security patches. These changes require careful vetting to avoid disruptions to data scientists and engineers.
What methodologies did Shopify use for benchmarking Trino?
Shopify's benchmarking methodology involved running TPC-DS queries on sample datasets generated deterministically by Trino. This approach allows for consistent performance evaluation under standardized conditions across various scale factors, such as sf10 for 10 GB and sf1000 for 1 TB datasets.
How does Shopify profile Trino queries to optimize performance?
Shopify profiles Trino by running background queries whose individual results are ignored. This lets the team simulate high load without overwhelming its own resources and evaluate how a configuration performs under realistic conditions.
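
To make the idea of result-discarding background queries concrete, here is a minimal sketch in Python. It is not Shopify's tooling: `fake_run_query` is a hypothetical stand-in for a real Trino client call, and `generate_background_load` with its parameters is an illustrative name chosen for this example. The sketch only shows the general pattern of running many queries concurrently while keeping nothing but a completion count and the elapsed time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fake_run_query(sql):
    """Hypothetical stand-in for a Trino client call; returns dummy rows."""
    time.sleep(0.01)  # pretend the engine is doing work
    return [(1,)]


def generate_background_load(run_query, sql, workers, queries_per_worker):
    """Run the same query concurrently and discard every result.

    Only the number of completed queries and the wall-clock time are kept.
    """
    completed = 0
    lock = threading.Lock()

    def worker():
        nonlocal completed
        for _ in range(queries_per_worker):
            run_query(sql)  # result intentionally ignored
            with lock:
                completed += 1

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker) for _ in range(workers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    elapsed = time.perf_counter() - start
    return completed, elapsed


if __name__ == "__main__":
    done, seconds = generate_background_load(fake_run_query, "SELECT 1", 4, 5)
    print(f"{done} queries completed in {seconds:.2f}s")
```

Because the rows are dropped as soon as each call returns, the load generator's memory use stays flat no matter how much data the queries would produce.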

Key Statistics & Figures

  • p95 query results: five seconds or less. This is the target execution time at the 95th percentile of queries at Shopify.
  • Data handling capacity: over 500 Gbps. This is the data throughput Shopify's Trino cluster is designed to handle.
  • Trino cluster scale: hundreds of nodes and tens of thousands of virtual CPUs. This scale is necessary to manage the increasing volume of data and user queries.
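
As a worked illustration of the p95 target, the sketch below checks a list of query durations against the five-second figure. The nearest-rank percentile method, the function names, and the sample durations are assumptions made for this example. The article does not say how Shopify computes its percentile.

```python
import math

TARGET_P95_SECONDS = 5.0  # the target stated in the figures above


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_p95_target(durations_seconds, target=TARGET_P95_SECONDS):
    """True when 95% of the queries finished within the target."""
    return percentile_nearest_rank(durations_seconds, 95) <= target


if __name__ == "__main__":
    # Illustrative data: 19 fast queries and one slow outlier.
    durations = [0.5] * 19 + [30.0]
    print(meets_p95_target(durations))
```

With 20 samples the 95th percentile is the 19th smallest value, so a single slow outlier does not break the target, while two of them would.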

Key Actionable Insights

  1. Implement a structured verification framework using PyTest to streamline testing processes for Trino changes.
     This framework allows developers to focus on logic rather than underlying complexities, improving code readability and maintainability.
  2. Utilize benchmarking to establish performance baselines for Trino configurations, ensuring consistent evaluation across different environments.
     By standardizing benchmarking practices, teams can avoid inconsistencies that arise from ad-hoc testing methods, leading to more reliable performance assessments.
  3. Leverage profiling tools to simulate high-load scenarios and optimize resource allocation in Trino.
     Profiling helps identify performance bottlenecks and informs decisions on scaling infrastructure, particularly during peak usage times.
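
One possible shape for a PyTest-style verification check is sketched below. It assumes that a useful check is to compare the rows returned by a baseline and a candidate configuration. That is an illustration, not a description of Shopify's framework. The functions `run_on_baseline` and `run_on_candidate` are hypothetical stand-ins for real clients, and the query text is made up.

```python
from collections import Counter


def rows_match(baseline_rows, candidate_rows):
    """Compare two result sets as multisets, ignoring row order."""
    baseline = Counter(map(tuple, baseline_rows))
    candidate = Counter(map(tuple, candidate_rows))
    return baseline == candidate


# Hypothetical stand-ins for clients pointed at two cluster configurations.
def run_on_baseline(sql):
    return [(1, "a"), (2, "b")]


def run_on_candidate(sql):
    return [(2, "b"), (1, "a")]  # same rows, different order


# PyTest collects any function named test_* and treats a failed assert
# as a test failure, so the test body can stay focused on the logic.
def test_candidate_returns_same_rows_as_baseline():
    sql = "SELECT id, name FROM example_table"  # illustrative query
    assert rows_match(run_on_baseline(sql), run_on_candidate(sql))
```

Keeping the comparison in a small helper means each test reads as one line of intent, with connection handling and row normalization kept out of the test body.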

Common Pitfalls

  1. Failing to thoroughly vet changes to Trino can lead to significant disruptions for data scientists and engineers.
     Without a structured verification process, teams may encounter unexpected failures that require manual rollbacks, causing delays and inefficiencies.
  2. Inconsistent benchmarking practices can result in unreliable performance metrics.
     Using ad-hoc methods for benchmarking can lead to variations in results, making it difficult to draw accurate conclusions about performance changes.
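
To show what a fixed, repeatable benchmarking procedure might look like in contrast to ad-hoc timing, here is a small sketch. The warm-up count, the number of timed runs, the use of the median, and all function names are assumptions for illustration. The `run_query` argument stands for whatever client call executes a query, for example a TPC-DS query against a chosen scale factor.

```python
import statistics
import time


def benchmark(run_query, sql, warmup_runs=1, timed_runs=5):
    """Time a query with a fixed procedure: warm up, then take the median."""
    for _ in range(warmup_runs):
        run_query(sql)  # untimed, lets caches settle
    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run_query(sql)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_change(baseline_seconds, candidate_seconds):
    """Positive means the candidate is slower than the baseline."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds


if __name__ == "__main__":
    def fake_run_query(sql):
        time.sleep(0.01)  # hypothetical stand-in for a real client call

    baseline = benchmark(fake_run_query, "SELECT 1")
    candidate = benchmark(fake_run_query, "SELECT 1")
    print(f"relative change: {relative_change(baseline, candidate):+.1%}")
```

Running both configurations through the same function with the same parameters is what makes the two numbers comparable. The median is used here because it is less sensitive to a single noisy run than the mean.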

Related Concepts

  • Distributed SQL Query Engines
  • Performance Benchmarking
  • Query Profiling Techniques