Shopify's Path to a Faster Trino Query Execution: Infrastructure

We’ll discuss how we scaled our interactive query infrastructure to handle the rapid growth of our datasets, while enabling a query execution time of less than 5 seconds.

Matthew Bruce
12 min read · advanced

Overview

The article discusses Shopify's efforts to enhance the performance of Trino, a distributed SQL query engine, to provide faster query execution times for data scientists. It details the challenges faced due to high data volumes and the solutions implemented to achieve a P95 query latency of less than five seconds.

What You'll Learn

  1. How to optimize Trino for faster query execution times
  2. Why separating workloads into specific clusters can improve performance
  3. How to analyze query performance issues using metrics and logs
  4. When to apply JVM tuning settings for better performance

Prerequisites & Requirements

  • Understanding of distributed SQL query engines and data processing
  • Familiarity with Kubernetes and monitoring tools like Datadog (optional)

Key Questions Answered

What infrastructure changes did Shopify implement to improve Trino's performance?
Shopify created separate clusters for scheduled and ad-hoc queries, optimized JVM settings, and limited the maximum number of concurrent queries to reduce lock contention. Together, these changes significantly reduced query execution times.

How did Shopify achieve a P95 query latency of less than five seconds?
By analyzing query performance and resource usage, Shopify identified bottlenecks and applied targeted optimizations: scaling worker pods, tuning JVM settings, and creating workload-specific clusters. The combination led to a dramatic decrease in query execution times.

What were the main performance issues faced by Shopify's Trino deployment?
The main issues were inconsistent query execution times under high load, timeout errors, and resource contention among queries. Resolving them required a detailed analysis of query patterns and resource allocation.

What role did JVM settings play in optimizing Trino's performance?
JVM settings were critical because they affected how Trino's query code ran on the JVM. By adjusting them, Shopify prevented methods from running in the slower JVM interpreter, which improved query throughput and reduced execution time.
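
To make the JVM point concrete, here is a minimal sketch that checks a Trino `jvm.config` for a set of expected options. This summary does not say which options Shopify changed, so the list below is an assumption: it uses the two recompilation-cutoff flags that Trino's deployment documentation has recommended. `missing_jvm_flags` and the sample config are hypothetical and not part of Trino.

```python
def missing_jvm_flags(jvm_config_text, expected_flags):
    """Return the expected flags that do not appear as a line in jvm.config."""
    present = set()
    for raw_line in jvm_config_text.splitlines():
        line = raw_line.strip()
        if line:  # jvm.config holds one option per line; skip blank lines
            present.add(line)
    return [flag for flag in expected_flags if flag not in present]


# Assumption: flags taken from Trino's recommended JVM config, not from this summary.
EXPECTED_FLAGS = [
    "-XX:PerMethodRecompilationCutoff=10000",
    "-XX:PerBytecodeRecompilationCutoff=10000",
]

# Made-up sample config for illustration only.
SAMPLE_CONFIG = """
-server
-Xmx16G
-XX:+UseG1GC
-XX:PerMethodRecompilationCutoff=10000
"""

missing = missing_jvm_flags(SAMPLE_CONFIG, EXPECTED_FLAGS)
print("Missing flags:", missing)
```

The check is an exact line match, so a flag that is present with a different value is also reported as missing. That is deliberate for an audit script, where a drifted value is as interesting as an absent one.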

Key Statistics & Figures

P95 query execution time: less than five seconds
Achieved after implementing infrastructure changes and optimizations.

Data handling capacity: 15 Gbps and over 300 million rows of data per second
This high data volume motivated the improvements to the Trino infrastructure.

Reduction in execution latency: 30 times decrease
The targeted optimizations aimed for a P95 query latency far below previous levels.
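
Since the headline figure is a P95 latency, a small worked example may help. The sketch below computes a percentile with the nearest-rank method over a list of query durations in seconds. The durations are made-up sample data, and `percentile_nearest_rank` is a hypothetical helper; monitoring tools may use a different interpolation method and report slightly different values.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up query durations in seconds; one slow outlier at 42 s.
durations = [0.8, 1.2, 0.5, 2.0, 3.1, 0.9, 1.7, 4.2, 0.6, 1.1,
             2.4, 0.7, 1.5, 3.8, 0.4, 1.0, 2.9, 1.3, 0.3, 42.0]

p95 = percentile_nearest_rank(durations, 95)
print(f"P95 = {p95} s, under 5 s target: {p95 < 5}")
```

With 20 samples the 95th percentile is the 19th smallest value, so the single 42-second outlier does not move it. This is why a P95 target tolerates a small tail of slow queries while still constraining the typical experience.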

Key Actionable Insights

  1. Implement workload-specific clusters to enhance query performance.
     By separating workloads into dedicated clusters, you can minimize contention and ensure that heavy queries do not impact the performance of lighter ones, leading to more consistent execution times.
  2. Regularly analyze query performance metrics to identify bottlenecks.
     Using tools like Datadog to monitor query performance can help you quickly identify issues such as lock contention or resource starvation, allowing for timely optimizations.
  3. Optimize JVM settings based on workload characteristics.
     Tuning JVM options can significantly impact performance, especially for data-intensive applications. Ensure that you are using the latest recommended settings to avoid performance degradation.
  4. Limit the number of concurrent queries to reduce resource contention.
     Setting a hard concurrency limit can help balance the load on your query engine, preventing overload situations that lead to slow query execution and timeouts.
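
The hard concurrency limit from the last insight can be illustrated with a small admission-control sketch. It is a toy model, not Trino's implementation: `AdmissionController` is a hypothetical class, and its parameter names only loosely mirror the `hardConcurrencyLimit` and `maxQueued` properties of Trino resource groups. The limits used here are arbitrary.

```python
from collections import deque


class AdmissionController:
    """Toy model: run up to a hard limit, queue a bounded overflow, reject the rest."""

    def __init__(self, hard_concurrency_limit, max_queued):
        self.hard_concurrency_limit = hard_concurrency_limit
        self.max_queued = max_queued
        self.running = set()
        self.queued = deque()

    def submit(self, query_id):
        """Admit, queue, or reject a query and return its resulting state."""
        if len(self.running) < self.hard_concurrency_limit:
            self.running.add(query_id)
            return "running"
        if len(self.queued) < self.max_queued:
            self.queued.append(query_id)
            return "queued"
        return "rejected"

    def finish(self, query_id):
        """Mark a query as done and promote the oldest queued query, if any."""
        self.running.discard(query_id)
        if self.queued and len(self.running) < self.hard_concurrency_limit:
            promoted = self.queued.popleft()
            self.running.add(promoted)
            return promoted
        return None


# Arbitrary limits for illustration.
controller = AdmissionController(hard_concurrency_limit=2, max_queued=1)
states = [controller.submit(q) for q in ("q1", "q2", "q3", "q4")]
promoted = controller.finish("q1")
print(states, promoted)
```

Queries beyond the limit wait instead of competing for the same resources, which trades some queueing delay for more predictable execution once a query starts.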

Common Pitfalls

  1. Failing to separate workloads can lead to performance degradation.
     When heavy queries run alongside lighter ones in the same cluster, it can cause delays and timeouts. To avoid this, create dedicated clusters for different types of workloads.
  2. Ignoring JVM settings can result in suboptimal performance.
     Not tuning JVM settings according to workload requirements can lead to slower query execution times. Regularly review and update these settings based on the latest recommendations.
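
The workload separation described above can be sketched as a simple routing rule that sends each query to a cluster based on where it came from. Everything named here is hypothetical: the cluster URLs, the source labels, and the `pick_cluster` helper are placeholders, and the summary does not describe how Shopify actually routes queries.

```python
# Hypothetical cluster endpoints, one per workload type.
CLUSTERS = {
    "scheduled": "https://trino-scheduled.example.internal",
    "adhoc": "https://trino-adhoc.example.internal",
}

# Hypothetical client source labels that identify scheduled jobs.
SCHEDULED_SOURCES = {"scheduler", "batch-job"}


def pick_cluster(source):
    """Return (workload, cluster_url) for a query based on its client source label."""
    workload = "scheduled" if source.lower() in SCHEDULED_SOURCES else "adhoc"
    return workload, CLUSTERS[workload]


for source in ("scheduler", "notebook", "Batch-Job"):
    print(source, "->", pick_cluster(source))
```

Defaulting unknown sources to the ad-hoc cluster keeps interactive users from landing on the cluster that runs heavy scheduled jobs; the opposite default would be a reasonable choice if protecting ad-hoc latency mattered more than convenience.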

Related Concepts

Distributed SQL Query Engines
Performance Optimization Techniques
Kubernetes for Scalable Infrastructure
Monitoring Tools for Performance Analysis