Aria Presto: Making table scan more efficient

Aria is a set of initiatives to dramatically increase PrestoDB efficiency. Our goal is to achieve a 2-3x decrease in CPU time for Hive queries against tables stored in ORC format. For Aria, we are …

Maria Basmanova
6 min read, advanced

Overview

The article discusses Aria, a set of initiatives aimed at enhancing PrestoDB efficiency, particularly focusing on optimizing table scans for Hive queries on data stored in ORC format. Key strategies include subfield pruning, adaptive filter ordering, and efficient row skipping, which collectively aim for a 2-3x reduction in CPU time.

What You'll Learn

1. How to implement subfield pruning to enhance query performance
2. Why adaptive filter ordering can reduce CPU cycles in queries
3. When to apply efficient row skipping for better resource management

Key Questions Answered

What are the main strategies for optimizing table scans in PrestoDB?
The article outlines three main strategies for optimizing table scans in PrestoDB: subfield pruning, which extracts only the needed elements of complex-typed columns; adaptive filter ordering, which chooses the order in which filters are applied to minimize CPU usage; and efficient row skipping, which avoids reading values from rows that earlier filters have already rejected. Together, these strategies aim to significantly reduce CPU time for queries.
How does Aria improve the efficiency of Hive queries?
Aria improves the efficiency of Hive queries by reducing the CPU time spent in table scans, which account for nearly 60% of global Presto CPU time. Using techniques like subfield pruning and adaptive filter ordering, it targets a 2-3x decrease in CPU time for queries against ORC-formatted data.
What is the impact of the new scan architecture on query performance?
The new scan architecture shifts filter evaluation from the engine to the Hive connector, allowing for more efficient data extraction. This change enables subfield pruning and improves overall query performance by reducing the amount of data processed, as demonstrated by a prototype showing approximately 20% CPU gains on a sample production workload.
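
To make subfield pruning concrete, here is a minimal sketch in Python. It is not Presto's implementation: the function name `scan_with_subfield_pruning`, the `required` mapping, and the sample rows are all hypothetical, and plain dictionaries stand in for ORC data. A real reader would avoid decoding the unneeded elements at all, whereas this sketch only shows the shape of the output.

```python
from typing import Any, Dict, List, Optional, Set

Row = Dict[str, Any]


def scan_with_subfield_pruning(
    rows: List[Row],
    required: Dict[str, Optional[Set[str]]],
) -> List[Row]:
    """Return only the columns and map keys that the query references.

    `required` maps a column name to the set of map keys the query needs,
    or to None when the whole column is needed (hypothetical convention).
    Columns absent from `required` are not returned at all.
    """
    pruned_rows = []
    for row in rows:
        pruned: Row = {}
        for column, keys in required.items():
            value = row[column]
            if keys is None:
                # Scalar or fully referenced column: pass it through.
                pruned[column] = value
            else:
                # Complex column: keep only the referenced map keys.
                pruned[column] = {k: v for k, v in value.items() if k in keys}
        pruned_rows.append(pruned)
    return pruned_rows


# Hypothetical table with a wide map column and an unused payload column.
rows = [
    {"id": 1, "features": {"age": 31, "clicks": 12, "country": "NO"}, "payload": "x" * 100},
    {"id": 2, "features": {"clicks": 3, "country": "SE"}, "payload": "y" * 100},
]

# Equivalent in spirit to: SELECT id, features['clicks'] FROM t
required = {"id": None, "features": {"clicks"}}
result = scan_with_subfield_pruning(rows, required)
```

Here the engine receives two small rows holding only `id` and the `clicks` entry, instead of the full map and the unused `payload` column.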

Key Statistics & Figures

CPU time reduction: 2-3x. Targeted decrease for Hive queries against tables stored in ORC format.
CPU gains from prototype: ~20 percent. Observed in overall query performance on a small sample of production workload.
Global Presto CPU time attributed to table scan: 60 percent. Indicates how much of overall resource usage depends on table scan efficiency.

Technologies & Tools

Backend: PrestoDB. Used for executing queries against data stored in ORC and Parquet formats.
Data Storage: ORC format. The primary format for data storage being optimized in the article.

Key Actionable Insights

1. Implementing subfield pruning can significantly enhance query performance by reducing the amount of data processed. This is particularly useful where complex data types are used, because only the necessary elements are extracted from ORC files.
2. Adopting adaptive filter ordering can lead to substantial CPU savings in query execution. Applying the filters that drop the most rows for the fewest CPU cycles first minimizes the data extracted from ORC files, which is crucial for resource usage in large-scale data environments.
3. Efficient row skipping is essential for optimizing data reads in PrestoDB. It prevents unnecessary reads of irrelevant data, saving CPU cycles and improving query execution speed, especially on large datasets.
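
Adaptive filter ordering can be sketched as follows. This is an illustration under stated assumptions rather than the Aria code: `TrackedFilter`, `filter_batch`, and the fixed `cost_per_row` field are hypothetical, and a real reader would measure CPU time instead of taking the cost as a given number. After each batch, filters are re-sorted by rows dropped per unit of cost, so the most effective filter runs first on the next batch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class TrackedFilter:
    name: str
    predicate: Callable[[Row], bool]
    cost_per_row: float  # assumed relative CPU cost; a real reader would measure it
    rows_in: int = 0
    rows_dropped: int = 0

    def score(self) -> float:
        """Rows dropped per unit of CPU spent; higher means run earlier."""
        if self.rows_in == 0:
            return 0.0
        return self.rows_dropped / (self.rows_in * self.cost_per_row)


def filter_batch(batch: List[Row], filters: List[TrackedFilter]) -> List[Row]:
    """Apply filters in their current order, then reorder them for the next batch."""
    survivors = []
    for row in batch:
        keep = True
        for f in filters:
            f.rows_in += 1
            if not f.predicate(row):
                f.rows_dropped += 1
                keep = False
                break  # later filters never see a row that was already dropped
        if keep:
            survivors.append(row)
    # Adapt: the filter that drops the most rows per unit cost goes first.
    filters.sort(key=lambda f: f.score(), reverse=True)
    return survivors


# Hypothetical filters: one drops nothing, the other drops 9 rows out of 10.
unselective = TrackedFilter("x_non_negative", lambda r: r["x"] >= 0, cost_per_row=1.0)
selective = TrackedFilter("y_equals_7", lambda r: r["y"] == 7, cost_per_row=1.0)
filters = [unselective, selective]

batch = [{"x": i, "y": i} for i in range(10)]
first = filter_batch(batch, filters)   # both filters run on every row
second = filter_batch(batch, filters)  # the selective filter now runs first
```

In the second batch the unselective filter is evaluated only on the single row that survives the selective one, which is where the CPU savings come from.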

Common Pitfalls

1. Failing to implement efficient row skipping can lead to unnecessary CPU usage. This happens when queries read all values in a column even if only a few match the filter, resulting in wasted resources. To avoid this, ensure that row skipping is properly implemented in the data reading process.
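
The effect of row skipping can be shown with a small Python sketch. The `CountingColumn` class and `scan_with_row_skipping` function are hypothetical stand-ins for a columnar reader, not Presto or ORC APIs. The filter column is evaluated first, and the remaining columns are read only at the positions that passed.

```python
from typing import Any, Callable, Iterable, List, Tuple


class CountingColumn:
    """In-memory stand-in for a column stream that counts how many values are read."""

    def __init__(self, values: Iterable[Any]) -> None:
        self._values = list(values)
        self.values_read = 0

    def __len__(self) -> int:
        return len(self._values)

    def read(self, position: int) -> Any:
        self.values_read += 1
        return self._values[position]


def scan_with_row_skipping(
    filter_column: CountingColumn,
    predicate: Callable[[Any], bool],
    payload_columns: List[CountingColumn],
) -> List[Tuple[Any, ...]]:
    # Step 1: evaluate the filter on its own column to find matching positions.
    matching = [
        pos for pos in range(len(filter_column))
        if predicate(filter_column.read(pos))
    ]
    # Step 2: read the other columns only at those positions, skipping the rest.
    return [tuple(col.read(pos) for col in payload_columns) for pos in matching]


# Hypothetical data: 1000 rows, of which 10 pass the filter.
ids = CountingColumn(range(1000))
names = CountingColumn(f"row-{i}" for i in range(1000))
matches = scan_with_row_skipping(ids, lambda v: v % 100 == 0, [names])
```

The filter column is read in full (1000 values), but the payload column is read only 10 times. Reading all 1000 payload values and discarding 990 of them is the pitfall described above.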

Related Concepts

Data Optimization Techniques
Query Performance Enhancement
Data Storage Formats