Engineering Data Analytics with Presto and Apache Parquet at Uber

Zhenxiao Luo
10 min read · intermediate

Overview

This article describes how Uber uses Presto and Apache Parquet for data analytics, covering the architecture of its Presto ecosystem and the development of a new Parquet reader that improves query performance. It also details the challenges created by rapid growth and the optimizations made to data querying and analytics in response.

What You'll Learn

1. How to leverage Presto for scalable SQL querying on large datasets
2. Why columnar storage improves performance in data analytics
3. How to implement optimizations in a custom Parquet reader

Prerequisites & Requirements

  • Understanding of SQL and data analytics concepts
  • Familiarity with Presto and Apache Parquet (optional)

Key Questions Answered

How does Uber utilize Presto for data analytics?
Uber uses Presto as a distributed SQL engine to run analytic queries across multiple data sources, enabling efficient data processing and decision-making. The architecture includes a coordinator node and several worker nodes that execute tasks, allowing for scalability and high performance.
What are the benefits of using Apache Parquet at Uber?
Apache Parquet is chosen for its compression and encoding functionalities, which optimize storage and improve query performance. Its support for nested data sets allows Uber to efficiently manage and analyze large volumes of data, enhancing overall analytics capabilities.
What optimizations were made in Uber's new Parquet reader?
Uber's new Parquet reader implements optimizations such as nested column pruning, columnar reads, predicate pushdowns, and dictionary pushdowns. These enhancements allow for more efficient querying, reducing processing time and resource usage significantly compared to the original open source reader.
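
Two of the optimizations named above, predicate pushdown and dictionary pushdown, can be illustrated with a toy model. The following is a minimal sketch, not Uber's reader and not the Parquet API: `ColumnChunk`, `scan_equals`, and the `city_id` and `fare` columns are hypothetical names, and the per-chunk minimum/maximum and set of distinct values stand in for the statistics and dictionary pages a Parquet file keeps for each column chunk.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """One column's values within a row group (hypothetical stand-in)."""
    values: list

    @property
    def min_max(self):
        # Stand-in for the min/max statistics stored per column chunk.
        return min(self.values), max(self.values)

    @property
    def dictionary(self):
        # Stand-in for a dictionary page: the distinct values in this chunk.
        return set(self.values)


def scan_equals(row_groups, column, target):
    """Return (matching rows, number of row groups actually decoded)."""
    matches, decoded = [], 0
    for group in row_groups:
        chunk = group[column]
        lo, hi = chunk.min_max
        # Predicate pushdown: skip the row group if statistics rule the value out.
        if target < lo or target > hi:
            continue
        # Dictionary pushdown: skip if the value is absent from the dictionary.
        if target not in chunk.dictionary:
            continue
        decoded += 1
        for i, value in enumerate(chunk.values):
            if value == target:
                matches.append({name: col.values[i] for name, col in group.items()})
    return matches, decoded


# Illustrative data only: three row groups with two columns each.
row_groups = [
    {"city_id": ColumnChunk([1, 2, 3]), "fare": ColumnChunk([10.0, 12.5, 8.0])},
    {"city_id": ColumnChunk([10, 12, 14]), "fare": ColumnChunk([7.0, 9.0, 11.0])},
    {"city_id": ColumnChunk([20, 21, 22]), "fare": ColumnChunk([5.0, 6.0, 4.5])},
]

rows, decoded = scan_equals(row_groups, "city_id", 12)
print(rows, decoded)
```

In this sketch a lookup for `city_id == 12` decodes only the middle row group, and a lookup for `city_id == 11` decodes none: the value falls inside the middle group's min/max range but is missing from its dictionary. A real reader works on encoded pages rather than Python lists, so this shows only the skipping logic.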

Key Statistics & Figures

  • Number of analytic queries run daily: over one hundred thousand. This volume necessitated the development of a robust data querying system.
  • Speed improvement of data processing: 2-10x faster. This improvement was observed after implementing the new Parquet reader compared to the original open source reader.
  • Number of nodes in Presto cluster: over 300 nodes. This cluster is capable of accessing over five petabytes of data.

Key Actionable Insights

1. Implementing a columnar storage format like Parquet can drastically improve query performance for large datasets.
   By using columnar storage, you can reduce the amount of data scanned during queries, which leads to faster response times and lower resource consumption.
2. Customizing data readers to leverage specific features of storage formats can lead to significant performance gains.
   Uber's development of a new Parquet reader illustrates how tailored solutions can optimize data processing, making it essential to adapt tools to fit specific use cases.
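
To make the first insight concrete, here is a small sketch of why a columnar layout scans less data than a row layout when a query needs a single column. It assumes hypothetical trip records and field names (`trip_id`, `city`, `fare`, `distance_km`) and counts the values touched as a rough proxy for data scanned. It says nothing about Parquet's actual encoding or compression.

```python
# Hypothetical trip records; the field names are illustrative only.
records = [
    {"trip_id": 1, "city": "SF", "fare": 12.5, "distance_km": 3.2},
    {"trip_id": 2, "city": "NYC", "fare": 30.0, "distance_km": 8.1},
    {"trip_id": 3, "city": "SF", "fare": 7.5, "distance_km": 1.9},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_layout(rows, column):
    """Row layout: every field of every record is read to reach one column."""
    touched = sum(len(row) for row in rows)
    return sum(row[column] for row in rows), touched


def sum_column_layout(columns, column):
    """Columnar layout: only the requested column is read."""
    values = columns[column]
    return sum(values), len(values)


columns = to_columns(records)
row_total, row_touched = sum_row_layout(records, "fare")
col_total, col_touched = sum_column_layout(columns, "fare")

print(row_total, row_touched)  # 50.0 12
print(col_total, col_touched)  # 50.0 3
```

Both layouts return the same total, but the row layout touches all 12 stored values while the columnar layout touches only the 3 `fare` values. The gap widens as tables gain more columns.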

Common Pitfalls

1. Relying on generic open source solutions without customization can lead to performance bottlenecks.
   Uber's experience with the original Parquet reader highlights the importance of tailoring tools to specific data needs to avoid inefficiencies.

Related Concepts

Data Analytics
Distributed SQL Querying
Columnar Storage Formats