Presto: Interacting with petabytes of data at Facebook

Martin Traverso


Overview

The article discusses Presto, a distributed SQL query engine developed by Facebook to enable interactive analysis of large datasets stored in their data warehouse. It highlights Presto's architecture, performance improvements over traditional systems, and its open-source release.

What You'll Learn

1. How to optimize SQL queries for low latency in large datasets
2. Why Presto is more efficient than Hive/MapReduce for interactive analytics
3. How to integrate Presto with various data sources using connectors

Prerequisites & Requirements

Understanding of SQL and distributed systems
Familiarity with Hadoop and data warehousing concepts (optional)

Key Questions Answered

How does Presto improve query performance compared to Hive/MapReduce?

For most queries at Facebook, Presto is 10x better than Hive/MapReduce in both CPU efficiency and latency. It achieves this with an in-memory processing model and pipelined execution: each stage streams its results to the next as they are produced, rather than waiting for the previous stage to finish, which significantly reduces end-to-end latency.
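The latency effect of pipelining can be illustrated with a small sketch. This is not Presto code; it is a toy model in Python, with hypothetical stage names (`scan`, `filter_stage`) and a made-up three-row table, that only shows the ordering difference between streaming rows between stages and materializing each stage's full output first.

```python
from typing import Iterable, Iterator, List

def scan(rows: Iterable[dict], log: List[str]) -> Iterator[dict]:
    """Scan stage: yields rows one at a time instead of building a full list."""
    for row in rows:
        log.append(f"scan {row['id']}")
        yield row

def filter_stage(rows: Iterable[dict], log: List[str]) -> Iterator[dict]:
    """Filter stage: consumes each row as soon as the upstream stage yields it."""
    for row in rows:
        if row["bytes"] > 100:
            log.append(f"emit {row['id']}")
            yield row

def run_pipelined(table: List[dict]) -> List[str]:
    """Chain the stages lazily; no intermediate result is materialized."""
    log: List[str] = []
    for _ in filter_stage(scan(table, log), log):
        pass
    return log

def run_staged(table: List[dict]) -> List[str]:
    """Finish and materialize each stage before starting the next one."""
    log: List[str] = []
    scanned = list(scan(table, log))   # full intermediate result held first
    list(filter_stage(scanned, log))
    return log

# Hypothetical sample data for illustration only.
TABLE = [{"id": 1, "bytes": 500}, {"id": 2, "bytes": 50}, {"id": 3, "bytes": 900}]

if __name__ == "__main__":
    print("pipelined:", run_pipelined(TABLE))
    print("staged:   ", run_staged(TABLE))
```

In the pipelined run the first result row is emitted right after the first row is scanned, while the staged run emits nothing until the whole scan has completed. On a three-row table the difference is trivial; on a scan that takes minutes, it is the difference between seeing early results and waiting for the full pass.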

What is the architecture of Presto?

Presto is a distributed SQL query engine that supports ANSI SQL and operates by sending SQL queries to a coordinator that manages execution across nodes. It avoids MapReduce, using a custom execution engine that processes data in memory and streams it between stages.
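A rough way to picture the coordinator's role is the sketch below. It is a simplification under stated assumptions, not Presto's scheduler: threads stand in for worker nodes, the `(country, bytes)` row shape is hypothetical, and the function names are invented for illustration. It mimics a query such as `SELECT country, SUM(bytes) ... GROUP BY country` by having each worker aggregate its own split in memory and the coordinator merge the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (country, bytes); hypothetical schema

def worker_partial_sum(split: List[Row]) -> Counter:
    """Worker task: aggregate one split of the data entirely in memory."""
    partial: Counter = Counter()
    for key, value in split:
        partial[key] += value
    return partial

def coordinator_sum_by_key(splits: List[List[Row]], workers: int = 4) -> Dict[str, int]:
    """Toy coordinator: hand each split to a worker, then merge the partials."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields each worker's partial aggregate in split order.
        for partial in pool.map(worker_partial_sum, splits):
            total.update(partial)  # Counter.update adds counts per key
    return dict(total)

# Hypothetical splits, as if the table were stored in three separate chunks.
SPLITS = [
    [("US", 10), ("BR", 5)],
    [("US", 7), ("IN", 3)],
    [("IN", 4), ("BR", 1)],
]

if __name__ == "__main__":
    print(coordinator_sum_by_key(SPLITS))
```

The two-level shape (partial aggregation near the data, final aggregation in one place) is a common pattern in distributed query engines; how Presto actually plans and wires its stages is more involved than this sketch suggests.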

What are the current capabilities and limitations of Presto?

Presto supports a large subset of ANSI SQL, including joins and aggregations, but it limits the size of the tables that can be joined and cannot write query output back to tables. At Facebook it runs more than 30,000 queries a day, which together process about one petabyte of data.

What future improvements are planned for Presto?

Future enhancements for Presto include removing restrictions on join and aggregation sizes, introducing the ability to write output tables, and developing a query accelerator with a new data format optimized for processing.

Key Statistics & Figures

Data processed daily

1 petabyte

Presto processes this amount of data daily across the company.

CPU efficiency improvement

10x better

Presto outperforms Hive/MapReduce in CPU efficiency and latency for most queries.

Number of queries run daily

30,000 queries

This reflects the high demand and usage of Presto among Facebook employees.

Cluster size

1,000 nodes

Presto has successfully scaled to this size in a single cluster.
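Taken together, the daily data volume and query count above imply a rough average amount of data per query. The snippet below is only back-of-the-envelope arithmetic: it assumes a decimal petabyte (10^15 bytes) and treats the two figures as exact, although both are approximate, and an average says nothing about how skewed the real distribution is.

```python
PETABYTE_BYTES = 10**15      # assumption: decimal petabyte; the source does not specify
GIGABYTE_BYTES = 10**9
QUERIES_PER_DAY = 30_000     # figure quoted above
DATA_PER_DAY_BYTES = 1 * PETABYTE_BYTES  # figure quoted above

def avg_gb_per_query() -> float:
    """Average data processed per query, in gigabytes."""
    return DATA_PER_DAY_BYTES / QUERIES_PER_DAY / GIGABYTE_BYTES

if __name__ == "__main__":
    print(f"{avg_gb_per_query():.1f} GB per query on average")
```

Under those assumptions the average works out to roughly 33 GB per query.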

Technologies & Tools


Backend

Presto

A distributed SQL query engine for interactive analytics.

Backend

Hadoop

Used for data storage in Facebook's data warehouse.

Storage

HDFS

The primary storage system for Facebook's data warehouse.

Programming Language

Java

The language used to implement Presto.

Key Actionable Insights

1. Leverage Presto's in-memory processing capabilities to reduce query latency in your data analysis tasks.
Presto can significantly speed up queries over large datasets, which makes it well suited to interactive analytics.

2. Utilize Presto's extensibility to connect to various data sources beyond HDFS.
Presto's design allows for easy integration with different data stores, enabling a unified SQL querying capability across diverse data environments.

3. Monitor the performance of your Presto queries to identify bottlenecks and optimize execution plans.
Understanding how Presto schedules and executes queries can help you fine-tune performance and improve overall system efficiency.
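The second insight above, extensibility through connectors, can be made more concrete with a sketch. Everything here is hypothetical: the `Connector` class, its three methods, and `InMemoryConnector` are invented for illustration and are not Presto's actual Java plugin interfaces. The sketch assumes a connector has to expose roughly three things to the engine: table metadata, the locations (splits) of the data, and a way to read the data itself.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

class Connector(ABC):
    """Hypothetical plugin interface; NOT Presto's real connector API."""

    @abstractmethod
    def get_columns(self, table: str) -> List[str]:
        """Metadata: column names of a table."""

    @abstractmethod
    def get_splits(self, table: str) -> List[int]:
        """Data locations: identifiers of independently readable chunks."""

    @abstractmethod
    def read_split(self, table: str, split: int) -> Iterator[Tuple]:
        """Data access: rows stored in one chunk."""

class InMemoryConnector(Connector):
    """Toy data source backed by a dict, standing in for any external store."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[List[Tuple]]]]):
        self._tables = tables  # table name -> (columns, list of chunks)

    def get_columns(self, table: str) -> List[str]:
        return self._tables[table][0]

    def get_splits(self, table: str) -> List[int]:
        return list(range(len(self._tables[table][1])))

    def read_split(self, table: str, split: int) -> Iterator[Tuple]:
        return iter(self._tables[table][1][split])

def scan_table(connector: Connector, table: str) -> List[dict]:
    """Engine-side code: depends only on the interface, not on the source."""
    columns = connector.get_columns(table)
    rows: List[dict] = []
    for split in connector.get_splits(table):
        for values in connector.read_split(table, split):
            rows.append(dict(zip(columns, values)))
    return rows

# Hypothetical demo data: one table stored as two chunks.
DEMO = InMemoryConnector(
    {"users": (["id", "name"], [[(1, "ann")], [(2, "bob"), (3, "cy")]])}
)

if __name__ == "__main__":
    print(scan_table(DEMO, "users"))
```

Because `scan_table` only talks to the abstract interface, adding a new data source means writing another `Connector` subclass and leaving the engine-side code untouched, which is the general idea behind a single SQL layer over several stores.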

Common Pitfalls

1. Assuming Presto can replace all functionalities of Hive/MapReduce without understanding its limitations.

While Presto offers significant advantages in speed and efficiency, it currently lacks the ability to write output data back to tables, which may limit its use in certain scenarios.

Related Concepts

Distributed SQL Query Engines

Data Warehousing

Real-time Analytics

Hadoop Ecosystem