Using Presto in our Big Data Platform on AWS

Netflix Technology Blog

Netflix

AWS, AWS S3, Java, JSON, SQL

Overview

The article discusses Netflix's implementation of Presto within their Big Data Platform on AWS, detailing its architecture, performance, and integration with S3. It highlights the benefits of using Presto for interactive querying and data exploration across a multi-petabyte data warehouse.

What You'll Learn

1. How to integrate Presto with AWS S3 for interactive querying

2. Why Presto is suitable for ad hoc data exploration at scale

3. How to optimize query performance using Presto with the Parquet file format

Prerequisites & Requirements

Understanding of data warehousing concepts and SQL
Familiarity with AWS services, particularly S3 (optional)
Experience with data analytics and ETL processes

Key Questions Answered

What are the benefits of using Presto for data exploration?

Presto supports low-latency, interactive exploration of large datasets, which makes it well suited to ad hoc queries. It can query a multi-petabyte data warehouse directly on S3, so users can run diverse queries efficiently without the need for extensive caching.
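To make "interactive" concrete, here is a minimal sketch of how a client might submit an ad hoc query over Presto's HTTP statement protocol and page through the results. The coordinator URL, user name, and default catalog and schema are placeholders rather than details from the article, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_statement_request(server, sql, user, catalog="hive", schema="default"):
    """Build the initial POST to the coordinator's /v1/statement endpoint."""
    return urllib.request.Request(
        url=f"{server}/v1/statement",
        data=sql.encode("utf-8"),
        headers={
            "X-Presto-User": user,
            "X-Presto-Catalog": catalog,  # assumed catalog name
            "X-Presto-Schema": schema,    # assumed schema name
        },
        method="POST",
    )


def http_fetch(request_or_url):
    """Send a request (or GET a URL) and decode the JSON response."""
    with urllib.request.urlopen(request_or_url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(server, sql, user, fetch=http_fetch):
    """Submit a query and follow nextUri links until the result is complete."""
    page = fetch(build_statement_request(server, sql, user))
    rows = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            return rows
        page = fetch(next_uri)
```

The `fetch` parameter is injectable so the paging logic can be exercised without a live coordinator; in real use a maintained Presto client library is likely a better choice than hand-rolled HTTP.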

How does Presto compare to Hive in terms of performance?

Presto significantly outperforms Hive on interactive queries: those that would need one or two map-reduce phases in Hive run 10 to 100 times faster. The speedup grows linearly with the number of map-reduce jobs the Hive query would involve, which makes Presto the better choice for interactive querying.

What is the current setup of Netflix's Presto cluster?

Netflix's Presto cluster consists of approximately 250 m2.4xlarge EC2 worker instances and one r3.4xlarge coordinator, and runs around 2500 queries per workday. The jobs are mostly CPU-bound, and task memory is set high (7GB) so that memory-intensive queries can run.
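A rough back-of-the-envelope reading of these figures can be sketched as follows. Two inputs are assumptions rather than facts from the article: an 8-hour workday, and 68.4 GB of RAM per m2.4xlarge (AWS's published figure for that instance type). The calculation also ignores JVM heap sizing and operating system overhead, so treat it as an upper bound only.

```python
# Figures reported in the article (all approximate).
WORKERS = 250
QUERIES_PER_WORKDAY = 2500
TASK_MEMORY_GB = 7

# Assumptions, not from the article.
WORKDAY_HOURS = 8          # assumed length of a workday
WORKER_MEMORY_GB = 68.4    # assumed RAM per m2.4xlarge worker

# Average query arrival rate over the assumed workday.
queries_per_hour = QUERIES_PER_WORKDAY / WORKDAY_HOURS

# How many tasks could each use their full memory limit on one worker at once.
max_full_tasks_per_worker = int(WORKER_MEMORY_GB // TASK_MEMORY_GB)

print(f"{queries_per_hour} queries/hour across {WORKERS} workers")
print(f"{max_full_tasks_per_worker} full-size tasks fit in one worker's RAM")
```

Under these assumptions only a handful of queries can use their full task memory on a worker simultaneously, which is why a high per-task limit carries some risk of oversubscribing memory.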

What contributions has Netflix made to the Presto project?

Netflix's contributions to the Presto project include integrating the S3 FileSystem, optimizing S3 multipart uploads, adding the ability to disable recursive directory listing, and work on JSON tuple generation. These contributions aim to improve Presto's performance and usability.
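Several of these contributions surface as settings in Presto's Hive connector. The sketch below renders a catalog properties file with a few of them. The property names are taken from Presto's Hive connector documentation as best recalled and may differ between versions, so check them against the release in use; the metastore URI and all values are placeholders, not Netflix's configuration.

```python
def render_properties(props):
    """Render a dict as Java-style .properties lines, sorted for stable output."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items())) + "\n"


# Hypothetical catalog settings; values are illustrative only.
hive_catalog = {
    "connector.name": "hive-hadoop2",
    "hive.metastore.uri": "thrift://metastore.example.com:9083",  # placeholder host
    "hive.s3.use-instance-credentials": "true",   # use the EC2 instance role for S3
    "hive.s3.multipart.min-file-size": "16MB",    # multipart upload threshold
    "hive.s3.multipart.min-part-size": "5MB",     # size of each uploaded part
    "hive.recursive-directories": "false",        # skip subdirectories when listing
}

rendered = render_properties(hive_catalog)
print(rendered)
```

The output would typically be saved as a file such as `etc/catalog/hive.properties` on each Presto node.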

Key Statistics & Figures

Data warehouse size

10 petabytes

Netflix's data warehouse on S3 supports extensive querying across diverse datasets.

EC2 worker instances

250 m2.4xlarge

This setup allows Netflix to handle approximately 2500 queries per workday.

Task memory allocation

7GB

This high memory allocation is essential for running memory-intensive queries like big joins or aggregations.

Technologies & Tools

Data Processing

Presto

Used for interactive querying and data exploration on large datasets.

Cloud Storage

AWS S3

Serves as the data warehouse for storing multi-petabyte datasets.

File Format

Parquet

Utilized for efficient data storage and retrieval in conjunction with Presto.
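Much of Parquet's benefit comes from its columnar layout: a query that touches only a few columns does not need to read the rest. The toy model below illustrates that effect. The table, column widths, and row count are invented for illustration, and the model ignores compression, encoding, and predicate pushdown.

```python
def bytes_scanned_row_layout(num_rows, column_widths):
    """Row-oriented files must read every column of every row."""
    return num_rows * sum(column_widths.values())


def bytes_scanned_columnar(num_rows, column_widths, selected):
    """Columnar files can read only the columns the query references."""
    return num_rows * sum(column_widths[name] for name in selected)


# Hypothetical table: column name -> average bytes per value.
widths = {"user_id": 8, "title_id": 8, "country": 2, "event_json": 500}
rows = 1_000_000

row_bytes = bytes_scanned_row_layout(rows, widths)
col_bytes = bytes_scanned_columnar(rows, widths, ["country", "title_id"])

print(f"row layout:      {row_bytes:,} bytes")
print(f"columnar layout: {col_bytes:,} bytes")
```

The gap widens as tables get wider and queries get narrower, which is the common shape of ad hoc analytical queries.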

Key Actionable Insights

1. Integrate Presto with your existing data warehouse to enhance interactive querying capabilities.
By leveraging Presto's architecture with AWS S3, organizations can improve their data exploration processes, allowing analysts to gain insights quickly from large datasets.

2. Consider using the Parquet file format for better performance with Presto.
Parquet's efficiency in data storage and retrieval can significantly enhance query performance, particularly for complex queries involving large datasets.

3. Engage with the open-source community to improve your data tools.
Contributing to projects like Presto not only helps tailor the tool to your needs but also fosters collaboration and innovation within the community.

Common Pitfalls

1. Underestimating the complexity of user-defined functions in Presto.
Developing user-defined functions in Presto is more involved than in Hive or Pig, which can lead to delays in implementation if not properly planned.

2. Neglecting the need for query optimization.
Without proper tuning and optimization, users may experience suboptimal performance, especially with large datasets and complex queries.

Related Concepts

Big Data Analytics

Data Warehousing

ETL Processes

Open Source Contributions