Improving efficiency and reducing runtime using S3 read optimization

Pinterest Engineering

Overview

The article describes a new approach to reading data from S3 that raised read throughput roughly 12x, from 21 MB/s to 269 MB/s. In production jobs this translated into a 22% reduction in vcore-hours and a 23% reduction in memory-hours.

What You'll Learn

1. How to optimize S3 read throughput for production jobs

2. Why asynchronous data reading improves CPU utilization

3. When to implement local caching for Parquet file reads

Key Questions Answered

How much did S3 read throughput improve with the new implementation?

The new implementation improved S3 read throughput from 21 MB/s to 269 MB/s, a roughly 12x increase. Production jobs completed more quickly as a result, which produced significant resource savings.

What are the efficiency gains observed in production jobs after implementing S3 read optimization?

After implementing the S3 read optimization, there was a 22% reduction in vcore-hours and a 23% reduction in memory-hours. This led to overall reduced job runtimes and improved CPU utilization.

What specific problems were identified in the S3AInputStream implementation?

The S3AInputStream implementation had two main problems. Reads were single-threaded, so jobs sat idle waiting for data to arrive over the network, and the stream was reopened multiple times unnecessarily, which slowed throughput further. The new optimization addresses both bottlenecks.

How does the new implementation handle sequential data reads?

The new implementation allows data consumers, like mappers, to process data sequentially while asynchronously prefetching the next blocks. This reduces wait times and increases CPU utilization, as data is often ready by the time the mapper needs it.
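
The prefetching behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's implementation: `read_with_prefetch`, `fetch_block`, and `prefetch_depth` are hypothetical names, and `fetch_block` stands in for whatever issues a ranged read against S3.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Iterator


def read_with_prefetch(
    fetch_block: Callable[[int], bytes],  # hypothetical: returns block i of the object
    num_blocks: int,
    prefetch_depth: int = 2,              # hypothetical knob; must be >= 1
) -> Iterator[bytes]:
    """Yield blocks in order while later blocks download in the background."""
    with ThreadPoolExecutor(max_workers=prefetch_depth) as pool:
        pending: Dict[int, Future] = {}  # block index -> in-flight download
        next_to_submit = 0
        for current in range(num_blocks):
            # Keep the current block plus up to `prefetch_depth` later blocks requested.
            while next_to_submit < num_blocks and next_to_submit <= current + prefetch_depth:
                pending[next_to_submit] = pool.submit(fetch_block, next_to_submit)
                next_to_submit += 1
            # Waits only if this block has not finished downloading yet.
            yield pending.pop(current).result()


if __name__ == "__main__":
    # In-memory bytes stand in for a remote object in this demo.
    data = bytes(range(256)) * 4
    block_size = 100

    def fake_fetch(index: int) -> bytes:
        return data[index * block_size:(index + 1) * block_size]

    total_blocks = -(-len(data) // block_size)  # ceiling division
    result = b"".join(read_with_prefetch(fake_fetch, total_blocks))
    print(result == data)
```

The consumer iterates over blocks as if reading sequentially. While it processes one block, the pool is already downloading the next ones, which is where the reduced wait time comes from.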

Key Statistics & Figures

Improvement in S3 read throughput: 12x (increased from 21 MB/s to 269 MB/s)

Reduction in vcore-hours: 22% (observed after implementing the new S3 read optimization)

Reduction in memory-hours: 23% (also observed after implementing the new S3 read optimization)

Improvement in Parquet file reading throughput: 5x (compared to the stock reader, after implementing local caching)

Technologies & Tools

Amazon S3 (storage): Used for storing and processing large datasets in production jobs

MapReduce (framework): Framework used for processing petabytes of data

Cascading (framework): Framework utilized for data processing

Scalding (framework): Framework used for data processing in conjunction with S3

Spark (framework): Used for data processing with the new optimization in preliminary evaluations

Parquet (file format): File format requiring non-sequential access, optimized in the new implementation

Key Actionable Insights

1. Implement asynchronous reading for data processing jobs to enhance throughput. This approach minimizes wait times for data retrieval, allowing jobs to complete faster and utilize CPU resources more effectively.

2. Consider using local caching when working with Parquet files to improve read performance. Local caching can significantly enhance throughput, especially for non-sequential access patterns typical of Parquet files.

3. Evaluate the split size and prefetch cache size for optimization based on job characteristics. Tuning these parameters can lead to better performance and resource savings tailored to specific workloads.

Common Pitfalls

1. Relying on single-threaded reads can severely limit data processing speed. This occurs because jobs spend excessive time waiting for data to be read over the network, which can be mitigated by implementing asynchronous reading strategies.

2. Failing to cache prefetched data can lead to performance degradation, especially with non-sequential access patterns. Without caching, any seek outside the current block results in discarding prefetched data, which can slow down operations significantly.
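
The second pitfall can be illustrated with a small block cache that survives seeks. This is only a sketch under assumptions: the article mentions local caching without giving details, so the in-memory LRU cache here, along with the names `CachedBlockReader`, `fetch_block`, and `max_cached_blocks`, is hypothetical and chosen for illustration.

```python
from collections import OrderedDict
from typing import Callable


class CachedBlockReader:
    """Random-access reader that keeps recently fetched blocks in an LRU cache."""

    def __init__(self, fetch_block: Callable[[int], bytes], size: int,
                 block_size: int, max_cached_blocks: int = 4) -> None:
        self._fetch_block = fetch_block      # hypothetical: returns block i of the object
        self._size = size                    # total object size in bytes
        self._block_size = block_size
        self._max_cached_blocks = max_cached_blocks
        self._cache: "OrderedDict[int, bytes]" = OrderedDict()
        self._pos = 0
        self.fetch_count = 0                 # number of fetches that missed the cache

    def seek(self, pos: int) -> None:
        if not 0 <= pos <= self._size:
            raise ValueError("seek position out of range")
        self._pos = pos                      # a seek never discards cached blocks

    def _block(self, index: int) -> bytes:
        if index in self._cache:
            self._cache.move_to_end(index)   # mark as most recently used
            return self._cache[index]
        block = self._fetch_block(index)
        self.fetch_count += 1
        self._cache[index] = block
        if len(self._cache) > self._max_cached_blocks:
            self._cache.popitem(last=False)  # evict the least recently used block
        return block

    def read(self, n: int) -> bytes:
        out = bytearray()
        end = min(self._pos + n, self._size)
        while self._pos < end:
            index, offset = divmod(self._pos, self._block_size)
            chunk = self._block(index)[offset:offset + (end - self._pos)]
            if not chunk:                    # defensive: fetch returned a short block
                break
            out += chunk
            self._pos += len(chunk)
        return bytes(out)
```

A Parquet-style access pattern would read near the end of the file first, jump back to an earlier region, and then return to the end. With the cache in place, that final read is served from memory and `fetch_count` does not increase.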

Related Concepts

Data Processing Optimization Techniques

Caching Strategies For Data Retrieval

Performance Tuning For Big Data Frameworks