Improving efficiency and reducing runtime using S3 read optimization

Pinterest Engineering

Overview

The article describes an approach to improving S3 read throughput that produced significant efficiency gains for production jobs. The implementation raised throughput roughly 12x, from 21 MB/s to 269 MB/s, and reduced vcore-hours by 22% and memory-hours by 23%.

What You'll Learn

1. How to optimize S3 read throughput for production jobs
2. Why asynchronous data reading improves CPU utilization
3. When to implement local caching for Parquet file reads

Key Questions Answered

How much did S3 read throughput improve with the new implementation?
The new implementation improved S3 read throughput from 21 MB/s to 269 MB/s, achieving a 12x increase. This enhancement allowed production jobs to complete more quickly, resulting in significant resource savings.
What are the efficiency gains observed in production jobs after implementing S3 read optimization?
After implementing the S3 read optimization, there was a 22% reduction in vcore-hours and a 23% reduction in memory-hours. This led to overall reduced job runtimes and improved CPU utilization.
What specific problems were identified in the S3AInputStream implementation?
The S3AInputStream implementation read data on a single thread, so jobs spent much of their time waiting for data to arrive over the network, and it reopened the stream multiple times unnecessarily, which slowed throughput further. The new optimization addresses both bottlenecks.
How does the new implementation handle sequential data reads?
The new implementation allows data consumers, like mappers, to process data sequentially while asynchronously prefetching the next blocks. This reduces wait times and increases CPU utilization, as data is often ready by the time the mapper needs it.
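
The sequential prefetching idea described in the last answer can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: `prefetching_blocks`, `fetch_block`, and `prefetch_depth` are hypothetical names, and `fake_fetch` stands in for a ranged read against S3.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def prefetching_blocks(
    fetch_block: Callable[[int], bytes],
    num_blocks: int,
    prefetch_depth: int = 2,
) -> Iterator[bytes]:
    """Yield blocks in order while later blocks download in the background."""
    with ThreadPoolExecutor(max_workers=prefetch_depth) as pool:
        pending = {}  # block index -> Future for that block's download
        next_to_submit = 0
        for index in range(num_blocks):
            # Queue the current block plus up to prefetch_depth blocks ahead.
            while next_to_submit < num_blocks and next_to_submit <= index + prefetch_depth:
                pending[next_to_submit] = pool.submit(fetch_block, next_to_submit)
                next_to_submit += 1
            # Waits only if this block has not finished downloading yet.
            yield pending.pop(index).result()


def fake_fetch(index: int) -> bytes:
    # Hypothetical stand-in for a network read of one block.
    return bytes([index]) * 4


blocks = list(prefetching_blocks(fake_fetch, num_blocks=5))
```

The consumer iterates over blocks in order, as a mapper would, while the thread pool keeps the next few reads in flight, so the consumer waits only when it catches up with the downloads.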

Key Statistics & Figures

Improvement in S3 read throughput: 12x (increased from 21 MB/s to 269 MB/s)
Reduction in vcore-hours: 22% (observed after implementing the new S3 read optimization)
Reduction in memory-hours: 23% (also observed after implementing the new S3 read optimization)
Improvement in Parquet file reading throughput: 5x (compared to the stock reader, after implementing local caching)

Technologies & Tools

Storage: Amazon S3. Used for storing and processing large datasets in production jobs.
Framework: MapReduce. Used for processing petabytes of data.
Framework: Cascading. Used for data processing.
Framework: Scalding. Used for data processing in conjunction with S3.
Framework: Spark. Used for data processing with the new optimization in preliminary evaluations.
File Format: Parquet. Requires non-sequential access, which the new implementation optimizes.

Key Actionable Insights

1. Implement asynchronous reading for data processing jobs to enhance throughput.
   This approach minimizes wait times for data retrieval, allowing jobs to complete faster and utilize CPU resources more effectively.
2. Consider using local caching when working with Parquet files to improve read performance.
   Local caching can significantly enhance throughput, especially for non-sequential access patterns typical of Parquet files.
3. Evaluate the split size and prefetch cache size for optimization based on job characteristics.
   Tuning these parameters can lead to better performance and resource savings tailored to specific workloads.
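
When weighing split size against prefetch cache size, it can help to estimate how many blocks a split spans and how much buffer memory the prefetcher may hold per task. The helper below is a rough sketch under assumed values: the function name `prefetch_plan`, the 128 MiB split, the 8 MiB block, and the prefetch count of 4 are all hypothetical and do not come from the article.

```python
MIB = 1024 * 1024


def prefetch_plan(split_bytes: int, block_bytes: int, prefetch_blocks: int) -> dict:
    """Estimate block count and peak prefetch buffer size for one split."""
    # Ceiling division: a partial trailing block still counts as a block.
    blocks_per_split = (split_bytes + block_bytes - 1) // block_bytes
    # The prefetcher cannot hold more blocks than the split contains.
    buffered_blocks = min(prefetch_blocks, blocks_per_split)
    return {
        "blocks_per_split": blocks_per_split,
        "peak_buffer_bytes": buffered_blocks * block_bytes,
    }


# Hypothetical job: 128 MiB splits, 8 MiB blocks, 4 blocks prefetched ahead.
plan = prefetch_plan(split_bytes=128 * MIB, block_bytes=8 * MIB, prefetch_blocks=4)
```

Multiplying `peak_buffer_bytes` by the number of concurrent tasks on a host gives a first-order view of the memory the prefetch cache would need, which is one input to choosing these parameters for a given workload.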

Common Pitfalls

1. Relying on single-threaded reads can severely limit data processing speed.
   Jobs spend excessive time waiting for data to be read over the network. Asynchronous reading strategies mitigate this.
2. Failing to cache prefetched data can lead to performance degradation, especially with non-sequential access patterns.
   Without caching, any seek outside the current block discards the prefetched data, which can slow down operations significantly.
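
The second pitfall can be illustrated with a small block cache that keeps recently fetched blocks, so a seek away and back does not trigger a second download. This is a simplified in-memory sketch, not the article's local cache: `CachedBlockReader`, `fetch_block`, and `max_blocks` are hypothetical names, and the access pattern (end of file first, then the start) only loosely imitates how a Parquet reader reads the footer before the data.

```python
from collections import OrderedDict
from typing import Callable


class CachedBlockReader:
    """Serve arbitrary byte ranges from fixed-size blocks, keeping recent blocks."""

    def __init__(self, fetch_block: Callable[[int], bytes], block_size: int, max_blocks: int = 4):
        self._fetch_block = fetch_block
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._cache: "OrderedDict[int, bytes]" = OrderedDict()
        self.fetch_count = 0  # number of blocks actually fetched from the source

    def _block(self, index: int) -> bytes:
        if index in self._cache:
            self._cache.move_to_end(index)  # mark as most recently used
            return self._cache[index]
        data = self._fetch_block(index)
        self.fetch_count += 1
        self._cache[index] = data
        if len(self._cache) > self._max_blocks:
            self._cache.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, self._block_size)
            chunk = self._block(index)[start:start + (end - offset)]
            if not chunk:
                break  # reached the end of the underlying data
            out += chunk
            offset += len(chunk)
        return bytes(out)


data = bytes(range(32))  # hypothetical 32-byte "file" split into 8-byte blocks
reader = CachedBlockReader(lambda i: data[i * 8:(i + 1) * 8], block_size=8)
tail = reader.read(24, 8)       # read the end of the file first
head = reader.read(0, 4)        # seek back to the start
tail_again = reader.read(26, 4)  # seek to the end again: served from the cache
```

After these three reads only two blocks have been fetched, because the second read of the tail block is a cache hit rather than a new request.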

Related Concepts

Data Processing Optimization Techniques
Caching Strategies For Data Retrieval
Performance Tuning For Big Data Frameworks