Selective Column Reduction for DataLake Storage Cost Efficiency

Xinli Shang, Kai Jiang, Ryan Chen, Jing Zhao, Mingmin Chen, Mohammad Islam, Karthik Natarajan, Ajit Panda

Uber

Overview

The article describes how Uber addresses rising data storage costs through selective column reduction in Apache Parquet™ files. Removing unused columns directly from existing files reduces the storage footprint while keeping the compute cost of the cleanup itself low.

What You'll Learn

1. How to implement selective column reduction in Apache Parquet™ files
2. Why reducing unused columns can optimize data storage costs
3. How to benchmark performance improvements in data processing
4. When to apply selective column pruning in large datasets

Prerequisites & Requirements

Understanding of Apache Parquet™ file format
Familiarity with Apache Spark™ for parallel execution

Key Questions Answered

What is selective column reduction in data lakes?

Selective column reduction is a method of optimizing data storage by removing unused columns from Apache Parquet™ files. With less data to store and process, storage costs fall and performance improves.
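As a rough illustration of the saving: in a columnar file, a column's footprint is the sum of its compressed chunk sizes, so the fraction reclaimed by dropping columns is simply their share of the total. The sketch below works through that arithmetic. The column names and byte counts are invented for the example and are not figures from the article.

```python
def fraction_saved(column_bytes, unused_columns):
    """Return the fraction of column-data bytes freed by dropping `unused_columns`.

    `column_bytes` maps column name -> compressed size in bytes.
    """
    total = sum(column_bytes.values())
    if total == 0:
        return 0.0
    dropped = sum(size for name, size in column_bytes.items() if name in unused_columns)
    return dropped / total


# Hypothetical per-column compressed sizes for one file (bytes).
sizes = {"trip_id": 100, "fare": 200, "debug_blob": 600, "legacy_flags": 100}
saving = fraction_saved(sizes, {"debug_blob", "legacy_flags"})
print(f"{saving:.0%} of column data reclaimed")  # prints "70% of column data reclaimed"
```

In practice the per-column sizes would come from the file's own metadata rather than a hand-written dictionary.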

How does the selective column pruning process work?

The selective column pruning process copies the columns to keep from an existing Parquet™ file into a new file and skips the columns marked for removal. Because the retained data is copied as-is, this method avoids expensive steps such as encryption, compression, and encoding, resulting in faster processing and reduced resource consumption.
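The copying and skipping described above can be sketched on a toy columnar layout. This is not the Parquet™ format and not Uber's implementation: the "file" here is just zlib-compressed chunks plus an index of `(name, offset, length)`, and every function name is hypothetical. The point it illustrates is that pruning only moves bytes and rewrites offsets, and never calls the compressor.

```python
import zlib


def write_toy_file(columns):
    """Build a toy columnar 'file': compressed chunks plus an index of (name, offset, length)."""
    body = bytearray()
    index = []
    for name, values in columns.items():
        chunk = zlib.compress("\n".join(values).encode("utf-8"))
        index.append((name, len(body), len(chunk)))
        body += chunk
    return bytes(body), index


def prune_columns(body, index, columns_to_remove):
    """Copy kept chunks byte-for-byte and skip removed ones.

    No chunk is decompressed or recompressed; only offsets in the new index change.
    """
    new_body = bytearray()
    new_index = []
    for name, offset, length in index:
        if name in columns_to_remove:
            continue  # skip: this column's bytes are never touched
        new_index.append((name, len(new_body), length))
        new_body += body[offset:offset + length]  # raw copy of the compressed chunk
    return bytes(new_body), new_index


def read_column(body, index, wanted):
    """Decode one column, to check that the pruned file is still readable."""
    for name, offset, length in index:
        if name == wanted:
            return zlib.decompress(body[offset:offset + length]).decode("utf-8").split("\n")
    raise KeyError(wanted)


body, index = write_toy_file({
    "id": ["1", "2", "3"],
    "unused_notes": ["a" * 50, "b" * 50, "c" * 50],
    "city": ["SF", "NYC", "SEA"],
})
pruned_body, pruned_index = prune_columns(body, index, {"unused_notes"})
print(read_column(pruned_body, pruned_index, "city"))  # prints ['SF', 'NYC', 'SEA']
```

A traditional rewrite would instead decode every column and re-encode the survivors, which is where most of the avoidable cost lies.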

What performance improvements were observed with the selective column pruner?

Benchmarking tests showed that the Selective Column Pruner is approximately 27 times faster for 1GB files and about 9 times faster for 1MB files compared to Apache Spark™. This highlights the efficiency of the new approach in handling large datasets.
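A speedup figure like the ones above is the ratio of baseline time to candidate time. The following is a minimal timing harness, assuming each approach can be wrapped in a zero-argument callable; the helper names are hypothetical and the timings in the usage line are illustrative, not measurements from the article.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time (seconds) over `repeats` runs of `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds


# Illustrative numbers only: a 54 s baseline against a 2 s candidate.
print(speedup(54.0, 2.0))  # prints 27.0
```

Taking the best of several runs reduces noise from caching and background load; comparing on files of more than one size, as the article does, shows how the gain scales.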

When should selective column pruning be applied?

Selective column pruning should be applied when dealing with large datasets that contain unused columns. This approach is particularly beneficial in scenarios where storage costs are high and performance optimization is critical, such as in data lakes.

Key Statistics & Figures

Performance improvement: 27x faster for 1GB files, compared to Apache Spark™

Performance improvement: 9x faster for 1MB files, compared to Apache Spark™

Technologies & Tools

Data Storage: Apache Parquet™
Main file format used for data storage in Uber's data lake.

Data Processing: Apache Spark™
Used for parallel execution of the selective column pruning process.

Key Actionable Insights

1. Implement selective column reduction to optimize storage costs in your data lake.
By removing unused columns, you can significantly reduce the amount of data stored, leading to lower storage costs and improved performance.

2. Utilize benchmarking to assess the performance of your data processing methods.
Regular benchmarking helps identify performance bottlenecks and validate the effectiveness of optimizations like selective column pruning.

3. Consider using Apache Spark™ for parallel execution of column pruning tasks.
Leveraging Spark™ can enhance the efficiency of processing large datasets, making it easier to implement selective column reduction across multiple files.
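Because each file can be pruned independently, the work parallelizes naturally across files. The sketch below uses Python's `ThreadPoolExecutor` as a local stand-in for Spark™ executors, so it is not Spark code; in PySpark the same shape would be roughly `sc.parallelize(paths).map(...).collect()`. The `fake_prune` function is a hypothetical placeholder for a real per-file pruner.

```python
from concurrent.futures import ThreadPoolExecutor


def prune_many(paths, prune_one, max_workers=4):
    """Apply `prune_one(path)` to every path in parallel; return {path: result or exception}."""
    def safe(path):
        try:
            return path, prune_one(path)
        except Exception as exc:  # record the failure and keep processing other files
            return path, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe, paths))


# Hypothetical stand-in for the real per-file pruner: it only derives an output path.
def fake_prune(path):
    if not path.endswith(".parquet"):
        raise ValueError(f"not a Parquet file: {path}")
    return path.replace(".parquet", ".pruned.parquet")


results = prune_many(["a.parquet", "b.parquet", "notes.txt"], fake_prune)
print(results["a.parquet"])  # prints a.pruned.parquet
```

Capturing failures per file, rather than letting one exception abort the batch, makes it easier to retry only the files that failed.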

Common Pitfalls

1. Over-reliance on traditional data processing steps can lead to inefficiencies.

A conventional rewrite runs every column through encryption, compression, and encoding, even when the column's data is not changing. These steps are costly in terms of performance, and avoiding them when they are unnecessary can significantly improve processing times.

Related Concepts

Data Lake Architecture

Columnar Storage Formats

Data Optimization Techniques