Accelerating Apache Parquet Scans on Apache Spark with GPUs

Matt Ahrens

As data sizes have grown in enterprises across industries, Apache Parquet has become a prominent format for storing data. Apache Parquet is a columnar storage…

NVIDIA


Overview

The article discusses how to accelerate Apache Parquet scans on Apache Spark using GPUs, specifically through the RAPIDS Accelerator for Apache Spark. It explains how replacing a monolithic decode kernel with microkernels in cuDF improves performance and relieves the occupancy limitations of GPU processing.

What You'll Learn

1. How to accelerate Apache Parquet scans using GPUs
2. Why microkernels improve GPU occupancy and performance in data processing
3. How to leverage the RAPIDS Accelerator for Apache Spark in existing workloads

Prerequisites & Requirements

Understanding of Apache Spark and GPU architectures
Familiarity with cuDF and RAPIDS libraries (optional)

Key Questions Answered

How does the RAPIDS Accelerator improve Apache Spark performance with Parquet?

The RAPIDS Accelerator for Apache Spark enhances performance by utilizing GPUs to accelerate data processing through optimized microkernels in cuDF. This allows for efficient scanning of Parquet data, significantly improving runtime performance for large-scale workloads.

What are the limitations of the previous monolithic kernel for Parquet scans?

The previous monolithic kernel for Parquet scans had high shared memory and register usage, leading to lower GPU occupancy. This complexity constrained optimizations and resulted in performance limitations, especially for large-scale data processing.
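The link between resource usage and occupancy can be illustrated with a simplified calculation. The sketch below is not taken from the article: the per-SM limits and the per-kernel register and shared memory figures are hypothetical round numbers chosen only to show the effect, and the model ignores hardware allocation granularity.

```python
# Simplified occupancy model. All hardware limits and kernel figures here are
# illustrative assumptions, not measurements from the article or from cuDF.

def max_resident_blocks(threads_per_block, regs_per_thread, smem_per_block,
                        sm_regs=65536, sm_smem=100 * 1024,
                        sm_threads=2048, sm_blocks=32):
    """Blocks that fit on one SM, limited by threads, registers, and shared memory."""
    by_threads = sm_threads // threads_per_block
    by_regs = sm_regs // (regs_per_thread * threads_per_block)
    by_smem = sm_smem // smem_per_block if smem_per_block else sm_blocks
    return min(by_threads, by_regs, by_smem, sm_blocks)


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm_threads=2048):
    """Fraction of the SM's thread slots that resident blocks can fill."""
    blocks = max_resident_blocks(threads_per_block, regs_per_thread, smem_per_block)
    return blocks * threads_per_block / sm_threads


# Hypothetical "monolithic" kernel: many registers and a large shared memory buffer.
monolithic = occupancy(threads_per_block=128, regs_per_thread=128,
                       smem_per_block=48 * 1024)

# Hypothetical "microkernel": fewer registers and a smaller shared memory footprint.
micro = occupancy(threads_per_block=128, regs_per_thread=32,
                  smem_per_block=8 * 1024)

print(f"monolithic occupancy: {monolithic:.3f}")
print(f"microkernel occupancy: {micro:.3f}")
```

Under these assumed numbers the heavier kernel is capped by shared memory at two resident blocks per SM, while the lighter one fits twelve. Real occupancy should be read from a profiler rather than estimated this way.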

What performance improvements can be achieved with the new microkernel approach?

The new microkernel approach showed faster runtimes and improved GPU occupancy. Benchmarks indicated significant throughput improvements across various Parquet column types, with some optimizations yielding up to a 117% increase in throughput for specific read operations.

When should enterprises consider using GPUs for Apache Spark workloads?

Enterprises should consider using GPUs for Apache Spark workloads when processing data at large scale, particularly data stored as Parquet. The RAPIDS Accelerator lets existing applications move to GPUs without code changes and can deliver significant performance gains.
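Because the accelerator is enabled through Spark configuration rather than application code, a migration can be sketched as a set of `--conf` settings. The example below is a minimal sketch: the jar path is a hypothetical placeholder, the helper function names are invented for illustration, and a real deployment would need additional cluster-specific settings.

```python
# Sketch of Spark settings that load the RAPIDS Accelerator plugin.
# The jar path is a hypothetical placeholder; helper names are illustrative.

def rapids_spark_conf(jar_path, gpus_per_executor=1):
    """Return Spark configuration entries that enable the RAPIDS Accelerator."""
    return {
        "spark.jars": jar_path,
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
    }


def to_spark_submit_args(conf):
    """Flatten a configuration dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = rapids_spark_conf("/path/to/rapids-4-spark.jar")
args = to_spark_submit_args(conf)
print(" ".join(args))
```

The application itself is unchanged; the same settings could also be passed when building a Spark session.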

Key Statistics & Figures

Throughput improvement for list columns

117%

This improvement was observed during chunked reads of 500-KB data.
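A percentage increase is easy to misread as a multiplier, so the arithmetic is worth spelling out. In the sketch below only the 117% figure comes from the text above; the baseline throughput is a hypothetical value used to make the calculation concrete.

```python
# Convert a percentage throughput increase into a speedup factor.
# Only the 117% figure is from the source; the baseline is hypothetical.

def speedup_from_increase(percent_increase):
    """A p% increase means the new value is (1 + p/100) times the old one."""
    return 1.0 + percent_increase / 100.0


def percent_increase(baseline, improved):
    """Inverse calculation: percentage gain of improved over baseline."""
    return (improved - baseline) / baseline * 100.0


factor = speedup_from_increase(117)
baseline_gbps = 10.0                      # hypothetical baseline throughput
improved_gbps = baseline_gbps * factor

print(f"speedup factor: {factor:.2f}x")
print(f"{baseline_gbps} GB/s would become {improved_gbps:.1f} GB/s")
```

A 117% increase therefore corresponds to slightly more than doubling throughput, not a 117-fold gain.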

Technologies & Tools


Data Format

Apache Parquet

Used for efficient columnar storage and processing of large datasets.

Software

RAPIDS Accelerator for Apache Spark

Accelerates Apache Spark workloads on NVIDIA GPUs.

Library

cuDF

Provides GPU-accelerated DataFrame operations for data processing.

Key Actionable Insights

1. Utilize the RAPIDS Accelerator for Apache Spark to enhance data processing performance.
By leveraging the RAPIDS Accelerator, enterprises can accelerate their existing Apache Spark applications on GPUs without needing to modify code, which can lead to substantial performance gains.

2. Adopt the microkernel approach for processing Parquet data to improve GPU occupancy.
Implementing microkernels allows for more efficient use of GPU resources, reducing register usage and enhancing performance, particularly for large datasets.

3. Benchmark performance improvements regularly when optimizing data processing workflows.
Regular benchmarking helps identify bottlenecks and assess the impact of optimizations, ensuring that performance gains are realized and maintained over time.
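The benchmarking advice can be made concrete with a small timing harness. This is a generic sketch, not the benchmark used in the article: `fake_scan` is a hypothetical stand-in for a real Parquet read, and the byte count is an assumed value supplied by the caller.

```python
# Minimal benchmarking harness: warm up, repeat, and report median throughput.
# fake_scan is a hypothetical stand-in for a real scan; swap in the real workload.
import statistics
import time


def benchmark(fn, bytes_processed, repeats=5, warmup=1):
    """Time fn over several runs and derive throughput from the median run."""
    for _ in range(warmup):
        fn()                      # warm-up runs are discarded
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    median_s = statistics.median(timings)
    throughput = bytes_processed / median_s if median_s > 0 else float("inf")
    return {"runs": len(timings), "median_s": median_s,
            "throughput_bytes_per_s": throughput}


def fake_scan():
    """Placeholder workload standing in for a Parquet scan."""
    return sum(range(200_000))


result = benchmark(fake_scan, bytes_processed=200_000 * 8)
print(f"median: {result['median_s'] * 1e3:.3f} ms over {result['runs']} runs")
```

Using the median rather than the mean makes the result less sensitive to a single slow run, which helps when comparing before and after an optimization.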

Common Pitfalls

1. Overlooking the complexity of monolithic kernels can lead to performance issues.
Monolithic kernels can create inefficiencies due to high shared memory and register usage, which may limit GPU occupancy and overall performance. Transitioning to microkernels can mitigate these issues.

Related Concepts

GPU Acceleration In Data Processing

Microkernel Architecture

Optimizing Apache Spark Workloads

Columnar Data Storage Formats