RAPIDS Accelerator for Apache Spark Release v21.10

Karthikeyan Rajendran

This post details the latest functionality of RAPIDS Accelerator for Apache Spark.

NVIDIA

Apache, Apache Spark, Azure, Google Cloud, RAPIDS

Overview

RAPIDS Accelerator for Apache Spark v21.10 adds performance improvements and new GPU-accelerated functionality, much of it in response to community requests. The release improves I/O, nested data processing, and machine learning support, and it also updates the community resources.

What You'll Learn

1. How to use RAPIDS Accelerator for Apache Spark to improve data processing speed

2. Why nested data type support matters for machine learning workflows in Spark

3. When to use the Qualification and Profiling tools to inspect the data formats your applications rely on

Key Questions Answered

What performance improvements does RAPIDS Accelerator for Apache Spark v21.10 offer?

RAPIDS Accelerator for Apache Spark v21.10 delivers speed-ups ranging from 1.5x to 27x, depending on the compute intensity of the operations involved. The gains show up in common data preprocessing queries such as Count Distinct, Window, Intersect, and Cross Join.
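
To make the four query shapes named above concrete, here is a minimal sketch that runs each of them on two tiny made-up tables. It uses SQLite from the Python standard library only so that it runs anywhere; it is not Spark and says nothing about GPU performance. The table names, columns, and values are assumptions for illustration, and on Spark the equivalent SQL would be submitted through a Spark session instead.

```python
import sqlite3

# In-memory SQLite database with two tiny, made-up tables.
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('a', 10), ('a', 20), ('b', 5), ('c', 7);
    CREATE TABLE returns (customer TEXT);
    INSERT INTO returns VALUES ('a'), ('c'), ('d');
""")


def run(sql):
    """Execute one query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Count Distinct: number of different customers that placed orders.
count_distinct = run("SELECT COUNT(DISTINCT customer) FROM orders")

# Window: running total of order amounts within each customer.
window = run("""
    SELECT customer, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running_total
    FROM orders
    ORDER BY customer, amount
""")

# Intersect: customers that appear in both tables.
intersect = run("""
    SELECT customer FROM orders
    INTERSECT
    SELECT customer FROM returns
    ORDER BY customer
""")

# Cross join: every order paired with every return (4 x 3 rows).
cross_join = run("SELECT COUNT(*) FROM orders CROSS JOIN returns")

print(count_distinct, window, intersect, cross_join)
```

Cross joins and window aggregations grow quickly with input size, which is consistent with the statement above that the speed-up depends on how compute-intensive the operation is.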

How does the new plug-in support machine learning in Spark?

The new plug-in jar in RAPIDS Accelerator for Apache Spark v21.10 supports machine learning by enabling training for the Principal Component Analysis (PCA) algorithm. The release also extends input type support for the Parquet and ORC formats, which strengthens nested data processing.
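
For readers unfamiliar with what PCA training computes, the following is a small pure-Python sketch of finding the first principal component by power iteration on the covariance matrix. It is not the plug-in's API and not how the GPU implementation works internally; the function name, the sample points, and the choice of power iteration are all assumptions made for illustration.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, variance) of the first principal component.

    Illustrative only: uses power iteration, which assumes the starting
    vector is not orthogonal to the dominant eigenvector.
    """
    n = len(rows)
    dims = len(rows[0])

    # 1. Center every column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]

    # 2. Build the sample covariance matrix (dims x dims).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]

    # 3. Power iteration: apply the covariance matrix and renormalize.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # all-zero covariance: no direction to find
        vec = [x / norm for x in nxt]

    # 4. The Rayleigh quotient v^T C v is the variance along that direction.
    variance = sum(
        vec[i] * cov[i][j] * vec[j] for i in range(dims) for j in range(dims)
    )
    return vec, variance


# Made-up points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

Because the sample points lie on one line, the recovered direction is proportional to (1, 2) and it carries all of the variance in the data.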

What new features were added to the Qualification and Profiling tool?

The Qualification tool now reports nested data types and supports conjunction and disjunction filters. The Profiling tool now produces structured output and scales to large event logs.
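
The difference between a conjunction filter (every condition must hold) and a disjunction filter (any one condition is enough) can be sketched as follows. This is a generic illustration, not the Qualification tool's interface: the `AppRecord` fields, the `select` helper, and the sample applications are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AppRecord:
    # Hypothetical fields, for illustration only.
    name: str
    user: str
    duration_min: int


def select(apps, predicates, mode="all"):
    """Keep the apps matching the predicates.

    mode="all" is a conjunction (AND); mode="any" is a disjunction (OR).
    """
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    combine = all if mode == "all" else any
    return [app for app in apps if combine(p(app) for p in predicates)]


apps = [
    AppRecord("etl-nightly", "alice", 90),
    AppRecord("etl-adhoc", "bob", 5),
    AppRecord("report", "alice", 3),
]
predicates = [
    lambda app: app.name.startswith("etl"),
    lambda app: app.user == "alice",
]

conjunction = [app.name for app in select(apps, predicates, mode="all")]
disjunction = [app.name for app in select(apps, predicates, mode="any")]
print(conjunction)  # ['etl-nightly']
print(disjunction)  # ['etl-nightly', 'etl-adhoc', 'report']
```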

What community updates are included in this release?

The release includes updates for Azure users, inviting them to try RAPIDS Accelerator on Azure Synapse, and highlights talks presented at NVIDIA’s GTC event that showcase new functionalities and performance benchmarks.

Key Statistics & Figures

Speed-up range for common queries

1.5x to 27x

This range varies depending on the compute intensity of the operations performed during data preprocessing.

Dataset size used in benchmarks

3TB

The benchmarks were conducted on a dataset of this size to evaluate performance improvements.
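
As a worked reading of the speed-up figures above: a speed-up factor is the CPU run time divided by the GPU run time, so a factor of s removes a fraction 1 - 1/s of the original run time. The sketch below applies that arithmetic to the two ends of the reported range. The 540-second and 20-second timings are made-up inputs, not benchmark results.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Speed-up factor: how many times faster the GPU run is."""
    if cpu_seconds <= 0 or gpu_seconds <= 0:
        raise ValueError("run times must be positive")
    return cpu_seconds / gpu_seconds


def time_saved_fraction(factor):
    """Fraction of the original run time removed by a given speed-up."""
    if factor <= 0:
        raise ValueError("speed-up factor must be positive")
    return 1.0 - 1.0 / factor


# Hypothetical timings chosen to land on the top of the reported range.
example_factor = speedup(cpu_seconds=540.0, gpu_seconds=20.0)  # 27.0

low_end = time_saved_fraction(1.5)    # about 0.333, a third of the time saved
high_end = time_saved_fraction(27.0)  # about 0.963

print(example_factor, round(low_end, 3), round(high_end, 3))
```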

Technologies & Tools

Backend

RAPIDS Accelerator for Apache Spark

Used for GPU acceleration of Apache Spark workloads.

Backend

CUDA

NVIDIA's parallel computing platform, which provides the GPU programming layer that the RAPIDS Accelerator builds on.

Backend

Apache Spark

The primary framework being accelerated by RAPIDS for improved data processing.

Cloud

Google Cloud Platform

Environment where the performance benchmarks were conducted.
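
The pieces listed above come together when a Spark job is launched with the accelerator enabled. The sketch below only assembles a `spark-submit` command line as a list of arguments; it does not run Spark. The `spark.plugins`, `spark.rapids.sql.enabled`, and `spark.rapids.sql.explain` settings are the plug-in's documented configuration keys, while the `build_spark_submit` helper, the jar paths, and the job file name are placeholders assumed for illustration. Check the release documentation for the exact jar file names that match your Spark and CUDA versions.

```python
import shlex


def build_spark_submit(app_path, jars, explain="NOT_ON_GPU"):
    """Assemble a spark-submit argument list that enables the RAPIDS plug-in."""
    conf = {
        # Load the RAPIDS Accelerator SQL plug-in.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Turn GPU acceleration of SQL and DataFrame operations on.
        "spark.rapids.sql.enabled": "true",
        # Log the operations that could not be placed on the GPU.
        "spark.rapids.sql.explain": explain,
    }
    cmd = ["spark-submit", "--jars", ",".join(jars)]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


# Placeholder paths: substitute the jars shipped with the release you use.
cmd = build_spark_submit(
    "my_job.py",
    jars=["/path/to/rapids-4-spark.jar", "/path/to/cudf.jar"],
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting problems.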

Key Actionable Insights

1. Use the nested data type support in this release in your Spark data processing workflows.
   Nested types such as arrays, structs, and maps let a single column carry structured values, so data that is nested at the source can be processed in that form for machine learning and analytics tasks.

2. Use the Qualification and Profiling tools to see which data formats and nested types your Spark applications use.
   The Qualification tool reports nested data types and its conjunction and disjunction filters narrow what it reports, which helps you decide where GPU acceleration is likely to pay off.

3. Take advantage of the community resources and examples available on GitHub to speed up learning and adopting RAPIDS Accelerator.
   Community-driven examples give practical guidance and help you avoid common pitfalls when integrating GPU acceleration into your Spark workflows.

Common Pitfalls

1. Overlooking nested data types can lead to suboptimal performance in machine learning tasks.
   This release extends GPU support for nested data, so pipelines that ignore it, for example by flattening nested columns before processing, may run slower and use more resources than necessary.
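
One way to check whether a dataset uses nested types at all is to walk its schema. The sketch below does that for a schema in the JSON form that Spark produces for a DataFrame schema (a `struct` with a `fields` list, where nested types appear as objects instead of plain strings). The `nested_fields` helper and the sample schema are assumptions for illustration; this is not the Qualification tool, only a rough picture of the kind of information it reports.

```python
import json


def nested_fields(struct, prefix=""):
    """Return (path, kind) for every array, map, or struct column.

    Recurses into structs only; elements of arrays and values of maps
    are not descended into in this sketch.
    """
    found = []
    for field in struct["fields"]:
        path = prefix + field["name"]
        ftype = field["type"]
        # Primitive types are plain strings; nested types are objects.
        if isinstance(ftype, dict):
            found.append((path, ftype["type"]))
            if ftype["type"] == "struct":
                found.extend(nested_fields(ftype, path + "."))
    return found


# Made-up schema in Spark's JSON schema layout.
schema = json.loads("""
{
  "type": "struct",
  "fields": [
    {"name": "id", "type": "long", "nullable": false, "metadata": {}},
    {"name": "tags", "nullable": true, "metadata": {},
     "type": {"type": "array", "elementType": "string", "containsNull": true}},
    {"name": "address", "nullable": true, "metadata": {},
     "type": {"type": "struct", "fields": [
        {"name": "city", "type": "string", "nullable": true, "metadata": {}},
        {"name": "geo", "nullable": true, "metadata": {},
         "type": {"type": "struct", "fields": [
            {"name": "lat", "type": "double", "nullable": true, "metadata": {}}
         ]}}
     ]}}
  ]
}
""")

report = nested_fields(schema)
print(report)  # [('tags', 'array'), ('address', 'struct'), ('address.geo', 'struct')]
```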

Related Concepts

GPU Acceleration in Data Processing

Machine Learning with Apache Spark

Nested Data Structures in Data Analytics