RAPIDS Accelerator for Apache Spark v21.06 Release

Saloni Jain

RAPIDS Accelerator for Apache Spark v21.06 is here! You may notice right away that we’ve had a huge leap in version number since we announced our last release.

NVIDIA

Tags: Apache, Apache Spark, Azure, SQL

Overview

The RAPIDS Accelerator for Apache Spark v21.06 release adds support for Apache Spark 3.1.2, simplifies installation with a single RAPIDS cuDF jar that works across NVIDIA CUDA 11.x versions, and introduces a profiling tool that analyzes Spark logs to identify jobs suited to GPU acceleration. It also adds functionality for arrays and structs and extends GPU acceleration to Cloudera Data Platform and Azure Synapse users.

What You'll Learn

1

How to utilize the new profiling tool to analyze Spark logs for GPU acceleration suitability

2

How the v21.06 release simplifies RAPIDS Accelerator installation and improves performance for Apache Spark applications

3

When to leverage new functionalities for arrays and structs in data processing tasks

Prerequisites & Requirements

Basic understanding of Apache Spark and GPU acceleration concepts
Familiarity with NVIDIA CUDA and its versions (optional)

Key Questions Answered

What new features are included in RAPIDS Accelerator for Apache Spark v21.06?

The RAPIDS Accelerator for Apache Spark v21.06 includes support for Apache Spark version 3.1.2, a simplified installation process with a single RAPIDS cuDF jar compatible with all NVIDIA CUDA 11.x versions, and new functionalities for handling arrays and structs. Additionally, it features a profiling tool to analyze Spark logs for GPU acceleration suitability.

How does the new profiling tool assist in optimizing Spark jobs?

The profiling tool analyzes Spark logs to identify jobs suitable for GPU acceleration and profiles jobs running with the plug-in. It reads CPU event logs, reports the runtime spent on SQL/DataFrame operations, and helps debug jobs by listing failed jobs and executors and by comparing query durations.

What improvements have been made for Cloudera and Azure users with this release?

The v21.06 release makes ETL workloads easier to accelerate on GPUs for Cloudera Data Platform users. It also integrates with Azure Synapse, so users can run Apache Spark applications on NVIDIA GPUs without code changes.

Key Statistics & Figures

Supported Apache Spark version

3.1.2

RAPIDS Accelerator v21.06 adds support for this Apache Spark version.

CUDA versions tested

11.0 and 11.2

The RAPIDS cuDF jar is compatible with all versions of NVIDIA CUDA 11.x.

Technologies & Tools

Backend

Apache Spark

Used for data processing and analytics, now enhanced with GPU acceleration.

Tools

NVIDIA CUDA

Provides the necessary framework for GPU acceleration in data science workflows.

Key Actionable Insights

1
Leverage the new profiling tool to analyze your Spark jobs and identify which workloads can benefit from GPU acceleration.
The tool reports how much runtime a job spends on SQL/DataFrame operations, so you can direct GPU resources to the jobs most likely to benefit.

2
Utilize the simplified installation process with the new RAPIDS cuDF jar to streamline your setup for Apache Spark.
This change reduces complexity and ensures compatibility across different versions of NVIDIA CUDA, making it easier for teams to adopt GPU acceleration.

3
Explore the new functionalities for arrays and structs to enhance your data processing capabilities.
These features allow for more complex data manipulations and can improve the efficiency of data workflows that rely on nested types.
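As a rough illustration of the simplified installation in insight 2, the sketch below assembles a `spark-submit` command that loads the plug-in jar and the single cuDF jar. The jar paths, the application name `my_etl_job.py`, and the helper `build_spark_submit` are hypothetical placeholders; the `spark.plugins` and `spark.rapids.sql.*` settings are the plug-in's commonly documented options, but check the release documentation for the exact jar names and any cluster-specific GPU scheduling settings, which are omitted here.

```python
import shlex

# Hypothetical jar locations: substitute the files you actually downloaded.
RAPIDS_JAR = "/opt/sparkRapidsPlugin/rapids-4-spark.jar"
CUDF_JAR = "/opt/sparkRapidsPlugin/cudf-cuda11.jar"  # one jar for CUDA 11.x


def build_spark_submit(app_path, rapids_jar=RAPIDS_JAR, cudf_jar=CUDF_JAR):
    """Return a spark-submit argument list that enables the RAPIDS plug-in."""
    conf = {
        # Registers the RAPIDS Accelerator as a Spark plug-in.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Turns GPU SQL acceleration on (it can be toggled per session).
        "spark.rapids.sql.enabled": "true",
        # Logs the operations that could not be placed on the GPU.
        "spark.rapids.sql.explain": "NOT_ON_GPU",
    }
    args = ["spark-submit", "--jars", f"{rapids_jar},{cudf_jar}"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return args


command = build_spark_submit("my_etl_job.py")
print(shlex.join(command))
```

Building the command as a list keeps each option separate, so it can be passed straight to `subprocess.run` without shell quoting issues.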

Common Pitfalls

1

Skipping log analysis before moving jobs to the GPU can lead to suboptimal performance.

Without the profiling tool, users may fail to spot jobs that are poorly suited to GPU processing, which wastes resources and lengthens runtimes.
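To make the idea of log analysis concrete, here is a minimal sketch of the kind of measurement involved: estimating what share of an application's wall-clock time was spent in SQL/DataFrame executions, based on a Spark JSON event log. This is not the profiling tool shipped with the release. The function `sql_time_fraction` is a hypothetical helper, and the event and field names are assumptions based on Spark's JSON event log format, so verify them against your own logs.

```python
import json

# Event names assumed from Spark's JSON event log format.
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"
SQL_END = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd"


def sql_time_fraction(lines):
    """Return SQL execution time divided by application time, or None."""
    app_start = app_end = None
    sql_starts = {}  # executionId -> start time in ms
    sql_ms = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        kind = event.get("Event")
        if kind == "SparkListenerApplicationStart":
            app_start = event["Timestamp"]
        elif kind == "SparkListenerApplicationEnd":
            app_end = event["Timestamp"]
        elif kind == SQL_START:
            sql_starts[event["executionId"]] = event["time"]
        elif kind == SQL_END:
            started = sql_starts.pop(event["executionId"], None)
            if started is not None:
                # Durations are summed, so overlapping executions are
                # counted more than once.
                sql_ms += event["time"] - started
    if app_start is None or app_end is None or app_end <= app_start:
        return None
    return sql_ms / (app_end - app_start)


# Tiny synthetic log: a 1000 ms application with one 500 ms SQL execution.
sample_log = [
    json.dumps({"Event": "SparkListenerApplicationStart", "Timestamp": 1000}),
    json.dumps({"Event": SQL_START, "executionId": 0, "time": 1200}),
    json.dumps({"Event": SQL_END, "executionId": 0, "time": 1700}),
    json.dumps({"Event": "SparkListenerApplicationEnd", "Timestamp": 2000}),
]
fraction = sql_time_fraction(sample_log)
print(f"SQL share of runtime: {fraction:.0%}")
```

A high share suggests the job is a better candidate for GPU acceleration; a low share suggests most of its time is spent outside SQL/DataFrame operations.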

Related Concepts

GPU Acceleration in Data Processing

Integration of RAPIDS with Cloudera Data Platform

Enhancements in Data Manipulation with Arrays and Structs