Accelerate Apache Spark ML on NVIDIA GPUs with Zero Code Change

Erik Ordentlich

The NVIDIA RAPIDS Accelerator for Apache Spark software plug-in pioneered a zero code change user experience (UX) for GPU-accelerated data processing.

NVIDIA


Tags: Apache, Apache Spark, AWS, Pandas, PySpark, Python, SQL

Overview

The article discusses how the NVIDIA RAPIDS Accelerator for Apache Spark delivers GPU-accelerated data processing with zero code changes, and how that experience now extends to Apache Spark ML applications. It highlights the new Spark RAPIDS ML library, which can accelerate applications by over 100x, and describes the latest functionality, which lets users leave even their import statements unchanged.

What You'll Learn

1. How to accelerate Apache Spark ML applications without changing import statements
2. Why using the Spark RAPIDS ML library can improve application performance by over 100x
3. When to use the spark-rapids-submit command for accelerated execution

Prerequisites & Requirements

Basic understanding of Apache Spark and MLlib
Installation of NVIDIA RAPIDS Accelerator for Apache Spark and Spark RAPIDS ML library

Key Questions Answered

How can I achieve zero code change acceleration in Spark ML applications?

You can achieve zero code change acceleration by using the Spark RAPIDS ML library and the new spark_rapids_ml.install module, which automatically redirects imports of pyspark.ml estimators to their accelerated counterparts without requiring any changes to your code.
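A minimal sketch of what this can look like in application code, assuming the spark_rapids_ml package is installed in the driver's Python environment. The helper names `enable_gpu_ml` and `fit_kmeans` and the K-means example are illustrative and not taken from the article; only the `spark_rapids_ml.install` module and the unchanged `pyspark.ml` import come from the answer above.

```python
import importlib


def enable_gpu_ml() -> bool:
    """Import spark_rapids_ml.install if it is available (hypothetical helper).

    Per the article, importing this module redirects later imports of
    pyspark.ml estimators to their accelerated counterparts. Returns False
    when the library is not installed, in which case pyspark.ml is left as is.
    """
    try:
        importlib.import_module("spark_rapids_ml.install")
        return True
    except ImportError:
        return False


def fit_kmeans(df, k: int = 3):
    """Unchanged pyspark.ml code; needs an active SparkSession to run."""
    # This import stays exactly as it was in the original application.
    from pyspark.ml.clustering import KMeans

    return KMeans(k=k, featuresCol="features").fit(df)


# Run the install step before any pyspark.ml import takes place.
accelerated = enable_gpu_ml()
print("Import redirection active:", accelerated)
```

Calling `fit_kmeans(df)` afterwards requires a running Spark session and a DataFrame with a `features` vector column, both of which are assumed here.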

What performance improvements can I expect from using NVIDIA RAPIDS with Apache Spark?

The NVIDIA RAPIDS Accelerator for Apache Spark can speed up Spark SQL and DataFrame applications by over 9x, and the Spark RAPIDS ML library can speed up ML applications by over 100x.

What command do I use to run an accelerated PySpark application?

To run an accelerated PySpark application, replace the traditional spark-submit command with spark-rapids-submit, which accepts the same options and accelerates the MLlib parts of your application.
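The substitution described above can be sketched as a small helper that rewrites an existing command line. The function name `to_rapids_submit`, the sample options, and the script name `app.py` are hypothetical; the sketch only assumes that spark-rapids-submit takes the same arguments as spark-submit, as the answer states.

```python
import shlex
from typing import List


def to_rapids_submit(command: str) -> List[str]:
    """Swap the launcher in a spark-submit command line (hypothetical helper)."""
    args = shlex.split(command)
    if not args or args[0] != "spark-submit":
        raise ValueError("expected a command starting with 'spark-submit'")
    # Only the executable name changes; every option and the script are kept.
    return ["spark-rapids-submit", *args[1:]]


original = "spark-submit --master local[4] --conf spark.executor.memory=4g app.py"
rapids = to_rapids_submit(original)
print(shlex.join(rapids))
```

The resulting argument list can be passed to `subprocess.run(rapids, check=True)` on a machine where spark-rapids-submit is on the PATH.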

Key Statistics & Figures

Performance improvement for Spark SQL and DataFrame applications: over 9x
This applies when using the NVIDIA RAPIDS Accelerator for Apache Spark.

Performance improvement for applications using Spark RAPIDS ML: over 100x
This applies to applications leveraging the new Spark RAPIDS ML library.

Technologies & Tools

Backend

NVIDIA RAPIDS Accelerator for Apache Spark

Used to accelerate data processing in Apache Spark applications.

Backend

Spark RAPIDS ML

A Python library that accelerates machine learning in Spark applications on NVIDIA GPUs.

Key Actionable Insights

1. Leverage the Spark RAPIDS ML library to accelerate your existing Spark ML applications without modifying your codebase.
This approach delivers significant performance gains while leaving your original application code intact, which makes GPU acceleration easier to adopt.

2. Use the new spark-rapids-submit command to launch accelerated applications.
It accepts the same options as spark-submit and accelerates the MLlib parts of your application.

3. Explore the integration of Jupyter notebooks with Spark RAPIDS for interactive data analysis.
Launching Jupyter notebooks with the pyspark-rapids command lets you experiment and analyze data interactively while benefiting from GPU acceleration.
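The notebook launch in the third insight might be prepared along these lines. This is a sketch under assumptions: `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` are the standard PySpark variables for starting Jupyter as the driver, and it is assumed (not confirmed by the text above) that pyspark-rapids honors them the same way pyspark does. The helper name `jupyter_launch_plan` and the master URL are hypothetical.

```python
import os
from typing import Dict, List, Tuple


def jupyter_launch_plan(master: str) -> Tuple[Dict[str, str], List[str]]:
    """Build the environment and command for a notebook launch (hypothetical helper)."""
    env = dict(os.environ)
    # Standard PySpark variables that make the driver start Jupyter instead of
    # the plain Python shell; assumed here to apply to pyspark-rapids as well.
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env, ["pyspark-rapids", "--master", master]


env, cmd = jupyter_launch_plan("local[*]")
print(" ".join(cmd))
# To actually launch the server: subprocess.run(cmd, env=env, check=True)
```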

Common Pitfalls

1. Failing to install the necessary libraries before attempting to run accelerated applications.
Without the Spark RAPIDS ML library and the NVIDIA RAPIDS Accelerator, users may encounter errors or fail to achieve the expected performance improvements.
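One way to guard against this pitfall is a pre-flight check before launching a job. The sketch below uses only the Python standard library; the helper names are hypothetical, and the check is limited to what is visible from Python (importable packages and commands on the PATH). It does not verify that the RAPIDS Accelerator plug-in itself is configured or that a GPU is usable.

```python
import importlib.util
import shutil
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level Python packages that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_commands(names: Iterable[str]) -> List[str]:
    """Return the executables that are not on the PATH."""
    return [n for n in names if shutil.which(n) is None]


# Package and command names as they appear in the article.
problems = missing_modules(["pyspark", "spark_rapids_ml"]) + missing_commands(
    ["spark-rapids-submit"]
)
if problems:
    print("Missing prerequisites:", ", ".join(problems))
else:
    print("All checked prerequisites were found.")
```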

Related Concepts

GPU Acceleration In Data Processing

Machine Learning With Apache Spark

Performance Optimization Techniques