Boosting Data Ingest Throughput with GPUDirect Storage and RAPIDS cuDF

Vukasin Milovanovic

Learn how RAPIDS cuDF accelerates data science with the help of GPUDirect Storage. Dive into the techniques that minimize the time to upload data to the GPU.

NVIDIA


Overview

The article discusses how GPUDirect Storage (GDS) and RAPIDS cuDF can significantly enhance data ingest throughput in data analytics workflows. By enabling direct data transfers from storage to GPU memory, GDS can achieve up to 3-4 times higher read throughput, alleviating the bottleneck often faced during data preprocessing.

What You'll Learn

1. How to leverage GPUDirect Storage for enhanced data ingest performance
2. Why optimizing data ingest is crucial for data preprocessing workflows
3. When to use multithreading to maximize throughput in cuDF
4. How to configure cuDF for optimal performance with GDS

Prerequisites & Requirements

Understanding of data preprocessing workflows and GPU architecture
Familiarity with RAPIDS cuDF and GPUDirect Storage (optional)

Key Questions Answered

What is GPUDirect Storage and how does it work?

GPUDirect Storage is a technology that transfers data directly between storage and GPU memory. It avoids staging the data in a bounce buffer in CPU system memory, which reduces latency and improves throughput. Data can move from storage devices such as NVMe drives straight to the GPU, which benefits data-intensive applications.

How does RAPIDS cuDF utilize GPUDirect Storage?

RAPIDS cuDF uses the cuFile APIs to read data directly from storage into GPU memory, minimizing CPU involvement. Only the metadata is accessed through the CPU; the data itself goes straight to the GPU, leading to significant performance improvements in data ingest.

What benchmarks were used to evaluate cuDF's data ingest performance?

The benchmarks included datasets such as the New York Taxi trip records and Yelp reviews, and varied data properties such as file format, data type, run-length, cardinality, and compression type. Covering these properties makes it possible to analyze performance across diverse scenarios.

What are the optimal configurations for using cuDF with GDS?

The optimal configurations for using cuDF with GDS involve setting the thread pool size to 16 and slice size to 4 MiB. These settings help maximize throughput by balancing the number of read tasks and minimizing overhead, especially for larger datasets.
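The settings described above can be sketched as environment variables. This is a minimal sketch, assuming the variable names `LIBCUDF_CUFILE_POLICY`, `LIBCUDF_CUFILE_THREAD_COUNT`, and `LIBCUDF_CUFILE_SLICE_SIZE` from cuDF's GDS integration documentation of that era; newer cuDF releases may route these settings through KvikIO instead, so check the documentation for your version. The helper names `gds_env` and `apply_gds_env` are hypothetical.

```python
import os

MIB = 1024 * 1024

# Assumption: policy values taken from cuDF's GDS integration docs;
# verify them against the cuDF release you actually run.
_ASSUMED_POLICIES = {"OFF", "GDS", "ALWAYS"}


def gds_env(thread_count: int = 16,
            slice_size_bytes: int = 4 * MIB,
            policy: str = "GDS") -> dict:
    """Build the environment settings for the configuration in the text:
    a thread pool of 16 and a read slice size of 4 MiB."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    if slice_size_bytes < 1:
        raise ValueError("slice_size_bytes must be at least 1")
    if policy not in _ASSUMED_POLICIES:
        raise ValueError(f"unexpected policy: {policy!r}")
    return {
        "LIBCUDF_CUFILE_POLICY": policy,
        "LIBCUDF_CUFILE_THREAD_COUNT": str(thread_count),
        "LIBCUDF_CUFILE_SLICE_SIZE": str(slice_size_bytes),
    }


def apply_gds_env(settings: dict) -> None:
    """Export the settings into the current process environment."""
    os.environ.update(settings)
```

In this sketch, `apply_gds_env(gds_env())` would be called before cuDF is imported, on the assumption that the library reads these variables when it first performs I/O.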

Key Statistics & Figures

cuDF read throughput improvement

3-4x higher

This improvement is achieved when using GPUDirect Storage compared to traditional data ingest methods.

Average throughput increase across various data profiles

30-50%

This increase is observed when leveraging GPUDirect Storage for data reads.

GDS speedup for larger files

up to 270%

Speedup is noted for larger files when comparing GDS reads to traditional bounce buffer reads.

Technologies & Tools

Storage Technology

GPUDirect Storage

Enables direct data transfer between storage and GPU memory.

Data Processing Framework

RAPIDS cuDF

Accelerates data ingest and processing workflows using GPU capabilities.

Key Actionable Insights

1. To enhance data ingest performance, consider implementing GPUDirect Storage with RAPIDS cuDF in your data workflows. This integration can lead to significant throughput improvements, especially with large datasets.
   Utilizing GDS allows you to bypass CPU bottlenecks, making your data processing pipelines more efficient and responsive.

2. Experiment with different thread pool sizes and read slice sizes to find the optimal configuration for your specific workload. Adjusting these parameters can lead to better utilization of your storage hardware.
   Performance can vary greatly depending on the data properties and system architecture, so fine-tuning these settings is crucial for achieving peak performance.

3. Leverage the benchmarking suite provided in the cuDF repository to understand how different data properties affect performance. This can guide you in selecting the best formats and configurations for your use case.
   Benchmarking helps identify potential performance bottlenecks and informs decisions on data storage and processing strategies.
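The tuning advice above can be explored with a small harness. The following is a host-only sketch, not cuDF's implementation and not the benchmarking suite from the cuDF repository: it splits a file read into fixed-size slices, issues them on a thread pool, and times each combination of slice size and thread count. The names `make_test_file`, `sliced_read`, and `sweep` are hypothetical. Real GDS reads go through cuFile into GPU memory, which this stand-in does not attempt.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def make_test_file(num_bytes: int) -> str:
    """Write num_bytes of random data to a temporary file; return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(num_bytes))
        return tmp.name


def _read_slice(path: str, offset: int, length: int) -> bytes:
    # Each task opens its own handle so threads do not share a file position.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def sliced_read(path: str, slice_size: int, num_threads: int) -> bytes:
    """Read a whole file as independent slices issued on a thread pool."""
    size = os.path.getsize(path)
    offsets = range(0, size, slice_size)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in submission order, so the slices
        # concatenate back into the original byte sequence.
        parts = pool.map(
            lambda off: _read_slice(path, off, min(slice_size, size - off)),
            offsets,
        )
        return b"".join(parts)


def sweep(path: str, slice_sizes, thread_counts) -> dict:
    """Time every (slice_size, num_threads) pair; return throughput in MB/s."""
    size = os.path.getsize(path)
    results = {}
    for slice_size in slice_sizes:
        for num_threads in thread_counts:
            start = time.perf_counter()
            data = sliced_read(path, slice_size, num_threads)
            elapsed = max(time.perf_counter() - start, 1e-9)
            assert len(data) == size
            results[(slice_size, num_threads)] = size / elapsed / 1e6
    return results
```

A call such as `sweep(make_test_file(64 * 1024 * 1024), [1 << 20, 4 << 20], [4, 16])` returns a dictionary keyed by configuration. Numbers from a small file that sits in the page cache say little about storage hardware, so treat this only as a template for the parameter sweep.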

Common Pitfalls

1. Failing to optimize the level of parallelism can lead to suboptimal performance during data ingest.

   If too few threads are used, the read bandwidth may not be fully utilized, while too many threads can introduce overhead. It's essential to find a balance to maximize throughput.
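The balance described here can be made concrete with a little arithmetic. The sketch below assumes a simplified model in which a single read is split into slice-sized tasks that are handed to a fixed thread pool; `read_task_stats` is a hypothetical helper, and the defaults mirror the 16-thread, 4 MiB configuration mentioned earlier in this summary.

```python
MIB = 1024 * 1024


def read_task_stats(read_size: int,
                    slice_size: int = 4 * MIB,
                    threads: int = 16) -> tuple:
    """Return (tasks, busy_threads, waves) for one read under the model.

    tasks        : number of slice-sized read tasks (ceiling division)
    busy_threads : threads that receive at least one task
    waves        : rounds needed if every thread takes one task per round
    """
    if read_size < 0 or slice_size < 1 or threads < 1:
        raise ValueError("sizes and thread count must be positive")
    tasks = -(-read_size // slice_size)   # ceiling division on integers
    busy_threads = min(tasks, threads)
    waves = -(-tasks // threads)
    return tasks, busy_threads, waves


# A 16 MiB read yields only 4 tasks, so 12 of the 16 threads stay idle.
print(read_task_stats(16 * MIB))        # (4, 4, 1)
# A 1 GiB read yields 256 tasks, enough to keep all 16 threads busy.
print(read_task_stats(1024 * MIB))      # (256, 16, 16)
```

Under this model, small reads cannot occupy the full pool no matter how many threads are configured, which is one way to see why tuning matters more for larger files.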

Related Concepts

Data Preprocessing Workflows

GPU Architecture And Performance

Benchmarking Techniques For Data Processing