New Features in CUDA 7.5

Mark Harris

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger…

NVIDIA


C++, Natural Language Processing

Overview

The article discusses the new features introduced in CUDA Toolkit 7.5, including support for 16-bit floating point (FP16) data, new cuSPARSE routines for natural language processing, instruction-level profiling enhancements, and experimental GPU lambdas. These updates aim to improve performance and usability for developers working with NVIDIA GPUs.

What You'll Learn

1. How to utilize FP16 data types for larger datasets in CUDA applications
2. Why instruction-level profiling is essential for optimizing GPU performance
3. How to implement GPU lambdas for in-line parallel computations

Prerequisites & Requirements

Understanding of CUDA programming and GPU architecture
Familiarity with the CUDA Toolkit and NVIDIA Visual Profiler (optional)

Key Questions Answered

What are the main new features in CUDA 7.5?

CUDA 7.5 introduces several key features including support for 16-bit floating point (FP16) data, new cuSPARSE routines for matrix-vector operations, instruction-level profiling for performance optimization, and experimental GPU lambdas for parallel programming. These enhancements aim to improve performance and usability for developers.

How does instruction-level profiling improve performance optimization?

Instruction-level profiling in CUDA 7.5 allows developers to identify specific lines of code that are causing performance bottlenecks. By pinpointing hotspots in the code, developers can focus their optimization efforts where they will have the greatest impact, improving overall application performance.
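The idea of attributing cost to individual source lines can be illustrated with a toy model. The sketch below assumes a profiler has already collected samples and mapped each one back to a source line; the `PcSample` type and both function names are hypothetical and are not part of any NVIDIA tool. It only shows how per-line sample counts point to a hotspot, not how the Visual Profiler is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical record: one profiler sample, already mapped to a source line.
struct PcSample {
    int source_line;
};

// Returns the source line that received the most samples.
// Ties go to the lowest line number. Requires at least one sample.
int hottest_line(const std::vector<PcSample>& samples) {
    assert(!samples.empty());
    std::map<int, std::size_t> counts;  // source line -> number of samples
    for (const PcSample& s : samples) {
        ++counts[s.source_line];
    }
    int best_line = counts.begin()->first;
    std::size_t best_count = counts.begin()->second;
    for (const auto& entry : counts) {
        if (entry.second > best_count) {
            best_line = entry.first;
            best_count = entry.second;
        }
    }
    return best_line;
}

// Fraction of all samples that landed on the given source line.
double sample_share(const std::vector<PcSample>& samples, int line) {
    assert(!samples.empty());
    std::size_t hits = 0;
    for (const PcSample& s : samples) {
        if (s.source_line == line) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(samples.size());
}
```

A line that collects a large share of the samples is where optimization effort is most likely to pay off.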

What benefits do GPU lambdas provide in CUDA 7.5?

GPU lambdas, an experimental feature in CUDA 7.5, let developers write concise, in-line parallel computations directly in their CUDA code. The main benefit is convenience: a parallel operation can be defined at its point of use, without writing a separate device function.
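The in-line style can be sketched on the host with standard C++. The example below is a CPU-only analog, not GPU code: the function name `saxpy_inline` is hypothetical, and `std::transform` stands in for a parallel algorithm. In CUDA 7.5 the device version of this pattern marks the lambda with `__device__` and is compiled by passing `--expt-extended-lambda` to nvcc.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Computes y[i] = a * x[i] + y[i] for every element.
// The per-element operation is written in place as a lambda that
// captures the scalar `a` by value, so no separate function is needed.
void saxpy_inline(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

Capturing by value matters in the GPU setting too, since a device lambda cannot safely refer to host stack variables by reference.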

What improvements are made in cuSPARSE routines for natural language processing?

The new cuSPARSE routines, cusparse{S,D,C,Z}gemvi(), multiply a dense matrix by a sparse vector, an operation that is particularly useful in machine learning and natural language processing applications. Because only the non-zero entries of the vector contribute to the product, the work scales with the number of non-zeros rather than with the full vector length.
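What a dense-matrix by sparse-vector product computes can be shown with a small CPU reference. This is a sketch of the mathematical operation y = alpha * A * x + beta * y only; it does not call cuSPARSE. The `SparseVector` type and `dense_times_sparse` function are hypothetical, and row-major storage with zero-based indices is assumed here purely for readability. The cuSPARSE documentation defines the library's actual conventions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sparse vector: non-zero values and their positions.
struct SparseVector {
    std::vector<float> values;
    std::vector<std::size_t> indices;  // zero-based position of each value
};

// CPU reference for y = alpha * A * x + beta * y, where A is a dense
// rows x cols matrix stored row-major and x is sparse.
void dense_times_sparse(std::size_t rows, std::size_t cols, float alpha,
                        const std::vector<float>& a, const SparseVector& x,
                        float beta, std::vector<float>& y) {
    assert(a.size() == rows * cols);
    assert(y.size() == rows);
    assert(x.values.size() == x.indices.size());
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        // Only the non-zero entries of x are visited.
        for (std::size_t k = 0; k < x.values.size(); ++k) {
            assert(x.indices[k] < cols);
            acc += a[r * cols + x.indices[k]] * x.values[k];
        }
        y[r] = alpha * acc + beta * y[r];
    }
}
```

The inner loop runs over the non-zeros of x, which is why the sparse form pays off when most entries of the vector are zero, as with bag-of-words style inputs.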

Key Statistics & Figures

Potential speedup from using FP16 data

up to 2x

Applications bottlenecked by memory bandwidth may achieve this speedup when using FP16 data types.

Memory capacity increase with FP16

2x larger models

By storing data as FP16 instead of FP32, applications can fit models up to twice as large in the same amount of GPU memory.

Kernel performance improvement from profiling

2.5x

Developers were able to achieve this speedup by optimizing kernels based on insights gained from instruction-level profiling.
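The 2x figures for FP16 above follow from simple arithmetic: an FP16 value occupies 2 bytes where an FP32 value occupies 4. The sketch below works that out for an arbitrary memory budget. Both function names are hypothetical, and the calculation ignores any other data an application keeps in GPU memory.

```cpp
#include <cassert>
#include <cstdint>

// Number of elements of the given size that fit in a memory budget.
std::uint64_t elements_that_fit(std::uint64_t budget_bytes,
                                std::uint64_t bytes_per_element) {
    assert(bytes_per_element > 0);
    return budget_bytes / bytes_per_element;
}

// Bytes needed to store a given number of elements.
std::uint64_t footprint_bytes(std::uint64_t element_count,
                              std::uint64_t bytes_per_element) {
    return element_count * bytes_per_element;
}
```

For example, with an assumed budget of 4 GiB, `elements_that_fit(budget, 2)` is exactly twice `elements_that_fit(budget, 4)`. The same halving of bytes per value is why a kernel limited by memory bandwidth can approach a 2x speedup.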

Technologies & Tools

Framework

CUDA

Used for parallel computing and GPU programming.

Library

cuSPARSE

Provides routines for sparse matrix operations.

Tool

NVIDIA Visual Profiler

Used for profiling and optimizing CUDA applications.

Key Actionable Insights

1. Leverage FP16 data types to maximize GPU memory usage and performance in your applications.
By storing data as FP16, developers can fit models up to twice as large in GPU memory. Applications constrained by memory bandwidth may also run up to 2x faster, because each value moves half as many bytes.

2. Implement instruction-level profiling to identify and optimize performance bottlenecks in your CUDA applications.
Using the enhanced profiling tools in CUDA 7.5, developers can pinpoint the specific lines of code that slow down performance, allowing for targeted optimizations that can significantly improve execution speed.

3. Experiment with GPU lambdas to simplify parallel programming in CUDA.
GPU lambdas allow for more readable and maintainable code by letting developers define parallel operations inline, making it easier to implement complex algorithms without extensive boilerplate code.
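For the FP16 insight above, it helps to see what storing a value in 16 bits costs in precision. The sketch below is a simplified host-side conversion between a 32-bit float and a 16-bit half-precision bit pattern. It is an illustration under stated simplifications: it truncates instead of rounding, flushes values too small for a normal half to zero, and turns out-of-range values (including NaN) into infinity. The function names are hypothetical. In CUDA code, the `cuda_fp16.h` header provides the `__half` type and conversion intrinsics such as `__float2half()` and `__half2float()`, which should be preferred.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit IEEE 754 float");

// Simplified float -> half bit pattern (truncating, no subnormals).
std::uint16_t float_to_half_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    // Re-bias the exponent from float (bias 127) to half (bias 15).
    const std::int32_t exponent =
        static_cast<std::int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    const std::uint32_t mantissa = bits & 0x007FFFFFu;
    if (exponent >= 31) {  // too large for half (or Inf/NaN): infinity
        return static_cast<std::uint16_t>(sign | 0x7C00u);
    }
    if (exponent <= 0) {   // too small for a normal half: signed zero
        return static_cast<std::uint16_t>(sign);
    }
    // Keep the top 10 of the 23 mantissa bits (truncation).
    return static_cast<std::uint16_t>(
        sign | (static_cast<std::uint32_t>(exponent) << 10) | (mantissa >> 13));
}

// Half bit pattern -> float (subnormal halves are treated as zero).
float half_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = (static_cast<std::uint32_t>(h) & 0x8000u) << 16;
    const std::uint32_t exponent = (h >> 10) & 0x1Fu;
    const std::uint32_t mantissa = h & 0x03FFu;
    std::uint32_t bits;
    if (exponent == 0) {
        bits = sign;                                        // zero
    } else if (exponent == 31) {
        bits = sign | 0x7F800000u | (mantissa << 13);       // Inf or NaN
    } else {
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Values such as 1.0, 0.5, and -2.0 survive the round trip unchanged, while 0.1 comes back slightly smaller because only 10 mantissa bits are kept. That trade of precision for half the storage is what makes FP16 attractive for data that tolerates it.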

Common Pitfalls

1. Neglecting to profile your CUDA applications can lead to missed optimization opportunities.
Without profiling, developers may not be aware of the specific lines of code that are causing performance issues, leading to inefficient optimizations that do not address the root problems.

Related Concepts

FP16 Data Types

cuSPARSE Routines

Instruction-level Profiling

GPU Programming Techniques