Optimizing Recurrent Neural Networks in cuDNN 5

Jeremy Appleyard

This week at GTC 2016, we announced the latest update to NVIDIA Deep Learning SDK, which now includes cuDNN 5. Version 5 offers new features…

NVIDIA

Deep Learning, GRU, LSTM, Neural Networks, Recurrent Neural Networks, Transformer

Overview

The article discusses the optimizations made in cuDNN 5 for Recurrent Neural Networks (RNNs), focusing on performance improvements and new features that enhance the efficiency of sequence learning tasks. Key optimizations include combining GEMM operations, streaming GEMMs, and fusing point-wise operations, resulting in significant speedups for LSTM networks.

What You'll Learn

1. How to optimize LSTM networks using cuDNN 5
2. Why combining GEMM operations can improve GPU performance
3. When to use CUDA streams for concurrent execution
4. How to fuse point-wise operations to reduce overhead

Prerequisites & Requirements

Understanding of Recurrent Neural Networks and LSTM concepts
Familiarity with CUDA programming and NVIDIA GPUs (optional)

Key Questions Answered

What optimizations are introduced in cuDNN 5 for RNNs?

cuDNN 5 introduces several optimizations for RNNs, including faster convolutions using the Winograd algorithm, support for LSTM networks, and improved performance with FP16 routines on Pascal GPUs. These enhancements yield significant speedups, with LSTM networks running up to 6x faster.

How does combining GEMM operations affect performance?

Combining GEMM operations allows for larger matrix multiplications, increasing parallelism and improving GPU utilization. This optimization can lead to a speedup of approximately 2x compared to separate GEMM operations, as it reduces the number of calls and maximizes the use of available CUDA threads.
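The equivalence behind this optimization can be checked with a small sketch. It assumes the standard LSTM formulation, in which four gate weight matrices multiply the same input; the toy sizes, the `matmul` helper, and all variable names are illustrative and not taken from cuDNN. Stacking the four weight matrices row-wise turns four small GEMMs into one larger GEMM with the same result.

```python
import random


def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


random.seed(0)
hidden, batch = 4, 3  # toy sizes, chosen only for illustration

# Four gate weight matrices (hidden x hidden) and one shared input (hidden x batch).
gate_weights = [[[random.uniform(-1, 1) for _ in range(hidden)]
                 for _ in range(hidden)] for _ in range(4)]
x = [[random.uniform(-1, 1) for _ in range(batch)] for _ in range(hidden)]

# Separate: four small GEMMs, one per gate.
separate = [matmul(w, x) for w in gate_weights]

# Combined: stack the weights into a (4*hidden x hidden) matrix, one larger GEMM.
stacked = [row for w in gate_weights for row in w]
combined = matmul(stacked, x)

# Slice the combined output back into per-gate blocks and compare.
split = [combined[g * hidden:(g + 1) * hidden] for g in range(4)]
max_diff = max(abs(split[g][i][j] - separate[g][i][j])
               for g in range(4) for i in range(hidden) for j in range(batch))
print("largest difference between the two approaches:", max_diff)
```

The arithmetic is identical in both cases; what changes on a GPU is that the single call produces a larger output matrix, which gives the GEMM more tiles to distribute across the hardware.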

What is the impact of fusing point-wise operations in LSTM implementations?

Fusing point-wise operations into a single kernel reduces data transfers to and from global memory and minimizes kernel launch overhead. This optimization significantly decreases the runtime spent on these operations, contributing to an overall performance improvement of the LSTM implementation.
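A conceptual sketch of the idea follows, using the standard LSTM cell equations. It is plain Python rather than a GPU kernel, so it shows only the structure of the change: each list comprehension in the unfused version stands in for a separate kernel launch that writes an intermediate array to memory, while the fused version makes one pass and keeps intermediates in local variables. The function names and sample values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_pointwise_unfused(a_i, a_f, a_o, a_g, c_prev):
    """One pass per operation; every intermediate is a full array."""
    i = [sigmoid(v) for v in a_i]
    f = [sigmoid(v) for v in a_f]
    o = [sigmoid(v) for v in a_o]
    g = [math.tanh(v) for v in a_g]
    fc = [fv * cv for fv, cv in zip(f, c_prev)]
    ig = [iv * gv for iv, gv in zip(i, g)]
    c = [p + q for p, q in zip(fc, ig)]
    tc = [math.tanh(v) for v in c]
    h = [ov * tv for ov, tv in zip(o, tc)]
    return c, h


def lstm_pointwise_fused(a_i, a_f, a_o, a_g, c_prev):
    """Single pass; intermediates never leave local variables."""
    c, h = [], []
    for ai, af, ao, ag, cp in zip(a_i, a_f, a_o, a_g, c_prev):
        c_new = sigmoid(af) * cp + sigmoid(ai) * math.tanh(ag)
        c.append(c_new)
        h.append(sigmoid(ao) * math.tanh(c_new))
    return c, h


# Gate pre-activations (outputs of the GEMMs) and previous cell state; sample values only.
a_i, a_f, a_o, a_g = [0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [-2.0, 0.3, 1.0], [0.1, -0.7, 0.9]
c_prev = [0.2, -0.4, 1.1]

c_ref, h_ref = lstm_pointwise_unfused(a_i, a_f, a_o, a_g, c_prev)
c_fused, h_fused = lstm_pointwise_fused(a_i, a_f, a_o, a_g, c_prev)
print(all(abs(p - q) < 1e-12 for p, q in zip(c_ref + h_ref, c_fused + h_fused)))
```

The unfused version makes nine passes over the data and materializes seven intermediate arrays; the fused version makes one pass and writes only the new cell state and hidden state.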

When should CUDA streams be utilized in RNN computations?

CUDA streams should be used when there are independent GEMM operations that can be computed concurrently. Issuing two such GEMMs in separate streams can double the number of thread blocks running at once, which makes better use of the GPU's resources and raises throughput.
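A back-of-envelope estimate shows why doubling the block count matters. The sketch assumes one thread block per output tile, a hypothetical 128 x 64 tile shape, a hidden size of 512, and a minibatch of 64; none of these values comes from the summary above. The SM count of 24 is that of the Tesla M40.

```python
import math


def gemm_thread_blocks(m, n, tile_m, tile_n):
    """Output tiles (assumed one thread block each) for an m x n GEMM result."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


NUM_SMS = 24               # Streaming Multiprocessors on a Tesla M40
TILE_M, TILE_N = 128, 64   # assumed tile shape, for illustration only
HIDDEN, MINIBATCH = 512, 64  # assumed problem size, for illustration only

# One combined gate GEMM produces a (4 * HIDDEN) x MINIBATCH output.
blocks_one_gemm = gemm_thread_blocks(4 * HIDDEN, MINIBATCH, TILE_M, TILE_N)

# Two independent GEMMs of that shape issued in separate streams.
blocks_two_streams = 2 * blocks_one_gemm

print(f"one GEMM:    {blocks_one_gemm} blocks for {NUM_SMS} SMs")
print(f"two streams: {blocks_two_streams} blocks for {NUM_SMS} SMs")
```

Under these assumptions a single GEMM yields 16 blocks, fewer than the 24 SMs available, while two concurrent GEMMs yield 32, enough to give every SM at least one block.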

Key Statistics & Figures

Baseline performance (GFLOPS)

349

This is the performance achieved by the initial LSTM implementation on a Tesla M40 GPU.

Performance after combining GEMMs (GFLOPS)

724

This indicates a 2.1x speedup compared to the baseline performance.

Performance after fusing point-wise operations (GFLOPS)

1942

This shows a 5.5x speedup from the baseline, demonstrating the effectiveness of this optimization.

Performance with four layers (GFLOPS)

3898

This performance level represents an 11.1x speedup compared to the baseline implementation.
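The quoted speedups can be checked against the GFLOPS figures with a few lines of arithmetic. The dictionary keys are shorthand labels for the stages listed above.

```python
baseline_gflops = 349.0  # initial LSTM implementation on a Tesla M40

stage_gflops = {
    "Combined GEMMs": 724.0,
    "Fused point-wise operations": 1942.0,
    "Four layers": 3898.0,
}

# Speedup of each stage relative to the baseline.
speedups = {name: gflops / baseline_gflops for name, gflops in stage_gflops.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios come out at about 2.07x, 5.56x, and 11.17x, each within 0.1 of the quoted 2.1x, 5.5x, and 11.1x. The small differences presumably arise because the published GFLOPS values are themselves rounded.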

Technologies & Tools

Library

cuDNN 5

Used for optimizing the performance of Recurrent Neural Networks.

Programming Model

CUDA

Utilized for parallel computing to enhance performance of matrix operations.

Hardware

NVIDIA Tesla M40

The GPU used for benchmarking the LSTM implementation.

Key Actionable Insights

1. To maximize the performance of LSTM networks, combine GEMM operations that share an input so that several small matrix multiplications become one larger one. The total arithmetic is unchanged, but the larger operation needs fewer calls and exposes more parallelism to the GPU.
This optimization matters most when the individual GEMMs are too small to keep the GPU busy on their own, where it can lead to significant speed improvements.

2. Use CUDA streams to execute independent operations concurrently. This increases the number of active thread blocks, leading to better performance and throughput in your RNN implementations.
Running work in multiple streams keeps the GPU busy when a single operation cannot fill it on its own.

3. Fusing point-wise operations into a single kernel can drastically reduce overhead and improve runtime efficiency. This technique minimizes the number of kernel launches and global memory transfers, which are common bottlenecks in GPU computations.
This optimization is especially valuable in RNN architectures where point-wise operations account for a significant portion of the execution time.

Common Pitfalls

1. One common pitfall is underutilizing the GPU by launching too few concurrent thread blocks during GEMM operations, which leaves Streaming Multiprocessors (SMs) idle and wastes resources.

To avoid this, ensure that the number of thread blocks at least matches, and ideally exceeds, the number of SMs on the GPU, improving occupancy and throughput.

2. Another issue is failing to fuse point-wise operations, which causes excessive kernel launch overhead and repeated transfers to and from global memory. This can significantly slow down the overall computation.

Fusing these operations into a single kernel minimizes the number of launches and improves performance, especially in RNN implementations where point-wise operations are frequent.

Related Concepts

Recurrent Neural Networks

Long Short-Term Memory (LSTM)

CUDA Programming

Performance Optimization Techniques