NVIDIA Deep Learning SDK Update for Volta Now Available

At GTC 2017, NVIDIA announced Volta-optimized updates to the NVIDIA Deep Learning SDK. Today, we’re making these updates available as free downloads to members…

Brad Nemire
2 min read · Advanced

Overview

NVIDIA has released updates to the Deep Learning SDK that are optimized for the Volta architecture and improve the performance of deep learning frameworks. The updates are free to NVIDIA Developer Program members and include cuDNN 7 and NCCL 2, which speed up training and improve multi-node scaling efficiency.

What You'll Learn

1. How to utilize cuDNN 7 for faster training of deep learning models
2. Why NCCL 2 is essential for optimizing multi-node deep learning training
3. How to implement mixed-precision training using Tensor Cores on Volta GPUs

Prerequisites & Requirements

  • Understanding of deep learning frameworks and GPU acceleration
  • Access to NVIDIA GPUs and the Deep Learning SDK

Key Questions Answered

What performance improvements does cuDNN 7 provide for deep learning models?
cuDNN 7 enables up to 2.5x faster training of ResNet50 and 3x faster training of NMT language translation LSTM RNNs on Tesla V100 compared to Tesla P100. The gains come from optimized convolution operations and mixed-precision training capabilities.

How does NCCL 2 enhance multi-node scaling for deep learning?
NCCL 2 achieves over 90% multi-node scaling efficiency using up to 8 GPU-accelerated servers. It automatically detects the optimal communication path and is optimized for high bandwidth over PCIe and NVLink, which makes it well suited to distributed training.

What are the key features of Tensor Cores in the Volta architecture?
Tensor Cores in the Volta architecture accelerate convolutions using mixed-precision operations, which speeds up the training of deep learning models.
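To make "mixed-precision operations" concrete, here is a minimal pure-Python sketch. It assumes the usual meaning of the term (FP16 inputs, with products accumulated at FP32) and emulates the rounding with the standard `struct` module. The helper names `to_fp16`, `to_fp32`, `dot_fp16_only`, and `dot_mixed` are hypothetical; this is not cuDNN or Tensor Core code, only an illustration of why the accumulator's precision matters.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_only(a, b):
    """Dot product with FP16 inputs AND an FP16 running sum."""
    total = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        total = to_fp16(total + product)
    return total


def dot_mixed(a, b):
    """Dot product with FP16 inputs but an FP32 running sum (mixed precision)."""
    total = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)  # product of two FP16 values fits in FP32
        total = to_fp32(total + product)
    return total


# Illustrative input: 4096 ones, so the exact dot product is 4096.
a = [1.0] * 4096
b = [1.0] * 4096

fp16_result = dot_fp16_only(a, b)   # stalls once the sum reaches 2048
mixed_result = dot_mixed(a, b)      # reaches the exact answer
print(fp16_result, mixed_result)
```

FP16 has an 11-bit significand, so once the running sum reaches 2048 adding 1 no longer changes it. Keeping the accumulator in FP32 avoids that loss while the inputs stay in the compact FP16 format.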

Key Statistics & Figures

  • Training speed improvement for ResNet50: up to 2.5x faster (compared to training on Tesla P100 GPUs)
  • Training speed improvement for NMT language translation LSTM RNNs: 3x faster (on Tesla V100 vs. Tesla P100)
  • Multi-node scaling efficiency: over 90% (using up to 8 GPU-accelerated servers)
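The figures above can be turned into quick arithmetic. The sketch below assumes the common definition of scaling efficiency (measured throughput divided by the ideal of N times the single-server throughput); the source does not spell out how the 90% figure was computed. The throughput and training-time numbers are made up for illustration, and the function names are hypothetical.

```python
def scaling_efficiency(single_throughput: float, n_servers: int,
                       measured_throughput: float) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    ideal = single_throughput * n_servers
    return measured_throughput / ideal


def time_after_speedup(baseline_hours: float, speedup: float) -> float:
    """Training time once a given speedup factor is applied."""
    return baseline_hours / speedup


# Hypothetical measurements: 1000 images/s on one server, 7300 images/s on 8.
eff = scaling_efficiency(1000.0, 8, 7300.0)

# Hypothetical 10-hour job with the "up to 2.5x" ResNet50 figure applied.
hours = time_after_speedup(10.0, 2.5)

print(f"scaling efficiency: {eff:.1%}")
print(f"training time: {hours:.1f} h")
```

With these illustrative inputs the efficiency works out to 7300 / 8000, which is above the 90% mark quoted for NCCL 2, and the 10-hour job drops to 4 hours.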

Technologies & Tools

  • cuDNN 7 (library): used for accelerating deep learning training
  • NCCL 2 (library): optimizes multi-node communication for deep learning
  • Tensor Cores (hardware): accelerate mixed-precision operations on Volta GPUs

Key Actionable Insights

1. Leverage cuDNN 7 to enhance the training speed of your deep learning models, especially if you are using ResNet50 or LSTM RNNs. By adopting cuDNN 7, you can drastically reduce training times, which is crucial for iterative model development and experimentation.
2. Utilize NCCL 2 for optimizing multi-node training setups to maximize resource efficiency and performance. NCCL 2's automatic topology detection and high bandwidth capabilities can significantly improve the scalability of your distributed deep learning applications.
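For readers unfamiliar with what a communication library does during distributed training, the following is a minimal single-process simulation of a sum all-reduce, the collective typically used to combine gradients across workers. It uses a generic textbook ring schedule; the source does not describe which algorithm NCCL 2 selects, so this should not be read as NCCL's implementation. The function name `ring_allreduce` and the sample data are hypothetical.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over equal-length lists, one list per rank."""
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "buffer length must divide evenly into n chunks"
    chunk = length // n
    data = [list(b) for b in buffers]  # work on copies

    def seg(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][seg((r - s) % n)]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][seg(c)] = [x + y for x, y in zip(data[dst][seg(c)], payload)]

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][seg((r + 1 - s) % n)])
                 for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][seg(c)] = payload

    return data


# Four simulated ranks, each holding an 8-element "gradient".
grads = [[rank * 10 + i for i in range(8)] for rank in range(4)]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Every rank ends with the same element-wise sum, and each rank only ever exchanges data with its ring neighbour, which is why the available link bandwidth (PCIe or NVLink in the text above) matters for scaling.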

Common Pitfalls

1. Neglecting to optimize your deep learning models for the Volta architecture can lead to suboptimal performance. Without utilizing the specific features of cuDNN 7 and NCCL 2, you may miss out on significant speed improvements and efficiency gains in training.

Related Concepts

Deep Learning
GPU Acceleration
Mixed-precision Training
Distributed Systems