NVIDIA Announces New Software and Updates to CUDA, Deep Learning SDK and More

Brad Nemire

At the GPU Technology Conference, NVIDIA announced new updates and software available to download for members of the NVIDIA Developer Program. CUDA 9.2…

Deep Learning, Kong, Kubernetes, MATLAB, Neural Networks, PyTorch, Recurrent Neural Networks, TensorFlow

Overview

NVIDIA announced significant updates to its software suite, including the CUDA Toolkit, the NVIDIA Deep Learning SDK, and TensorRT, aimed at improving performance for deep learning and AI applications. Key features include optimizations for RNNs and CNNs, faster multi-GPU training, and improved inference capabilities across various frameworks.

What You'll Learn

1

How to utilize CUDA 9.2 for optimizing deep learning models

2

Why cuDNN 7 enhances training performance on Volta architecture

3

How to implement TensorRT for accelerating inference applications

4

When to use NCCL for multi-GPU training in deep learning frameworks

Prerequisites & Requirements

Understanding of deep learning frameworks and GPU architectures
Familiarity with CUDA and deep learning SDKs (optional)

Key Questions Answered

What are the key features of CUDA 9.2?

CUDA 9.2 includes library updates, a new library for custom linear-algebra algorithms, and optimizations that reduce kernel launch latency. It also speeds up RNNs and CNNs through cuBLAS optimizations and improves FFT performance with Bluestein kernels in cuFFT.
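To make the reference to Bluestein kernels more concrete, here is a minimal pure-Python sketch of Bluestein's algorithm, which rewrites a DFT of arbitrary length (including prime lengths) as a convolution that can be evaluated with power-of-two FFTs. This is an illustration of the general technique only, not cuFFT's implementation; the function names `fft_pow2`, `bluestein_dft`, and `naive_dft` are hypothetical and chosen for this example.

```python
import cmath


def fft_pow2(x, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_pow2(x[0::2], invert)
    odd = fft_pow2(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def bluestein_dft(x):
    """DFT of arbitrary length n, computed via Bluestein's chirp trick."""
    n = len(x)
    # Chirp w[k] = exp(-i*pi*k^2/n); k^2 is reduced mod 2n for accuracy.
    w = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    # Pad to a power of two m >= 2n - 1 so the circular convolution
    # matches the linear one on the first n outputs.
    m = 1
    while m < 2 * n - 1:
        m *= 2
    a = [x[k] * w[k] for k in range(n)] + [0j] * (m - n)
    b = [0j] * m
    b[0] = w[0].conjugate()
    for k in range(1, n):
        # The conjugate chirp is symmetric, so negative lags wrap around.
        b[k] = b[m - k] = w[k].conjugate()
    fa = fft_pow2(a)
    fb = fft_pow2(b)
    conv = fft_pow2([p * q for p, q in zip(fa, fb)], invert=True)
    # Divide by m to normalize the inverse FFT, then apply the chirp again.
    return [w[k] * conv[k] / m for k in range(n)]


def naive_dft(x):
    """O(n^2) reference DFT used only to check the sketch."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


# Length 7 is prime, so a plain radix-2 FFT does not apply directly.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
max_err = max(abs(p - q) for p, q in zip(bluestein_dft(signal), naive_dft(signal)))
```

The padding to a power-of-two length is what lets an awkward transform size reuse fast power-of-two kernels, at the cost of three larger FFTs.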

How does cuDNN 7 improve deep learning training performance?

cuDNN 7 allows deep learning frameworks to leverage the Volta architecture, providing up to 3x faster training performance compared to Pascal GPUs. Key features include an RNN search API for optimal implementation selection and support for grouped convolutions.
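As a rough illustration of why grouped convolutions matter, the sketch below counts the weights in a convolution layer with and without groups. It uses the standard definition of a grouped convolution (each group of output channels sees only its own slice of the input channels) and does not call cuDNN; the helper name `conv_weight_count` and the layer sizes are assumptions made for this example.

```python
def conv_weight_count(c_in, c_out, kernel_h, kernel_w, groups=1):
    """Number of weights (excluding bias) in a 2D convolution layer.

    With `groups` > 1, each output channel connects to only
    c_in / groups input channels instead of all of them.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    return (c_in // groups) * c_out * kernel_h * kernel_w


# Hypothetical layer: 256 -> 256 channels with a 3x3 kernel.
standard = conv_weight_count(256, 256, 3, 3)             # 589824 weights
grouped = conv_weight_count(256, 256, 3, 3, groups=32)   # 18432 weights
reduction = standard // grouped                           # 32x fewer weights
```

The weight count, and with it the multiply-accumulate work per output position, shrinks by the number of groups.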

What benefits does TensorRT 4 offer for inference applications?

TensorRT 4 accelerates inference applications with up to 45x higher throughput compared to CPU and 50x faster performance on V100 GPUs for ONNX models. It also supports NVIDIA DRIVE Xavier for autonomous vehicles and provides APIs for Volta Tensor Cores.

When will NCCL 2.2 be available and what does it offer?

NCCL 2.2 will be available in May. It speeds up multi-GPU training of deep neural networks by accelerating the inter-GPU reduction operations used when training models such as ResNet50, which matters when scaling deep learning applications across GPUs.
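The inter-GPU reduction mentioned here is typically an all-reduce: every GPU contributes a gradient buffer and every GPU ends up with the element-wise sum. The following pure-Python simulation sketches one commonly described scheme, ring all-reduce, using plain lists in place of GPU buffers. It makes no NCCL calls and does not claim to reflect the algorithms NCCL 2.2 actually selects; `ring_allreduce` and the sample data are hypothetical.

```python
def ring_allreduce(rank_buffers):
    """Simulate ring all-reduce (sum) over equally sized per-rank buffers.

    rank_buffers[r] is the list held by rank r. Returns the buffers after
    the operation; every rank then holds the element-wise sum.
    """
    n = len(rank_buffers)
    length = len(rank_buffers[0])
    if length % n:
        raise ValueError("buffer length must be divisible by the rank count")
    size = length // n
    bufs = [list(b) for b in rank_buffers]

    def chunk(idx):
        idx %= n
        return slice(idx * size, (idx + 1) * size)

    # Phase 1, reduce-scatter: in each step every rank sends one chunk to
    # its right neighbor, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = [bufs[r][chunk(r - step)] for r in range(n)]  # snapshot
        for r in range(n):
            dst = (r + 1) % n
            sl = chunk(r - step)
            bufs[dst][sl] = [a + b for a, b in zip(bufs[dst][sl], outgoing[r])]

    # Phase 2, all-gather: each rank now owns one fully reduced chunk and
    # passes completed chunks around the ring until everyone has them all.
    for step in range(n - 1):
        outgoing = [bufs[r][chunk(r + 1 - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            bufs[dst][chunk(r + 1 - step)] = outgoing[r]

    return bufs


# Four simulated ranks, each holding an 8-element "gradient".
grads = [[10 * r + i for i in range(8)] for r in range(4)]
reduced = ring_allreduce(grads)
```

In this scheme each rank sends and receives only `2 * (n - 1)` chunks of size `length / n`, so the data moved per rank stays roughly constant as ranks are added.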

Key Statistics & Figures

Training performance improvement with cuDNN 7

up to 3x faster

compared to Pascal GPUs

Inference performance improvement with TensorRT 4

50x faster

on V100 GPUs for ONNX models

Throughput increase with TensorRT 4

45x higher

compared to CPU

Technologies & Tools

Software

CUDA

Used for optimizing deep learning models and reducing kernel launch latency.

Software

cuDNN

Enhances deep learning framework performance on NVIDIA GPUs.

Software

NCCL

Facilitates multi-GPU and multi-node communication for deep learning applications.

Software

TensorRT

Accelerates inference applications in various domains including NLP and recommendation systems.

Framework

TensorFlow

Integrated with TensorRT for improved inference performance.

Software

MATLAB

Announced integration with TensorRT for generating CUDA code.

Orchestration

Kubernetes

Enables scaling of training and inference deployment on multi-cloud GPU clusters.

Software

Isaac Robotics SDK

Provides tools for developing AI in robotics applications.

Key Actionable Insights

1
Leverage CUDA 9.2 to optimize your deep learning models by utilizing the new library for custom linear algebra algorithms and cuBLAS optimizations for RNNs and CNNs.
This is particularly useful for developers looking to enhance the performance of their models on NVIDIA GPUs, especially when working with complex architectures.

2
Integrate TensorRT 4 into your inference pipeline to achieve significant performance gains, especially for applications in speech recognition and natural language processing.
By using TensorRT, developers can drastically reduce inference times, making applications more responsive and efficient.

3
Utilize NCCL for efficient multi-GPU training to improve the scalability of your deep learning models, particularly when working with large datasets and complex networks.
This is essential for teams working on high-performance computing tasks that require communication across multiple GPUs.

Common Pitfalls

1

Many developers underestimate the importance of optimizing their deep learning models for specific hardware architectures.

Without leveraging the optimizations provided by tools like CUDA and cuDNN, applications may not perform efficiently, leading to longer training times and suboptimal inference performance.

Related Concepts

Deep Learning Optimization Techniques

GPU Architecture and Performance Tuning

Integration of AI in Robotics