Programming Tensor Cores in CUDA 9

Brad Nemire

A defining feature of the new Volta GPU Architecture is its Tensor Cores, which give the Tesla V100 accelerator a peak throughput 12 times the 32-bit floating…

NVIDIA

•

Brad Nemire

•2 min read•advanced•

--

•View Original

Artificial IntelligencePyTorchTensorFlow

Overview

The article discusses the capabilities and programming of Tensor Cores in CUDA 9, highlighting their significant performance improvements in deep learning applications. It details how Tensor Cores can achieve up to 125 Tensor TFLOPS and their integration with popular deep learning frameworks.

What You'll Learn

1

How to use Tensor Cores for matrix-multiply-and-accumulate operations in CUDA

2

Why Tensor Cores are essential for improving performance in deep learning applications

3

How to enable Tensor Cores in TensorFlow, PyTorch, MXNet, and Caffe2

Prerequisites & Requirements

Understanding of deep learning concepts and frameworks
Familiarity with CUDA programming

Key Questions Answered

What are Tensor Cores and how do they enhance GPU performance?

Tensor Cores are programmable matrix-multiply-and-accumulate units found in the Tesla V100 GPU, capable of delivering up to 125 Tensor TFLOPS. They significantly increase floating-point compute throughput while maintaining modest area and power costs, making them essential for deep learning tasks.

How can Tensor Cores be utilized in popular deep learning frameworks?

Tensor Cores are supported in frameworks like TensorFlow, PyTorch, MXNet, and Caffe2, allowing developers to leverage their performance benefits for training and inference. The Mixed-Precision Training Guide provides details on enabling Tensor Cores in these frameworks.

What performance improvements do Tensor Cores provide for GEMM and convolutions?

Tensor Cores enhance performance for GEMM computations and convolutions, which are critical in many applications, including signal processing and fluid dynamics. They address the increasing need for speed as data sizes grow exponentially.

Key Statistics & Figures

Peak throughput of Tensor Cores

125 Tensor TFLOPS

This performance metric is crucial for training and inference applications in deep learning.

Number of Tensor Cores in Tesla V100

640 Tensor Cores

Each Streaming Multiprocessor (SM

Throughput improvement over Tesla P100

12 times

The Tesla V100's Tensor Cores provide a significant increase in throughput compared to the previous-generation Tesla P100.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend

Cuda

Used for programming Tensor Cores and optimizing deep learning applications.

Framework

Tensorflow

One of the deep learning frameworks that supports Tensor Cores.

Framework

Pytorch

Another deep learning framework that utilizes Tensor Cores for enhanced performance.

Library

Cublas

A CUDA library that uses Tensor Cores to accelerate GEMM computations.

Library

Cudnn

A CUDA library that leverages Tensor Cores for speeding up convolutions and recurrent neural networks.

Key Actionable Insights

1
Integrate Tensor Cores into your deep learning models to significantly boost performance during training and inference.
Utilizing Tensor Cores can lead to faster model training times and improved inference speeds, which is crucial as models become more complex and data-intensive.

2
Leverage the cuBLAS and cuDNN libraries to optimize matrix operations and convolutions in your applications.
These libraries are specifically designed to take advantage of Tensor Cores, ensuring that your applications run efficiently on NVIDIA hardware.

3
Explore the Mixed-Precision Training Guide to understand how to effectively use Tensor Cores in various deep learning frameworks.
This guide provides practical steps and considerations for maximizing the benefits of Tensor Cores in your projects.

Common Pitfalls

1

Failing to enable Tensor Cores in deep learning frameworks can lead to suboptimal performance.

Without enabling Tensor Cores, developers may miss out on significant speed improvements that are critical for handling large datasets and complex models.

Related Concepts

Deep Learning Frameworks

Matrix-multiply-and-accumulate Operations

Performance Optimization Techniques