Programming Tensor Cores in CUDA 9

A defining feature of the new Volta GPU Architecture is its Tensor Cores, which give the Tesla V100 accelerator a peak throughput 12 times the 32-bit floating…

Brad Nemire
2 min readadvanced
--
View Original

Overview

The article discusses the capabilities and programming of Tensor Cores in CUDA 9, highlighting their significant performance improvements in deep learning applications. It details how Tensor Cores can achieve up to 125 Tensor TFLOPS and their integration with popular deep learning frameworks.

What You'll Learn

1

How to use Tensor Cores for matrix-multiply-and-accumulate operations in CUDA

2

Why Tensor Cores are essential for improving performance in deep learning applications

3

How to enable Tensor Cores in TensorFlow, PyTorch, MXNet, and Caffe2

Prerequisites & Requirements

  • Understanding of deep learning concepts and frameworks
  • Familiarity with CUDA programming

Key Questions Answered

What are Tensor Cores and how do they enhance GPU performance?
Tensor Cores are programmable matrix-multiply-and-accumulate units found in the Tesla V100 GPU, capable of delivering up to 125 Tensor TFLOPS. They significantly increase floating-point compute throughput while maintaining modest area and power costs, making them essential for deep learning tasks.
How can Tensor Cores be utilized in popular deep learning frameworks?
Tensor Cores are supported in frameworks like TensorFlow, PyTorch, MXNet, and Caffe2, allowing developers to leverage their performance benefits for training and inference. The Mixed-Precision Training Guide provides details on enabling Tensor Cores in these frameworks.
What performance improvements do Tensor Cores provide for GEMM and convolutions?
Tensor Cores enhance performance for GEMM computations and convolutions, which are critical in many applications, including signal processing and fluid dynamics. They address the increasing need for speed as data sizes grow exponentially.

Key Statistics & Figures

Peak throughput of Tensor Cores
125 Tensor TFLOPS
This performance metric is crucial for training and inference applications in deep learning.
Number of Tensor Cores in Tesla V100
640 Tensor Cores
Each Streaming Multiprocessor (SM
Throughput improvement over Tesla P100
12 times
The Tesla V100's Tensor Cores provide a significant increase in throughput compared to the previous-generation Tesla P100.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend
Cuda
Used for programming Tensor Cores and optimizing deep learning applications.
Framework
Tensorflow
One of the deep learning frameworks that supports Tensor Cores.
Framework
Pytorch
Another deep learning framework that utilizes Tensor Cores for enhanced performance.
Library
Cublas
A CUDA library that uses Tensor Cores to accelerate GEMM computations.
Library
Cudnn
A CUDA library that leverages Tensor Cores for speeding up convolutions and recurrent neural networks.

Key Actionable Insights

1
Integrate Tensor Cores into your deep learning models to significantly boost performance during training and inference.
Utilizing Tensor Cores can lead to faster model training times and improved inference speeds, which is crucial as models become more complex and data-intensive.
2
Leverage the cuBLAS and cuDNN libraries to optimize matrix operations and convolutions in your applications.
These libraries are specifically designed to take advantage of Tensor Cores, ensuring that your applications run efficiently on NVIDIA hardware.
3
Explore the Mixed-Precision Training Guide to understand how to effectively use Tensor Cores in various deep learning frameworks.
This guide provides practical steps and considerations for maximizing the benefits of Tensor Cores in your projects.

Common Pitfalls

1
Failing to enable Tensor Cores in deep learning frameworks can lead to suboptimal performance.
Without enabling Tensor Cores, developers may miss out on significant speed improvements that are critical for handling large datasets and complex models.

Related Concepts

Deep Learning Frameworks
Matrix-multiply-and-accumulate Operations
Performance Optimization Techniques