NVIDIA Announces TensorRT 6; Breaks 10 millisecond barrier for BERT-Large

Nefi Alarcon

Today, NVIDIA released TensorRT 6 which includes new capabilities that dramatically accelerate conversational AI applications, speech recognition…

NVIDIA


BERT, Deep Learning, Transformers, U-Net

Overview

NVIDIA has announced TensorRT 6, which speeds up conversational AI applications, speech recognition, and image segmentation. The new version runs BERT-Large inference in 5.8 milliseconds on T4 GPUs, which makes the model practical for enterprise deployment.

What You'll Learn

1. How to achieve real-time natural language understanding with BERT-Large inference
2. Why TensorRT 6 is essential for deploying AI applications on NVIDIA GPUs
3. How to optimize applications for dynamic input shapes using TensorRT

Key Questions Answered

How fast can BERT-Large inference be achieved with TensorRT 6?

With TensorRT 6, BERT-Large inference runs in 5.8 milliseconds on NVIDIA T4 GPUs. According to NVIDIA, that latency makes it practical for enterprises to deploy the model in production for the first time.
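A latency figure like this is usually checked against a budget by timing many calls and looking at the tail, not only the average. The following is a minimal sketch of such a harness in plain Python. It does not call TensorRT: `measure_latencies_ms`, `summarize`, and the stand-in workload are hypothetical names introduced for illustration, and the 10 ms default budget is taken from the headline.

```python
import statistics
import time


def measure_latencies_ms(fn, warmup=10, runs=100):
    """Time repeated calls to fn; return per-call latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms, budget_ms=10.0):
    """Return mean, median, and nearest-rank p99, plus a budget check on p99."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank p99: ceil(0.99 * n), computed with integers to avoid float error.
    rank = -(-99 * len(ordered) // 100)
    p99 = ordered[rank - 1]
    return {
        "mean_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "p99_ms": p99,
        "within_budget": p99 <= budget_ms,
    }


if __name__ == "__main__":
    # Stand-in workload (an assumption); replace with a real inference call.
    samples = measure_latencies_ms(lambda: sum(range(1000)), warmup=2, runs=20)
    print(summarize(samples))
```

Keeping the timing loop separate from the summary makes it possible to feed `summarize` recorded latencies from any source, including a GPU benchmark run elsewhere.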

What new capabilities does TensorRT 6 offer for conversational AI?

TensorRT 6 introduces new optimizations and APIs for conversational AI, including tighter integration with frameworks and support for dynamic input shapes. Dynamic shapes matter for real-time applications because inputs such as sentence length vary from one request to the next.
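Dynamic shape support in TensorRT is configured through optimization profiles, which give each input a minimum, an optimum, and a maximum shape. The sketch below models only that range check in plain Python so the idea can be tried without a GPU. `ShapeProfile` and the example shapes are hypothetical, not part of the TensorRT API.

```python
from dataclasses import dataclass
from typing import Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class ShapeProfile:
    """Hypothetical stand-in for a per-input min/opt/max shape range."""
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape

    def __post_init__(self):
        ranks = {len(self.min_shape), len(self.opt_shape), len(self.max_shape)}
        if len(ranks) != 1:
            raise ValueError("min, opt, and max shapes must have the same rank")
        for lo, opt, hi in zip(self.min_shape, self.opt_shape, self.max_shape):
            if not lo <= opt <= hi:
                raise ValueError("each dimension must satisfy min <= opt <= max")

    def accepts(self, shape: Shape) -> bool:
        """True if every dimension of shape lies within [min, max]."""
        if len(shape) != len(self.min_shape):
            return False
        return all(
            lo <= dim <= hi
            for lo, dim, hi in zip(self.min_shape, shape, self.max_shape)
        )


# Assumed example: token IDs shaped (batch, sequence_length).
profile = ShapeProfile(min_shape=(1, 1), opt_shape=(1, 128), max_shape=(8, 384))
print(profile.accepts((4, 200)))   # inside the range
print(profile.accepts((16, 128)))  # batch dimension exceeds the maximum
```

A request whose shape falls outside the declared range has to be rejected, split, or padded before inference, so the range is worth choosing from observed traffic.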

What improvements does TensorRT 6 provide for medical applications?

TensorRT 6 adds new layers for 3D convolutions, which deliver up to 5x faster inference than CPU for image segmentation in medical applications.
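For readers unfamiliar with the operation, a 3D convolution slides a small kernel across depth, height, and width, which is why it suits volumetric scans. The sketch below is a deliberately naive pure-Python version (single channel, stride 1, no padding) meant to show what the layer computes, not how TensorRT implements it. The function name `conv3d_valid` is hypothetical.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D convolution with stride 1 and no padding.

    As in most deep learning frameworks, the kernel is not flipped, so this
    is strictly a cross-correlation. Inputs are nested lists indexed [z][y][x].
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    k_d, k_h, k_w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_d, out_h, out_w = depth - k_d + 1, height - k_h + 1, width - k_w + 1
    if min(out_d, out_h, out_w) < 1:
        raise ValueError("kernel must not be larger than the volume")

    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_d)]
    for z in range(out_d):
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                # Sum the element-wise products over the kernel window.
                for dz in range(k_d):
                    for dy in range(k_h):
                        for dx in range(k_w):
                            acc += volume[z + dz][y + dy][x + dx] * kernel[dz][dy][dx]
                out[z][y][x] = acc
    return out


# Toy data: a 3x3x3 volume of ones and a 2x2x2 kernel of ones.
ones_volume = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
ones_kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
result = conv3d_valid(ones_volume, ones_kernel)
```

The six nested loops make the cost visible: work grows with the product of the output volume and the kernel volume, which is the kind of load that benefits from GPU acceleration.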

Key Statistics & Figures

BERT-Large inference time

5.8 ms

Achieved on NVIDIA T4 GPUs, enabling practical deployment in enterprise environments.

BERT-Base inference time

2 ms

The smaller BERT-Base model runs faster still, which suits latency-sensitive language tasks.

Inference speed improvement for medical applications

up to 5x faster

Compared to CPU, enhancing image segmentation tasks in healthcare.

Technologies & Tools

Backend

TensorRT

Used for optimizing deep learning inference and runtime performance.

AI/ML

BERT

Utilized for natural language understanding and processing tasks.

Key Actionable Insights

1. Leverage TensorRT 6 to optimize your AI applications for lower latency and higher throughput.
This is particularly important for applications requiring real-time processing, such as conversational AI and speech recognition, where every millisecond counts.

2. Utilize the new API features in TensorRT 6 to handle dynamic input shapes efficiently.
This capability is essential for applications with fluctuating compute needs, where input dimensions such as batch size or sequence length change at runtime.

3. Explore the TensorRT open source repository for new samples to accelerate various applications.
The samples include implementations for language processing and image recognition, providing a practical starting point for developers looking to enhance their applications.

Common Pitfalls

1. Overlooking the importance of optimizing input shapes for AI applications.

Failing to account for dynamic input shapes can lead to inefficient processing and increased latency, particularly in real-time applications.
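One way to see the cost of ignoring dynamic shapes is to count how many token positions a model must process when every input is padded to a fixed maximum, compared with padding each batch only to its own longest sequence. The sketch below does that arithmetic in plain Python. The function names and the sequence lengths are made up for illustration, and token count is only a rough proxy for compute, not a latency measurement.

```python
def tokens_fixed_shape(lengths, max_len):
    """Token positions processed when every sequence is padded to max_len."""
    if any(n > max_len for n in lengths):
        raise ValueError("a sequence is longer than max_len")
    return len(lengths) * max_len


def tokens_dynamic_shape(lengths, batch_size):
    """Token positions processed when each batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical request lengths, in tokens.
lengths = [12, 40, 7, 128, 33, 64, 9, 20]

fixed = tokens_fixed_shape(lengths, max_len=128)
dynamic = tokens_dynamic_shape(lengths, batch_size=4)
useful = sum(lengths)

print(f"fixed shape:   {fixed} positions, {useful / fixed:.0%} useful")
print(f"dynamic shape: {dynamic} positions, {useful / dynamic:.0%} useful")
```

With these made-up lengths the fixed-shape path processes 1024 positions and the per-batch path 768, while only 313 positions hold real tokens. Grouping requests of similar length would shrink the gap further.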

Related Concepts

Natural Language Processing

Deep Learning Optimization

AI Application Deployment