NVIDIA AI Inference Performance Milestones: Delivering Leading Throughput, Latency and Efficiency

Inference is where AI-based applications really go to work. Object recognition, image classification, natural language processing…

Dave Salvator

Overview

The article discusses NVIDIA's advancements in AI inference performance, highlighting the capabilities of the TensorRT 5 optimizer and the NVIDIA T4 and V100 GPUs. It emphasizes the importance of throughput, latency, and efficiency in AI applications, showcasing how these technologies can enhance real-time services and data center productivity.

What You'll Learn

1. How to leverage TensorRT 5 for optimizing AI inference
2. Why NVIDIA T4 and V100 GPUs are suitable for different server designs
3. When to choose between scale-up and scale-out deployments for AI workloads

Key Questions Answered

What are the key performance metrics for NVIDIA GPUs in AI inference?
NVIDIA T4 and V100 GPUs deliver high throughput, low latency, and high energy efficiency. T4 achieves up to 85 images/sec/Watt, and both GPUs can deliver latencies of around 1ms, which makes them suitable for real-time services.
How does TensorRT 5 enhance AI inference performance?
TensorRT 5 adds support for the NVIDIA T4 Cloud GPU and its Turing Tensor Cores, which run INT8-precision inference more than 2x faster than the previous-generation low-power offering. It also includes the TensorRT Inference Server, which improves GPU utilization.
What types of workloads are suited for NVIDIA V100 and T4 GPUs?
NVIDIA V100 is designed for scale-up deployments handling heavy workloads like AI training and HPC, while T4 is optimized for scale-out designs, capable of handling high-batch inference, real-time services, and video transcoding.
What inference operations are involved in a digital assistant's response?
A digital assistant's response involves multiple inference operations: Automatic Speech Recognition, Natural Language Processing for understanding and responding, a Recommender System, and Speech Synthesis to deliver the answer. All must occur within one second.
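
The one-second requirement can be read as a latency budget shared by the four inference stages. The sketch below is only an illustration of that bookkeeping: the `Stage` type, the helper functions, and every per-stage latency are hypothetical placeholders, not figures from the article.

```python
from dataclasses import dataclass

# The article states that all stages must complete within one second.
BUDGET_MS = 1000.0


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # hypothetical placeholder, not a measured value


def total_latency_ms(stages):
    """Sum the latencies of stages that run one after another."""
    return sum(stage.latency_ms for stage in stages)


def fits_budget(stages, budget_ms=BUDGET_MS):
    """Return True if the sequential pipeline finishes within the budget."""
    return total_latency_ms(stages) <= budget_ms


# Stage names follow the article; the numbers are assumed for illustration.
PIPELINE = [
    Stage("automatic_speech_recognition", 200.0),
    Stage("natural_language_processing", 150.0),
    Stage("recommender_system", 100.0),
    Stage("speech_synthesis", 250.0),
]

if __name__ == "__main__":
    total = total_latency_ms(PIPELINE)
    print(f"total: {total:.0f} ms, headroom: {BUDGET_MS - total:.0f} ms")
    print(f"fits budget: {fits_budget(PIPELINE)}")
```

Replacing the placeholder values with measured per-stage latencies shows how much headroom remains for network and queuing overhead.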

Key Statistics & Figures

T4 GPU efficiency
85 images/sec/Watt
This efficiency metric highlights T4's performance in delivering high throughput with low energy consumption.
Latency for real-time services
around 1ms
Both NVIDIA V100 and T4 GPUs can achieve this latency, making them suitable for applications requiring immediate responses.
INT8 precision acceleration
more than 2x faster
This acceleration is achieved with the Turing Tensor Cores in the T4 GPU compared to previous generation low-power offerings.
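
The images/sec/Watt figure is throughput divided by power draw. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the 70 W power draw in the example is an assumption chosen for illustration, not a number stated in the article.

```python
def images_per_sec_per_watt(images_per_sec, power_watts):
    """Efficiency = throughput divided by power draw."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return images_per_sec / power_watts


def throughput_at_efficiency(efficiency, power_watts):
    """Throughput implied by a given efficiency at a given power draw."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return efficiency * power_watts


if __name__ == "__main__":
    # 85 images/sec/Watt is the article's peak T4 figure.
    # 70 W is an assumed power draw used only for this example.
    implied = throughput_at_efficiency(85.0, 70.0)
    print(f"implied throughput: {implied:.0f} images/sec")
    print(f"round trip: {images_per_sec_per_watt(implied, 70.0):.0f} images/sec/Watt")
```

Because the article quotes 85 images/sec/Watt as an upper bound ("up to"), the implied throughput should be treated as a ceiling under the assumed power draw rather than as a measured result.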

Technologies & Tools

Software
TensorRT 5
Used as an inference optimizer and runtime for AI applications.
Hardware
NVIDIA T4 Cloud GPU
Provides enhanced performance for AI inference with Turing Tensor Cores.
Hardware
NVIDIA V100
Designed for high-performance AI training and inference workloads.
Tools
Docker
Facilitates the integration of TensorRT Inference Server into DevOps deployments.
Tools
Kubernetes
Supports the deployment and management of containerized applications including inference services.

Key Actionable Insights

1. Use TensorRT 5 to optimize your AI inference workflows for better performance.
By implementing TensorRT 5, you can take advantage of its support for the latest NVIDIA GPUs, improving the speed and efficiency of your AI applications.
2. Choose the right GPU for your deployment strategy: V100 for scale-up and T4 for scale-out.
Understanding the differences between these GPUs allows for better resource allocation and performance optimization in your data center.
3. Focus on minimizing latency for real-time AI services to improve user experience.
Both T4 and V100 can achieve latencies of around 1ms, so optimizing your inference operations can significantly improve the responsiveness of AI applications.
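
Minimizing latency starts with measuring it. The sketch below times repeated calls to an inference function and reports percentiles using the nearest-rank method. The `run_inference` callable is a hypothetical stand-in for whatever inference call a service actually makes; no TensorRT API is used or implied here.

```python
import math
import time


def measure_latencies_ms(fn, runs=100, warmup=10):
    """Call fn repeatedly and return the per-call wall-clock latency in ms."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_inference():
    """Hypothetical placeholder for a real inference call."""
    sum(i * i for i in range(1000))


if __name__ == "__main__":
    latencies = measure_latencies_ms(run_inference)
    print(f"p50: {percentile(latencies, 50):.3f} ms")
    print(f"p99: {percentile(latencies, 99):.3f} ms")
```

Tail percentiles such as p99 are usually more informative than the mean for real-time services, because a small fraction of slow responses is what users notice.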

Common Pitfalls

1. Relying on CPU-only inference can lead to performance bottlenecks.
As AI applications become more complex, GPU acceleration is essential to meet the demands of real-time processing and high throughput.