Getting the Most Out of NVIDIA T4 on AWS G4 Instances

Learn how to get the best natural language inference performance from AWS G4dn instances powered by NVIDIA T4 GPUs, and how to deploy BERT networks easily using…

Overview

This article discusses optimizing AI inference performance using NVIDIA T4 GPUs on AWS G4 instances, particularly for natural language processing applications like BERT. It highlights the cost efficiency, performance metrics, and deployment strategies using NVIDIA Triton Inference Server and TensorRT.

What You'll Learn

1. How to deploy BERT models for inference using NVIDIA Triton Inference Server
2. Why using INT8 precision can improve inference performance significantly
3. How to optimize inference performance with TensorRT on AWS G4 instances

Prerequisites & Requirements

  • Understanding of AI inference and natural language processing concepts
  • Familiarity with Docker and NVIDIA Triton Inference Server (optional)

Key Questions Answered

What is the cost of running BERT-based networks on AWS G4 instances?
Running BERT-based networks on AWS G4 instances costs about 10 cents for a million sentences with BERT Base and around 30 cents for BERT Large. This cost efficiency allows organizations to deploy powerful natural language applications affordably.
How does NVIDIA Triton Inference Server enhance model deployment?
NVIDIA Triton Inference Server simplifies model deployment by providing features like automatic load balancing, automatic scaling, and dynamic batching, which are crucial for real-time AI applications. This helps maximize GPU utilization and throughput.
What performance improvements can be achieved using INT8 precision?
Using INT8 precision with NVIDIA Triton can yield up to an 80% performance improvement compared to FP16, allowing for more simultaneous requests at any given latency requirement, which is essential for real-time applications.
What are the performance metrics for BERT inference on AWS G4 instances?
For BERT Base, the T4 GPU can process nearly 1,800 sentences per second within a 10ms latency budget. For BERT Large, it can deliver 449 sentences per second under the same latency budget, showing that T4 GPUs are capable of high-throughput inference.

Key Statistics & Figures

Cost per million sentences (BERT Base)
10 cents
Cost efficiency for deploying BERT-based networks on AWS G4 instances.
Cost per million sentences (BERT Large)
30 cents
Cost for deploying the larger BERT model on AWS G4 instances.
Performance improvement using INT8
80%
Performance gain compared to FP16 precision in inference tasks.
Throughput for BERT Base
1,800 sentences/sec
Achieved within a 10ms latency budget on T4 GPU.
Throughput for BERT Large
449 sentences/sec
Achieved within a 10ms latency budget on T4 GPU.
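
The cost and throughput figures above are linked by simple arithmetic: the time needed to process a million sentences, multiplied by the instance's hourly price. The sketch below shows that calculation. The hourly price is an assumption for illustration only (it is not stated in the article, and actual AWS pricing varies by region, instance size, and purchase option), and the function name is hypothetical.

```python
def cost_per_million_sentences(sentences_per_sec: float, hourly_price_usd: float) -> float:
    """Return the USD cost of processing one million sentences at a sustained rate."""
    if sentences_per_sec <= 0:
        raise ValueError("sentences_per_sec must be positive")
    seconds_needed = 1_000_000 / sentences_per_sec
    hours_needed = seconds_needed / 3600
    return hours_needed * hourly_price_usd


# ASSUMPTION: illustrative hourly price for a single-T4 instance.
# This value is not from the article; check current AWS pricing before relying on it.
ASSUMED_HOURLY_PRICE_USD = 0.526

# Throughput figures quoted in the article (10ms latency budget).
bert_base_cost = cost_per_million_sentences(1800, ASSUMED_HOURLY_PRICE_USD)
bert_large_cost = cost_per_million_sentences(449, ASSUMED_HOURLY_PRICE_USD)

print(f"BERT Base:  ${bert_base_cost:.2f} per million sentences")
print(f"BERT Large: ${bert_large_cost:.2f} per million sentences")
```

Under this assumed price, the results land in the same range as the article's figures of roughly 10 cents and 30 cents. The roughly fourfold gap between the two models follows directly from the throughput ratio.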

Technologies & Tools

Hardware
NVIDIA T4
Used for AI inference on AWS G4 instances.
Software
NVIDIA Triton Inference Server
Facilitates model deployment and inference optimization.
Software
TensorRT
Optimizes deep learning models for inference performance.
Cloud Service
AWS G4 Instances
Provides GPU-based instances for machine learning inference.

Key Actionable Insights

1. Utilize NVIDIA Triton Inference Server to enhance your AI model deployment efficiency.
Triton offers features like dynamic batching and concurrent model execution, which can significantly improve throughput and reduce latency for real-time applications.
2. Leverage TensorRT for optimizing your deep learning models to achieve lower latency and higher throughput.
TensorRT's capabilities, such as mixed precision and layer fusion, ensure that your models run efficiently on NVIDIA GPUs, maximizing performance.
3. Consider using INT8 precision for your inference workloads to improve performance metrics.
The article notes an 80% performance improvement when using INT8 over FP16, making it a crucial consideration for applications requiring high throughput.
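
To make the INT8 recommendation more concrete, here is a minimal sketch of symmetric INT8 quantization: floats are mapped to 8-bit integers through a single scale factor, trading a small rounding error for cheaper arithmetic. This is a general illustration of the idea, not TensorRT's calibration procedure, and all function names and sample values are hypothetical.

```python
def symmetric_scale(values):
    """Pick a scale so the largest magnitude maps to the INT8 limit of 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantize_int8(values, scale):
    """Map floats to integers in [-127, 127] using a symmetric scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_int8(quantized, scale):
    """Map INT8 values back to approximate floats."""
    return [q * scale for q in quantized]


# Hypothetical activation values, chosen only for illustration.
activations = [0.02, -1.27, 0.64, 0.9, -0.4]

scale = symmetric_scale(activations)
quantized = quantize_int8(activations, scale)
restored = dequantize_int8(quantized, scale)

# The rounding error per value is bounded by half a quantization step.
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print(quantized, f"max error = {max_error:.4f}")
```

The error bound of half a step is why the choice of scale matters: a scale stretched by outliers makes every step coarser, which is the kind of accuracy trade-off to check before moving a model from FP16 to INT8.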

Common Pitfalls

1. Neglecting to optimize model precision can lead to suboptimal performance.
Running in FP16 where INT8 would work forgoes the throughput gain the article reports (up to 80%), which matters most for real-time applications with tight latency requirements.
2. Failing to utilize dynamic batching may result in inefficient GPU utilization.
Without dynamic batching, each inference request is executed on its own rather than grouped with others, leading to increased latency under load and reduced throughput.
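
The dynamic batching pitfall is easier to see with a toy model. The sketch below groups incoming requests into batches, dispatching a batch when it is full or when its oldest request has waited too long. It is a simplified simulation written for this summary, not Triton's scheduler, and the function and parameter names are hypothetical.

```python
def dynamic_batches(arrival_times_ms, max_batch_size, max_queue_delay_ms):
    """Group request arrival times into batches.

    A batch is closed when it reaches max_batch_size, or when the next
    request arrives more than max_queue_delay_ms after the batch's oldest one.
    """
    batches = []
    current = []
    for t in sorted(arrival_times_ms):
        # Oldest queued request has waited too long: dispatch what we have.
        if current and (t - current[0] > max_queue_delay_ms):
            batches.append(current)
            current = []
        current.append(t)
        # Batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical request arrival times in milliseconds.
arrivals = [0.0, 0.4, 0.9, 1.1, 5.0, 5.2, 9.0]
batches = dynamic_batches(arrivals, max_batch_size=4, max_queue_delay_ms=2.0)

print(f"{len(arrivals)} requests served in {len(batches)} GPU executions")
```

In this example seven requests collapse into three executions instead of seven. The queue delay is the knob that trades a little added latency for larger batches, so it should be set with the latency budget in mind.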

Related Concepts

AI Inference Optimization
NVIDIA Triton Inference Server Features
TensorRT Performance Enhancements