Extending NVIDIA Performance Leadership with MLPerf Inference 1.0 Results

Dave Salvator

In this post, we step through some of the optimizations behind these results, including the use of Triton Inference Server and the A100 Multi-Instance GPU (MIG) feature.

NVIDIA

Overview

The article discusses NVIDIA's performance leadership in AI inference as demonstrated by the MLPerf Inference 1.0 results. It highlights the introduction of new GPUs, optimizations made for inference tasks, and the capabilities of the Triton Inference Server and Multi-Instance GPU (MIG) feature.

What You'll Learn

1. How to utilize the Triton Inference Server for deploying AI models
2. Why using INT8 precision can improve inference performance
3. How to leverage Multi-Instance GPU (MIG) for better server utilization

Prerequisites & Requirements

Understanding of AI inference concepts
Familiarity with Triton Inference Server and TensorRT (optional)

Key Questions Answered

What are the key features of MLPerf Inference 1.0?

MLPerf Inference 1.0 introduces new features such as tests for power and energy efficiency, and increased test runtimes from 1 minute to 10 minutes. These changes aim to better evaluate the performance of AI inference across various platforms and applications.

How did NVIDIA perform in the MLPerf Inference 1.0 benchmarks?

NVIDIA was the only company to submit results for all data center and edge tests, achieving the best performance across all categories. This included the debut of the A10 and A30 GPUs, which are designed for specific AI workloads.

What optimizations did NVIDIA implement for inference tasks?

NVIDIA implemented several optimizations, including the use of INT8 precision for models like RNN-T, layer fusion to reduce computational load, and leveraging the Triton Inference Server for efficient model deployment. These optimizations significantly improved performance and efficiency.
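
To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only: the max-absolute-value scale and the function names `quantize_int8` and `dequantize_int8` are assumptions made for this example, not the calibration method TensorRT or NVIDIA's MLPerf submission uses.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -1.27, 0.64, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored in 8 bits instead of 32, and the rounding error per value is bounded by half the scale. Whether that error is acceptable depends on the model, which is why accuracy has to be checked after quantizing.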

What is the significance of the Multi-Instance GPU (MIG) feature?

The MIG feature allows a single A100 GPU to be partitioned into multiple instances, enabling better server utilization and the ability to run multiple inference workloads simultaneously. This approach helps maximize resource usage while maintaining high performance.
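
The partitioning arithmetic can be sketched as follows. The profile table is an assumption taken from NVIDIA's MIG documentation for the A100 40GB, not from the article, and the helper `max_instances` is hypothetical. Real MIG placement rules are stricter than this simple division, so treat the result as an upper bound for a single profile.

```python
# Assumed A100 40GB MIG profiles: name -> (compute slices, memory slices).
# These figures come from NVIDIA's MIG documentation, not from the article.
A100_40GB_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "7g.40gb": (7, 8),
}
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_SLICES = 8


def max_instances(profile):
    """Upper bound on how many instances of one profile fit on a single GPU."""
    compute, memory = A100_40GB_PROFILES[profile]
    return min(TOTAL_COMPUTE_SLICES // compute, TOTAL_MEMORY_SLICES // memory)


for name in A100_40GB_PROFILES:
    print(name, max_instances(name))
```

Smaller profiles give more isolated instances for light workloads, while larger profiles suit models that need more memory or compute per request.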

Key Statistics & Figures

Performance improvement

46%

NVIDIA's MLPerf Inference 1.0 results show a 46% performance increase compared to its previous MLPerf Inference 0.7 submission.

Technologies & Tools

Software

Triton Inference Server

Used for deploying AI models at scale across various infrastructures.

Software

TensorRT

Utilized for optimizing AI inference performance.

Hardware

Multi-Instance GPU (MIG)

Enables partitioning of GPUs to run multiple inference workloads simultaneously.

Key Actionable Insights

1. Utilize the Triton Inference Server to streamline your AI model deployment process.
Triton allows for deploying models from various frameworks on any infrastructure, making it easier to manage and scale AI applications across cloud, data center, or edge environments.

2. Consider using INT8 precision for your inference tasks to enhance performance and reduce latency.
By adopting INT8, you can achieve better performance metrics while maintaining the necessary accuracy, which is particularly beneficial for real-time applications.

3. Leverage the Multi-Instance GPU (MIG) feature to optimize resource allocation in your data center.
MIG enables you to run multiple workloads on a single GPU, improving overall server utilization and allowing for efficient handling of varying compute demands.
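
For the first insight, the sketch below shows what a request to a model served by Triton might look like. Triton exposes an HTTP endpoint that follows the KServe v2 inference protocol; the example only builds the URL and JSON body and does not send anything. The server address, the model name `my_model`, and the input name `INPUT0` are hypothetical placeholders, and the helper `build_infer_request` is not part of any Triton library.

```python
import json


def build_infer_request(base_url, model_name, input_name, shape, datatype, data):
    """Build the URL and JSON body for a v2 inference request (nothing is sent)."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    payload = {
        "inputs": [
            {
                "name": input_name,       # must match the model's configured input
                "shape": list(shape),
                "datatype": datatype,     # e.g. "FP32", "INT8"
                "data": list(data),       # flattened tensor contents
            }
        ]
    }
    return url, json.dumps(payload)


# Hypothetical server, model, and tensor names, used only for illustration.
url, body = build_infer_request(
    "http://localhost:8000", "my_model", "INPUT0", [1, 4], "FP32",
    [0.1, 0.2, 0.3, 0.4],
)
print(url)
print(body)
```

Because the request format is the same regardless of which framework produced the model, client code like this does not need to change when the backend does.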

Common Pitfalls

1. Failing to optimize AI models for inference can lead to suboptimal performance.

Many developers overlook the importance of precision and optimization techniques like layer fusion, which can significantly impact the efficiency and speed of AI applications.
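
As a rough illustration of layer fusion, the sketch below folds a per-output scale-and-shift step (such as a batch normalization layer at inference time) into the linear layer that precedes it, so two passes over the data become one. This shows only the algebraic idea in plain Python; it is an assumption-laden toy, not how TensorRT implements its fusions, and all function names and values are hypothetical.

```python
def linear(x, weights, bias):
    """y = W x + b, with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def scale_shift(y, gamma, beta):
    """Per-output affine step: z_i = gamma_i * y_i + beta_i."""
    return [g * yi + b for g, yi, b in zip(gamma, y, beta)]


def fuse_linear_scale_shift(weights, bias, gamma, beta):
    """Fold the affine step into the linear layer's weights and bias."""
    fused_w = [[g * w for w in row] for row, g in zip(weights, gamma)]
    fused_b = [g * b + bt for g, b, bt in zip(gamma, bias, beta)]
    return fused_w, fused_b


# Hypothetical values, chosen only for illustration.
x = [1.0, 2.0]
weights = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
gamma = [2.0, 0.5]
beta = [0.0, 1.0]

unfused = scale_shift(linear(x, weights, bias), gamma, beta)
fused_w, fused_b = fuse_linear_scale_shift(weights, bias, gamma, beta)
fused = linear(x, fused_w, fused_b)
print(unfused, fused)
```

The fused layer produces the same output while removing one intermediate result that would otherwise be written to and read back from memory.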

Related Concepts

AI Inference Optimization Techniques

Performance Benchmarking In AI

Deployment Strategies For AI Models