Winning MLPerf Inference 0.7 with a Full-Stack Approach

Dave Salvator

Three trends continue to drive the AI inference market for both training and inference: growing data sets, increasingly complex and diverse networks…

NVIDIA

Azure, BERT, Docker, Google Cloud, Helm, Kubernetes, Python, PyTorch, ResNet, TensorFlow, U-Net

Overview

The article discusses NVIDIA's success in the MLPerf Inference 0.7 benchmark through a full-stack approach that incorporates advanced software optimizations, Multi-Instance GPU (MIG) technology, and the Triton Inference Server. It highlights the importance of these innovations in addressing the growing demands of AI inference workloads across various applications.

What You'll Learn

1. How to optimize AI inference workloads using NVIDIA technologies

2. Why Multi-Instance GPU (MIG) technology enhances GPU utilization

3. How to deploy AI models at scale using Triton Inference Server

Prerequisites & Requirements

Understanding of AI inference concepts and GPU architecture
Familiarity with Triton Inference Server and TensorRT (optional)

Key Questions Answered

What are the key optimizations used in MLPerf Inference 0.7?

The article details several optimizations: int8 and FP16 precision for inference, the DALI library for GPU-accelerated preprocessing, and TensorRT 7.2 support for variable sequence lengths in NLP. Together they raise throughput while still meeting the benchmark's accuracy targets.

How does Multi-Instance GPU (MIG) technology improve performance?

MIG technology allows a single A100 GPU to be partitioned into up to seven independent instances, enabling multiple workloads to run simultaneously with hardware isolation. This leads to better resource utilization, reduced server counts, and improved energy efficiency, as demonstrated in the MLPerf results.
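The server-count argument can be illustrated with a little arithmetic. The sketch below is a deliberately simplified model, not NVIDIA's placement logic: it assumes every workload fits in the smallest MIG instance, that all seven instances are usable, and that without MIG each workload occupies a whole GPU. The function name `gpus_needed` is hypothetical.

```python
import math

# From the article: one A100 can be partitioned into up to seven instances.
INSTANCES_PER_GPU = 7


def gpus_needed(num_workloads: int, use_mig: bool) -> int:
    """Estimate GPUs required for a number of small, independent workloads.

    Simplifying assumptions (not from the article): each workload fits in a
    single smallest-size MIG instance, and without MIG each workload is given
    a dedicated GPU to preserve isolation.
    """
    if num_workloads <= 0:
        return 0
    if use_mig:
        # Pack up to seven isolated workloads onto each GPU.
        return math.ceil(num_workloads / INSTANCES_PER_GPU)
    return num_workloads


without_mig = gpus_needed(20, use_mig=False)
with_mig = gpus_needed(20, use_mig=True)
print(f"20 workloads: {without_mig} GPUs without MIG, {with_mig} with MIG")
```

Real deployments mix instance sizes and must respect memory limits per instance, so actual savings depend on the workload mix.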

What role does Triton Inference Server play in AI model deployment?

Triton Inference Server simplifies the deployment of AI models by allowing teams to serve models from various frameworks and manage server resources dynamically. It integrates with Kubernetes for load balancing and can handle multiple networks on individual GPUs, ensuring optimal performance and resource utilization.
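To make the serving step concrete, here is a minimal sketch of how a client might build an inference request for Triton's HTTP endpoint, which follows the KServe v2 inference protocol. It only constructs the URL and JSON body and does not contact a server. The model name `bert`, the input name `input_ids`, the token values, and the host `localhost:8000` are illustrative assumptions, not details from the article.

```python
import json


def infer_url(host: str, model_name: str) -> str:
    """Return the v2 inference endpoint for a model served over HTTP."""
    return f"http://{host}/v2/models/{model_name}/infer"


def build_infer_request(input_name, shape, datatype, data):
    """Build a v2-protocol request body with a single input tensor.

    `data` is the flattened tensor contents; its length must match `shape`.
    """
    expected = 1
    for dim in shape:
        expected *= dim
    if expected != len(data):
        raise ValueError(f"shape {shape} needs {expected} values, got {len(data)}")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),
            }
        ]
    }


# Hypothetical example: one sequence of four token IDs for a model named "bert".
url = infer_url("localhost:8000", "bert")
body = json.dumps(build_infer_request("input_ids", [1, 4], "INT32", [101, 2023, 2003, 102]))
print(url)
print(body)
```

Sending `body` as an HTTP POST to `url` would request inference from a running server; the input names and datatypes must match the deployed model's configuration.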

What performance improvements were achieved with the new optimizations?

The article reports that int8 precision delivers 2x higher math throughput rates, while the BERT submission used FP16 to meet the highest accuracy target of more than 99.9% of FP32 accuracy. In addition, exploiting sparsity in the Open category BERT submission improved throughput by 21%.
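The idea behind int8 inference can be sketched in a few lines. The example below shows symmetric per-tensor quantization in plain Python, one common scheme chosen here for illustration; it is not the calibration procedure TensorRT uses, and the weight values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


# Hypothetical weights, used only to show the round trip.
weights = [0.02, -1.27, 0.64, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is at most half the scale, which is why networks with well-behaved value ranges can keep their accuracy at int8 while others, such as BERT at the highest accuracy target, need FP16.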

Key Statistics & Figures

Math throughput gain with int8 precision

2x higher

Achieved by quantizing the network to int8 precision while maintaining accuracy.

Accuracy target for BERT

>99.9% of FP32

Meeting the highest accuracy target required FP16 precision.

Throughput improvement with sparsity in BERT

21%

Achieved while preserving the same accuracy as the Closed submission.

Technologies & Tools

Hardware

Multi-Instance GPU (MIG)

Allows partitioning of A100 GPUs into multiple instances for better resource utilization.

Software

Triton Inference Server

Facilitates deployment and management of AI models at scale.

Software

DALI

Accelerates preprocessing on GPU to avoid CPU bottlenecks.

Software

TensorRT

Optimizes inference performance for AI models.

Key Actionable Insights

1. Leverage Multi-Instance GPU (MIG) technology to maximize GPU resource utilization.
By partitioning A100 GPUs into multiple instances, organizations can run different workloads simultaneously, reducing hardware costs and improving efficiency. This is particularly beneficial for applications with varying workloads.

2. Utilize the Triton Inference Server for scalable AI model deployment.
Triton supports dynamic load balancing and efficient resource management, making it easier to handle fluctuating workloads and maintain service level agreements (SLAs). This is crucial for businesses that rely on AI services.

3. Implement preprocessing optimizations using the DALI library to enhance performance.
DALI moves preprocessing onto the GPU, which can significantly reduce CPU bottlenecks and lead to faster inference times. This is especially important for real-time AI applications where latency is critical.

Common Pitfalls

1. Underestimating the importance of preprocessing in AI inference workloads.

Many developers focus solely on model architecture and forget that preprocessing can significantly impact overall performance. Utilizing tools like DALI can mitigate this issue.

2. Neglecting to optimize for variable sequence lengths in NLP tasks.

Padding inputs to a fixed length can lead to unnecessary compute overhead. Using TensorRT's plugins for variable sequence lengths can enhance performance by over 35%.
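The cost of fixed-length padding is easy to estimate. The sketch below counts how many token positions in a padded batch carry no real input. The sequence lengths and the maximum length of 384 are hypothetical, and the result is not meant to reproduce the 35% figure above, which is an end-to-end measurement.

```python
def wasted_fraction(lengths, max_len):
    """Fraction of token positions that are padding when every sequence
    in the batch is padded to `max_len`."""
    if not lengths:
        return 0.0
    if any(n > max_len for n in lengths):
        raise ValueError("a sequence is longer than max_len")
    padded = len(lengths) * max_len  # positions processed with padding
    actual = sum(lengths)            # positions that hold real tokens
    return (padded - actual) / padded


# Hypothetical batch of four sequences with different token counts.
batch_lengths = [384, 96, 128, 160]
waste = wasted_fraction(batch_lengths, max_len=384)
print(f"{waste:.0%} of the padded batch is padding")
```

Processing only the real tokens avoids that wasted work, which is the motivation for variable sequence length support.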

Related Concepts

AI Inference Optimization Techniques

GPU Architecture and Performance

Real-time AI Services and Deployment Strategies