Deploying YOLOv5 on NVIDIA Jetson Orin with cuDLA: Quantization-Aware Training to Inference

Learn how to run an entire object detection pipeline on Orin in the most efficient way using YOLOv5 on its dedicated Deep Learning Accelerator.

Lynette Zhao
11 min read · Intermediate

Overview

This article is a guide to deploying YOLOv5 on the NVIDIA Jetson Orin platform with cuDLA. It focuses on Quantization-Aware Training (QAT) and on converting the resulting QAT model into a Post-Training Quantization (PTQ) model for efficient inference. It covers training, deploying, and validating the model while optimizing performance on Orin's Deep Learning Accelerator (DLA).

What You'll Learn

  1. How to train a YOLOv5 model using Quantization-Aware Training (QAT)
  2. How to convert a QAT model to a Post-Training Quantization (PTQ) model for deployment
  3. How to deploy a YOLOv5 model on NVIDIA Jetson Orin using cuDLA
  4. Why performance profiling is essential for validating inference accuracy on DLA

Prerequisites & Requirements

  • Understanding of deep learning concepts and object detection algorithms
  • Familiarity with TensorRT and cuDLA (optional)
  • Experience with PyTorch and model training

Key Questions Answered

How does Quantization-Aware Training (QAT) improve YOLOv5 model performance?
Quantization-Aware Training (QAT) helps in balancing inference performance and accuracy by simulating the effects of quantization during training. This approach allows the model to learn how to minimize the quantization error, leading to better performance when deployed on hardware like the NVIDIA Jetson Orin's DLA.
What are the steps to convert a QAT model to a PTQ model?
To convert a QAT model to a PTQ model, you need to extract quantization scales from Q/DQ nodes in the QAT model, use neighboring layer information to infer scales for other layers, and then export the ONNX model without Q/DQ nodes along with the calibration cache for TensorRT to build a DLA engine.
What is the performance of YOLOv5 on NVIDIA Jetson Orin DLA?
The YOLOv5 model achieves a mean Average Precision (mAP) of 37.3 on the COCO dataset when deployed on the NVIDIA Jetson Orin DLA, with an inference speed of over 400 frames per second (FPS). This demonstrates the effectiveness of using DLA for real-time object detection tasks.
What are the differences between hybrid mode and standalone mode in cuDLA?
In hybrid mode, DLA tasks are submitted to a CUDA stream, allowing seamless synchronization with other CUDA tasks. In standalone mode, cuDLA requires explicit wait and signal events, which can save resources by avoiding the creation of a CUDA context, making it suitable for pipelines without CUDA dependencies.
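
The export half of the QAT-to-PTQ conversion described above can be sketched as follows. This is a minimal illustration, not the article's actual script. It assumes the scales have already been read from the Q/DQ nodes, and it assumes the commonly seen text layout of a TensorRT calibration cache: a header line followed by one `name: hex` line per tensor, where the hex digits are the float32 bit pattern of the scale. The tensor names, the scale values, and the header string are hypothetical placeholders; the real header must match the TensorRT version on the target.

```python
import struct


def encode_scale(scale: float) -> str:
    """Return the float32 bit pattern of `scale` as 8 big-endian hex digits."""
    return struct.pack(">f", scale).hex()


def build_calibration_cache(scales: dict, header: str) -> str:
    """Assemble cache text: a header line, then one 'name: hex' line per tensor.

    The layout is an assumption for illustration; check it against a cache
    produced by the TensorRT version you actually deploy with.
    """
    lines = [header]
    for name, scale in scales.items():
        lines.append(f"{name}: {encode_scale(scale)}")
    return "\n".join(lines) + "\n"


# Hypothetical tensor names and scales, standing in for values read from Q/DQ nodes.
qdq_scales = {"images": 1.0 / 127.0, "conv0_out": 0.5}

# Placeholder header, not a real version string.
cache_text = build_calibration_cache(qdq_scales, header="TRT-XXXX-EntropyCalibration2")
print(cache_text)
```

Keeping the encoding in its own function makes it easy to spot-check a few known values (for example, a scale of 1.0 encodes as `3f800000`) before handing the file to the engine builder.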

Key Statistics & Figures

  • mean Average Precision (mAP): 37.3 (achieved on the COCO dataset with DLA INT8)
  • Frames per second (FPS): over 400 (inference speed of YOLOv5 on a single NVIDIA Jetson Orin DLA)
  • Inference time: 2.4 ms (improved inference time for YOLOv5 in INT8 with some layers in FP16)
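
The latency and throughput figures can be related with a quick calculation. The sketch below assumes one image per inference and back-to-back execution with no overlap between requests; the source does not state that the two figures come from the same measurement, so treat this as a consistency check only.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert per-inference latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


fps = fps_from_latency_ms(2.4)
print(f"{fps:.1f} FPS")  # prints "416.7 FPS"
```

Under those assumptions, 2.4 ms per inference corresponds to roughly 417 FPS, which agrees with the "over 400" figure.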

Technologies & Tools

  • NVIDIA Jetson Orin (Hardware): Embedded platform for AI workloads
  • cuDLA (Software): CUDA runtime interface for DLA
  • TensorRT (Software): Framework for optimizing deep learning models for inference
  • YOLOv5 (Algorithm): Object detection algorithm used for training and inference

Key Actionable Insights

  1. Implementing Quantization-Aware Training (QAT) can significantly improve the accuracy of your YOLOv5 model when deploying on DLA.
     Training with QAT prepares the model for quantization effects, which is crucial for maintaining accuracy on limited-precision hardware like the DLA.
  2. Using the cuDLA APIs for inference can streamline the integration of DLA tasks with existing CUDA workflows.
     Developers can use the DLA's compute capacity while staying compatible with other CUDA tasks, which helps overall system performance.
  3. Profiling your model's performance on DLA can help identify bottlenecks and areas for optimization.
     Layer-wise profiling with a tool such as the cuDLA sample lets developers make informed decisions about model architecture changes that improve inference speed.
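
The first insight rests on simulating quantization during training. A minimal sketch of that "fake quantization" step for a single value is shown below, assuming symmetric INT8 with a clamp range of [-128, 127]. The function name and the example scale are illustrative; in practice a quantization toolkit applies this per tensor inside the training graph.

```python
def fake_quantize(x: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """Quantize x to an integer grid and map it straight back to float.

    The result is the value the INT8 hardware would effectively see, so the
    difference from x is the quantization error the model learns to tolerate.
    """
    q = round(x / scale)          # snap to the nearest integer level
    q = max(qmin, min(qmax, q))   # clamp to the representable INT8 range
    return q * scale              # dequantize


scale = 0.25  # illustrative scale; real scales come from calibration or training
print(fake_quantize(0.3, scale))    # rounding error: 0.3 becomes 0.25
print(fake_quantize(100.0, scale))  # clipping error: 100.0 saturates at 31.75
```

The two calls show the two error sources QAT exposes the model to: rounding inside the representable range and clipping outside it.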

Common Pitfalls

  1. Failing to properly calibrate the model during the quantization process can lead to significant drops in accuracy.
     Calibration determines the scale values used for quantization. Without accurate calibration, the model may not perform well on DLA, resulting in lower mAP scores.
  2. Not validating the model on the target hardware can lead to discrepancies between expected and actual performance.
     Computations on DLA may not match GPU results bit for bit, so validate the model on the actual deployment hardware to confirm it meets expectations.
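
Because DLA and GPU results may differ bit for bit, a comparison between them should use a tolerance rather than exact equality. The sketch below shows one way to do that on flat lists of output values. The scores and the tolerance are made-up illustrative numbers, and a check like this complements, rather than replaces, measuring mAP on the target device.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two output lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def outputs_agree(reference, candidate, atol):
    """True when every element differs by at most `atol`."""
    return max_abs_diff(reference, candidate) <= atol


# Illustrative values only: reference (e.g. GPU) vs. target (e.g. DLA) scores.
gpu_scores = [0.91, 0.40, 0.05]
dla_scores = [0.90, 0.42, 0.05]

print(gpu_scores == dla_scores)                          # exact match fails
print(outputs_agree(gpu_scores, dla_scores, atol=0.05))  # tolerance check passes
```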

Related Concepts

Quantization-Aware Training
Post-Training Quantization
Deep Learning Accelerator (DLA)
Performance Profiling in Deep Learning