Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries

Microsoft Bing Visual Search enables people around the world to find content using photographs as queries. The heart of this capability is Microsoft’s TuringMM…

William Raveane

Overview

This article describes how NVIDIA accelerated libraries were used to optimize Microsoft Bing Visual Search, focusing on the TuringMM visual embedding model. Working with the Microsoft Bing team, the authors used NVIDIA TensorRT, CV-CUDA, and nvImageCodec to reach a 5.13x speedup in end-to-end throughput along with significant cost reductions.

What You'll Learn

1. How to optimize image processing pipelines using NVIDIA libraries
2. Why using TensorRT can significantly improve deep learning inference performance
3. When to implement batch decoding for image processing tasks

Prerequisites & Requirements

  • Understanding of deep learning and image processing concepts
  • Familiarity with NVIDIA TensorRT and CV-CUDA libraries (optional)

Key Questions Answered

How did NVIDIA libraries improve the performance of Bing Visual Search?
NVIDIA libraries like TensorRT, CV-CUDA, and nvImageCodec optimized the Bing Visual Search pipeline, achieving a 5.13x speedup in end-to-end throughput. The use of TensorRT enhanced model inference performance, while CV-CUDA and nvImageCodec accelerated image decoding and preprocessing, resulting in significant efficiency gains.
What was the baseline performance of Bing's original implementation?
The baseline performance of Bing's original implementation using OpenCV and ONNXRuntime-CUDA was 88 queries per second (QPS). This was significantly improved to 356 QPS with ONNXRuntime-TensorRT and further to 452 QPS with the optimized pipeline using nvImageCodec and CV-CUDA.
What specific optimizations were made to the image processing pipeline?
The optimizations included the introduction of nvImageCodec for image decoding and CV-CUDA for preprocessing. These libraries allowed for batch processing and GPU acceleration, which reduced image processing time by up to 6.2x compared to the original OpenCV implementation.
What are the benefits of using batch decoding in image processing?
Batch decoding allows multiple images to be decoded simultaneously, maximizing GPU efficiency and significantly reducing the overall processing time. This is particularly beneficial when handling large datasets, as it can lead to substantial performance improvements in image processing tasks.
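
The batching idea described above can be sketched in plain Python. This is a minimal illustration, not the Bing or NVIDIA implementation: `decode_batch` is a hypothetical callable standing in for whatever GPU batch decoder is used, and the batch size of 32 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def decode_in_batches(
    encoded_images: Sequence[bytes],
    decode_batch: Callable[[Sequence[bytes]], List[R]],
    batch_size: int = 32,  # assumed value; tune for the actual GPU and image sizes
) -> List[R]:
    """Call the (hypothetical) batch decoder once per batch instead of once per image."""
    decoded: List[R] = []
    for batch in batched(encoded_images, batch_size):
        decoded.extend(decode_batch(batch))
    return decoded


if __name__ == "__main__":
    # Stand-in decoder that only reports the byte length of each "image".
    fake_decoder = lambda batch: [len(b) for b in batch]
    print(decode_in_batches([b"a", b"bb", b"ccc"], fake_decoder, batch_size=2))
```

The point of the sketch is the call pattern: the decoder is invoked once per group of images, so any fixed per-call cost is shared across the whole batch.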

Key Statistics & Figures

End-to-end throughput improvement
5.13x
Achieved through the use of NVIDIA acceleration libraries in the Bing Visual Search pipeline.
Baseline performance with OpenCV + ONNXRuntime-CUDA
88 QPS
This was the initial throughput before optimizations were implemented.
Throughput with nvImageCodec + CV-CUDA + ONNXRuntime-TensorRT
452 QPS
This represents the performance after applying all optimizations.
Image processing speedup
up to 6.2x
Achieved by using CV-CUDA and nvImageCodec for image decoding and preprocessing.
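
As a quick check on the figures above, the speedups can be recomputed from the quoted QPS values. The dictionary labels are shorthand chosen here; the numbers are the ones stated in this summary. Because the QPS values are rounded, the ratio comes out near 5.14, which is consistent with the reported 5.13x.

```python
# Throughput figures (queries per second) as quoted in this summary.
BASELINE = "OpenCV + ONNXRuntime-CUDA"
qps = {
    BASELINE: 88,
    "ONNXRuntime-TensorRT": 356,
    "nvImageCodec + CV-CUDA + ONNXRuntime-TensorRT": 452,
}


def speedup(config: str, baseline: str = BASELINE) -> float:
    """Return throughput of `config` relative to `baseline`."""
    return qps[config] / qps[baseline]


for name, value in qps.items():
    print(f"{name}: {value} QPS, {speedup(name):.2f}x vs. baseline")
```

Most of the gain comes from switching the inference backend to TensorRT (roughly 4x); the GPU image decoding and preprocessing stage contributes the remaining step from 356 to 452 QPS.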

Technologies & Tools

Backend
NVIDIA TensorRT
Used for optimizing deep learning model inference.
Backend
CV-CUDA
Used for GPU-accelerated image processing operations.
Backend
nvImageCodec
Used for efficient image format decoding.
Backend
ONNX Runtime
Used as the model execution backend for ONNX graphs.
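
One common way to run an ONNX graph through TensorRT is ONNX Runtime's TensorRT execution provider. The sketch below shows that general pattern under stated assumptions: the fallback order is a choice made here, the helper names are hypothetical, and the article's actual configuration and provider options are not given in this summary.

```python
from typing import List, Sequence

# Preference order assumed for this sketch: TensorRT first, then CUDA, then CPU.
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def pick_providers(
    available: Sequence[str],
    preferred: Sequence[str] = PREFERRED_PROVIDERS,
) -> List[str]:
    """Keep the preferred providers that are actually available, in preference order."""
    return [p for p in preferred if p in available]


def create_session(model_path: str):
    """Create an ONNX Runtime session with the best available execution provider.

    `model_path` is a placeholder for an exported ONNX model file.
    """
    import onnxruntime as ort  # imported lazily so pick_providers works without it

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Listing several providers lets ONNX Runtime fall back to CUDA or CPU for any part of the graph that the TensorRT provider does not handle.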

Key Actionable Insights

1. Leverage NVIDIA TensorRT to optimize deep learning models for better performance.
Using TensorRT can significantly enhance inference speeds, especially for transformer architectures, making it well suited to applications requiring real-time responsiveness.
2. Implement batch processing for image decoding to improve throughput.
Batch processing can raise throughput and reduce overall processing time, particularly when dealing with large volumes of images, as seen in the Bing Visual Search optimization.
3. Utilize CV-CUDA for GPU-accelerated image processing tasks.
CV-CUDA's optimized image processing operations can lead to substantial speed improvements, especially when processing diverse image sizes and formats.

Common Pitfalls

1. Relying solely on the CPU for image processing can lead to significant bottlenecks.
CPU processing is often slower than GPU processing, especially when handling large volumes of images, so the decoding and preprocessing stage can limit the throughput of the whole pipeline.

Related Concepts

Deep Learning Optimization
Image Processing Techniques
NVIDIA GPU Acceleration