Getting a Real Time Factor Over 60 for Text-To-Speech Services Using NVIDIA Riva

NVIDIA Riva is an application framework that provides several pipelines for accomplishing conversational AI tasks. Generating high-quality…

Dominique LaSalle
17 min read, advanced

Overview

The article discusses optimizations made to the Text-to-Speech (TTS) pipeline using NVIDIA Riva, focusing on achieving a real-time factor (RTF) over 60. It covers the architecture of the TTS model, including the Tacotron2 and WaveGlow networks, and details the implementation strategies that enhance performance using NVIDIA TensorRT and CUDA.

What You'll Learn

  1. How to implement a high-performance TTS pipeline using NVIDIA Riva
  2. Why using the C++ TensorRT interface can reduce CPU overhead
  3. How to optimize neural network performance with custom CUDA plugins
  4. When to use the ONNX parser for model conversion

Prerequisites & Requirements

  • Understanding of neural networks and deep learning concepts
  • Familiarity with NVIDIA TensorRT and CUDA (optional)
  • Experience with PyTorch and model optimization techniques

Key Questions Answered

What are the main components of the TTS pipeline in NVIDIA Riva?
The TTS pipeline in NVIDIA Riva consists of two main components: the Tacotron2 network, which converts text to mel-scale spectrograms, and the WaveGlow network, which generates audio waveforms from these spectrograms. This architecture allows for efficient and high-quality speech synthesis.
How does the implementation achieve a real-time factor over 60?
The implementation achieves a real-time factor over 60 by optimizing the TTS pipeline using NVIDIA TensorRT and custom CUDA plugins. These optimizations reduce latency and improve GPU utilization, allowing the system to generate audio at a speed significantly faster than real-time.
What performance improvements are achieved with TensorRT 7.1?
With TensorRT 7.1 on the A100 GPU, the TTS pipeline reaches a real-time factor of 61.4x, generating 7.3 seconds of audio in less than 120 milliseconds. For comparison, the optimized implementation on the V100 GPU with TensorRT 7.0 reached 33.7x.
What are the benefits of using custom plugins in the Tacotron2 decoder?
Custom plugins in the Tacotron2 decoder allow for low-level optimizations that reduce CPU overhead and improve GPU utilization. By fusing multiple operations into single kernels, the implementation minimizes the number of kernel launches, leading to faster execution times.
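
To make the kernel-fusion point concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the article: the per-launch overhead and the operation names are hypothetical, chosen only to show how CPU-side issue time scales with the number of kernel launches in one decoder step.

```python
# Toy model: CPU time spent issuing kernels for one decoder step.
# All numbers and operation names are hypothetical, for illustration only.

LAUNCH_OVERHEAD_US = 5.0  # assumed CPU cost to issue a single kernel launch


def cpu_issue_time_us(kernels, overhead_us=LAUNCH_OVERHEAD_US):
    """Return the CPU time (microseconds) needed to launch every kernel once."""
    return len(kernels) * overhead_us


# One kernel per elementwise or matrix operation (hypothetical breakdown).
unfused_step = ["matmul", "bias_add", "sigmoid", "tanh", "multiply", "add"]

# The same work expressed as one fused kernel (hypothetical plugin name).
fused_step = ["fused_cell"]

unfused_us = cpu_issue_time_us(unfused_step)
fused_us = cpu_issue_time_us(fused_step)

print(f"unfused: {len(unfused_step)} launches, {unfused_us:.1f} us of CPU issue time")
print(f"fused:   {len(fused_step)} launch, {fused_us:.1f} us of CPU issue time")
```

In this model the saving is proportional to the number of launches removed. Real launch costs depend on the driver, the GPU, and the kernel arguments, so the figures should be measured with a profiler rather than assumed.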

Key Statistics & Figures

Real-time factor (RTF)
33.7x
This RTF was achieved with the optimized implementation on the V100 GPU using TensorRT 7.0.
Latency for generating audio
200 ms
This latency corresponds to generating 6.7 seconds of audio using the TensorRT C++ API with plugins.
Real-time factor with TensorRT 7.1
61.4x
This RTF was measured on the A100 GPU, generating 7.3 seconds of audio in less than 120 milliseconds.
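
The figures above are consistent with reading the real-time factor as seconds of audio produced per second of wall-clock time. The short Python sketch below checks that arithmetic; the function names are illustrative, and the definition is the one implied by the quoted numbers rather than a formula stated in this summary.

```python
def real_time_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return audio_seconds / latency_seconds


def latency_for(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to generate audio_seconds at a given RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# V100 figures quoted above: 6.7 s of audio in roughly 200 ms.
v100_rtf = real_time_factor(6.7, 0.200)

# A100 figures quoted above: 7.3 s of audio at an RTF of 61.4x.
a100_latency_ms = latency_for(7.3, 61.4) * 1000.0

print(f"V100 RTF from rounded latency: {v100_rtf:.1f}x")
print(f"A100 latency implied by 61.4x: {a100_latency_ms:.1f} ms")
```

The V100 value computed from the rounded 200 ms latency lands close to the quoted 33.7x, and the A100 latency implied by 61.4x falls just under the quoted 120 ms bound.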

Technologies & Tools


Application Framework
NVIDIA Riva
Used for building and optimizing the TTS pipeline.
Performance Optimization Tool
TensorRT
Utilized for optimizing neural network inference performance.
Parallel Computing Platform
CUDA
Employed for implementing custom plugins and optimizing performance.
Deep Learning Framework
PyTorch
Used for training the Tacotron2 and WaveGlow models.
Model Format
ONNX
Facilitates the conversion of models from PyTorch to TensorRT.

Key Actionable Insights

  1. Utilize the C++ TensorRT interface for building neural networks to minimize CPU overhead.
     This approach is particularly beneficial for applications requiring low latency, as it reduces the time spent coordinating tasks between the CPU and GPU.
  2. Implement custom CUDA plugins to optimize specific layers in your neural network.
     By doing so, you can achieve significant performance improvements, especially in scenarios where traditional layers may introduce bottlenecks.
  3. Leverage the ONNX parser for efficient model conversion when transitioning from PyTorch to TensorRT.
     This method simplifies the process of optimizing models for inference, ensuring that you can take advantage of TensorRT's capabilities quickly.

Common Pitfalls

  1. Failing to optimize the decoder loop can lead to low GPU utilization.
     This occurs when the CPU cannot generate work fast enough for the GPU, resulting in idle GPU time. To avoid this, implement custom plugins to reduce CPU overhead and improve execution efficiency.
  2. Not using the ONNX parser effectively may hinder performance.
     If the ONNX model is not optimized for TensorRT, it can lead to suboptimal inference speeds. Ensure that the model is properly exported and configured for best performance.
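
The first pitfall can be illustrated with a rough Python model of the decoder loop. Everything numeric here is an assumption rather than a figure from the article: a 22,050 Hz sample rate and a hop of 256 samples per mel frame (common settings for Tacotron2 and WaveGlow), one mel frame per decoder step, and CPU issue work that overlaps with GPU execution.

```python
import math

SAMPLE_RATE_HZ = 22050  # assumed audio sample rate
HOP_LENGTH = 256        # assumed audio samples per mel-spectrogram frame


def decoder_steps(audio_seconds: float) -> int:
    """Decoder iterations needed, assuming one mel frame per step."""
    return math.ceil(audio_seconds * SAMPLE_RATE_HZ / HOP_LENGTH)


def gpu_utilization(cpu_issue_us: float, gpu_compute_us: float) -> float:
    """Fraction of each step the GPU is busy when CPU and GPU work overlap.

    The step takes as long as the slower side, so the GPU idles whenever
    the CPU needs longer to issue work than the GPU needs to execute it.
    """
    return gpu_compute_us / max(cpu_issue_us, gpu_compute_us)


steps = decoder_steps(6.7)

# Hypothetical per-step costs in microseconds.
cpu_bound = gpu_utilization(cpu_issue_us=100.0, gpu_compute_us=40.0)
gpu_bound = gpu_utilization(cpu_issue_us=20.0, gpu_compute_us=40.0)

print(f"decoder steps for 6.7 s of audio: {steps}")
print(f"GPU utilization when CPU-bound: {cpu_bound:.0%}")
print(f"GPU utilization when GPU-bound: {gpu_bound:.0%}")
```

Because the loop runs hundreds of times per utterance under these assumptions, any per-step CPU overhead is multiplied accordingly, which is why cutting launches in the decoder matters more than in code that runs once.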

Related Concepts

Neural Network Optimization Techniques
Performance Profiling With NVIDIA Nsight Systems
Advanced CUDA Programming