Deploying NVIDIA Riva Multilingual ASR with Whisper and Canary Architectures While Selectively Deactivating

NVIDIA has consistently developed automatic speech recognition (ASR) models that set the benchmark in the industry. Earlier versions of NVIDIA Riva…

Sven Chilton
12 min read, intermediate

Overview

This article covers deploying NVIDIA Riva's multilingual automatic speech recognition (ASR) with the Whisper and Canary architectures. It highlights what is new in Riva 2.18.0: support for additional ASR models, SSML tags that selectively disable translation, and practical implementation examples.

What You'll Learn

1. How to deploy NVIDIA Riva for multilingual ASR using Whisper and Canary architectures
2. How to utilize SSML tags for selective translation in NVIDIA Riva
3. How to perform Any-to-English Automatic Speech Translation (AST) with Riva

Prerequisites & Requirements

  • Familiarity with Automatic Speech Recognition (ASR) concepts
  • Access to the NVIDIA Riva SDK and Docker

Key Questions Answered

What are the new features in Riva 2.18.0 for ASR and AST?
Riva 2.18.0 introduces support for the Parakeet model for streaming multilingual ASR, Whisper and Distil-Whisper for offline ASR, and Canary models for various AST tasks. It also adds new SSML tags and dictionaries to enhance translation capabilities.
How can I launch a Riva server with Whisper capabilities?
To launch a Riva server with Whisper, set the appropriate variables in the config.sh script, run riva_init.sh to download the models, and then run riva_start.sh to start the server. Make sure the NGC API key is set as an environment variable first.
What is the purpose of the <dnt> SSML tag in Riva?
<dnt> SSML tags instruct the Megatron NMT model not to translate the enclosed text. This is useful for preserving proper names or phrases that should remain untranslated in the target language.
How does Whisper handle language detection for ASR?
In Riva 2.18.0, Whisper can automatically detect the language of the input audio when the language_code parameter is set to 'multi'. However, Canary does not support automatic language detection.
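
As a rough illustration of the <dnt> answer above, the sketch below wraps chosen terms in <dnt> tags before the text is sent for translation. The helper name protect_terms is hypothetical and not part of Riva; the sketch only prepares the string and assumes the translation request itself is made elsewhere. Any other markup the server may expect around the text is not covered here.

```python
import re
from typing import Iterable


def protect_terms(text: str, terms: Iterable[str]) -> str:
    """Wrap each occurrence of the given terms in <dnt>...</dnt> tags.

    Hypothetical helper for illustration; it is not a Riva API.
    """
    # Longest terms first, so "Riva Skills" is matched before "Riva".
    ordered = sorted({t for t in terms if t}, key=len, reverse=True)
    if not ordered:
        return text
    # A single regex pass means already wrapped text is never wrapped twice.
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    return pattern.sub(lambda m: f"<dnt>{m.group(0)}</dnt>", text)


if __name__ == "__main__":
    print(protect_terms("Riva Skills runs on Riva.", ["Riva", "Riva Skills"]))
```

Matching here is case-sensitive and purely textual, which is a design simplification; a production workflow would probably need word-boundary handling and a curated term list.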

Technologies & Tools

  • NVIDIA Riva (backend): Used for automatic speech recognition and translation tasks.
  • Whisper (model): Provides offline ASR and AST capabilities.
  • Canary (model): Supports offline ASR and various AST tasks.

Key Actionable Insights

1. Applying the new SSML tags in your translation workflows can improve translation quality, especially for specialized terms.
   By wrapping critical terms in <dnt> tags, you prevent them from being altered during translation, so the output retains its intended meaning.
2. Use the Whisper model when you need offline ASR, which in Riva means transcribing complete recordings rather than a live audio stream.
   This suits batch transcription and translation of recorded audio; for streaming multilingual ASR, Riva 2.18.0 supports the Parakeet model instead.
3. Use the Riva Skills Quick Start resource to streamline setup.
   The provided scripts and configuration examples can save time and reduce errors during deployment, making it easier to integrate Riva into your applications.
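
The Quick Start launch sequence mentioned above (riva_init.sh followed by riva_start.sh) could be automated roughly as follows. This is a sketch under assumptions: the environment variable name NGC_API_KEY, the function names, and the idea of running the scripts from inside the Quick Start directory are illustrative choices, not details confirmed by the article. It also assumes config.sh has already been edited.

```python
import os
import subprocess
from typing import List


def launch_commands() -> List[List[str]]:
    """Return the Quick Start commands in the order they should run."""
    return [
        ["bash", "riva_init.sh"],   # downloads and prepares the models
        ["bash", "riva_start.sh"],  # starts the Riva server
    ]


def launch_riva(quickstart_dir: str, key_var: str = "NGC_API_KEY") -> None:
    """Run both scripts from quickstart_dir, stopping at the first failure."""
    # Assumption: the NGC API key is read from an environment variable
    # with this name.
    if not os.environ.get(key_var):
        raise RuntimeError(f"{key_var} is not set; export it before launching")
    for cmd in launch_commands():
        # check=True raises CalledProcessError, so the server is never
        # started after a failed initialization.
        subprocess.run(cmd, cwd=quickstart_dir, check=True)


# Example call (the path is a placeholder):
# launch_riva("/path/to/riva_quickstart")
```

Checking for the key before running anything avoids a partial download that fails midway for lack of credentials.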

Common Pitfalls

1. Failing to set the correct language code for the Canary model can lead to errors in transcription.
   Canary does not support the 'multi' language code, so users must specify a single language code to avoid issues during inference.
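
One way to catch this pitfall before a request reaches the server is a small client-side check. The sketch below is illustrative only: check_language_code and the MULTI_UNSUPPORTED set are hypothetical names, the model labels are plain strings chosen for this example, and the resulting value is assumed to be passed on as the language_code parameter of the recognition request.

```python
# Models that reject the 'multi' language code, per the pitfall above.
MULTI_UNSUPPORTED = {"canary"}


def check_language_code(model: str, language_code: str) -> str:
    """Return a cleaned language code, or raise if the model cannot use it.

    Hypothetical helper for illustration; it is not a Riva API.
    """
    code = language_code.strip()
    if code.lower() == "multi" and model.strip().lower() in MULTI_UNSUPPORTED:
        raise ValueError(
            f"{model} does not support automatic language detection; "
            "pass a single language code such as 'en-US' instead of 'multi'"
        )
    return code


if __name__ == "__main__":
    print(check_language_code("whisper", "multi"))  # Whisper may auto-detect
    print(check_language_code("canary", "en-US"))   # Canary needs one language
```

Failing fast on the client keeps the error message specific, rather than relying on whatever the server reports at inference time.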

Related Concepts

Automatic Speech Recognition (ASR)
Automatic Speech Translation (AST)
Neural Machine Translation (NMT)
NVIDIA Riva SDK