New Standard for Speech Recognition and Translation from the NVIDIA NeMo Canary Model

Elena Rastorgueva

NVIDIA NeMo is an end-to-end platform for the development of multimodal generative AI models at scale anywhere—on any cloud and on-premises. The NeMo team just…

NVIDIA

Apache, Cython, Gradio, PyTorch, Whisper

Overview

The article discusses the release of the NVIDIA NeMo Canary model, a state-of-the-art multilingual model for speech recognition and translation. It highlights its capabilities in transcribing and translating audio in English, Spanish, German, and French with high accuracy, and provides insights into its architecture and usage.

What You'll Learn

1. How to use the Canary model for speech transcription and translation
2. Why the Canary model outperforms other models in transcription accuracy
3. How to install and set up NVIDIA NeMo for using the Canary model

Prerequisites & Requirements

NVIDIA NeMo, Cython, and PyTorch (version 2.0 or later) must be installed.

Key Questions Answered

What languages does the Canary model support for transcription and translation?

The Canary model transcribes speech in English, Spanish, German, and French. It also translates in both directions between English and each of the other three languages: from English to Spanish, German, or French, and from those languages to English.

How does the performance of the Canary model compare to other models?

At the time of the article, the Canary model ranked at the top of the Hugging Face Open ASR Leaderboard with an average word error rate (WER) of 6.67%, outperforming models such as Whisper-large-v3 and SeamlessM4T-Medium-v1 on transcription and translation tasks.

What is the architecture of the Canary model?

Canary is an encoder-decoder model whose encoder is Fast-Conformer, a Conformer variant optimized for efficiency that achieves approximately 3x savings on compute and 4x savings on memory relative to the standard Conformer. The encoder processes audio as log-mel spectrogram features.

How can you transcribe audio files using the Canary model?

To transcribe audio files, you can load the Canary model from NeMo and use the transcribe method, specifying the audio file path and language parameters. This allows for transcription in various supported languages.
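The answer above mentions passing an audio file path together with language parameters. The following is a minimal sketch of one way to do that with a JSON-lines manifest. The manifest keys (`audio_filepath`, `duration`, `taskname`, `source_lang`, `target_lang`, `pnc`, `answer`), the translation task name `s2t_translation`, and the checkpoint name `nvidia/canary-1b` are assumptions recalled from the published Canary model card, not details given in this summary. They have changed between NeMo releases, so check them against the version you install. The helper names are hypothetical.

```python
import json
from pathlib import Path

SUPPORTED_LANGS = {"en", "es", "de", "fr"}  # languages listed in the article


def build_manifest_entry(audio_path, source_lang="en", target_lang="en",
                         duration=None, translation_task="s2t_translation"):
    """Return one manifest record for a single audio file.

    The key names and task names are assumptions (see the lead-in).
    """
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language code: {lang!r}")
    # Same source and target language means plain transcription (ASR);
    # differing languages means speech-to-text translation.
    task = "asr" if source_lang == target_lang else translation_task
    return {
        "audio_filepath": str(audio_path),
        "duration": duration,      # seconds; some releases require a number
        "taskname": task,
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes",              # request punctuation and capitalization
        "answer": "na",            # placeholder, no reference text at inference
    }


def write_manifest(entries, manifest_path):
    """Write the entries as JSON lines, one record per line."""
    manifest_path = Path(manifest_path)
    with manifest_path.open("w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")
    return manifest_path


def transcribe_with_canary(manifest_path, batch_size=16):
    """Run inference on a manifest. Requires NeMo; the call shape is assumed."""
    from nemo.collections.asr.models import EncDecMultiTaskModel

    model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
    return model.transcribe(str(manifest_path), batch_size=batch_size)
```

A typical use would build one entry per file, for example `build_manifest_entry("talk.wav", "en", "de", duration=12.4)` for English-to-German translation, write them with `write_manifest`, and pass the resulting path to `transcribe_with_canary`. Only the last step needs NeMo and a model download.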

Key Statistics & Figures

Average word error rate (WER)

6.67%

This is the Canary model's average WER on the Hugging Face Open ASR Leaderboard.

Training data used for speech recognition

85K hours

This is the amount of speech data used to train the Canary model for speech recognition.

WER on MCV 16.1 test sets

5.77%

This is the WER achieved by the Canary model on the Mozilla Common Voice (MCV) 16.1 test sets for English, Spanish, French, and German.

Average BLEU score for translation from English

30.57

This score reflects the translation quality of the Canary model when translating from English to Spanish, French, and German.

Average BLEU score for translation to English

34.25

This score reflects the translation quality of the Canary model when translating from Spanish, French, and German to English.
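For readers unfamiliar with the WER metric quoted in these statistics, here is a minimal sketch of how it is usually defined: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. This is a generic illustration, not the leaderboard's scoring code, and it skips the text normalization (casing, punctuation) that real evaluations typically apply first. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)
```

For example, `word_error_rate("turn the lights off", "turn lights of")` gives 0.5: one deletion ("the") and one substitution ("off" to "of") over four reference words. A figure such as 6.67% is this ratio, expressed as a percentage and averaged over the evaluation sets.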

Technologies & Tools

Framework

NVIDIA NeMo

Used for developing and deploying the Canary model.

Architecture

Fast-Conformer

The encoder used in the Canary model for efficient speech recognition.

Library

PyTorch

Required for installing and running the NeMo toolkit.

Key Actionable Insights

1. Leverage the Canary model for multilingual applications to enhance user experience in diverse markets.
   By implementing the Canary model, developers can provide accurate speech recognition and translation services, catering to a broader audience and improving accessibility.

2. Utilize the efficient architecture of the Canary model to optimize resource usage in applications.
   The Fast-Conformer encoder's efficiency can lead to reduced computational costs, making it suitable for deployment in resource-constrained environments.

3. Explore the NVIDIA NeMo toolkit for building custom AI models tailored to specific needs.
   NVIDIA NeMo provides a flexible framework that allows developers to create and fine-tune models, enabling innovation in speech recognition and translation technologies.

Common Pitfalls

1. Failing to install the required dependencies before using the Canary model.
   If developers do not install NVIDIA NeMo, Cython, and PyTorch as specified, they may encounter errors when trying to load or use the model.
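One way to catch this pitfall early is to check for the dependencies before loading the model. The sketch below uses only the Python standard library. It assumes the usual import names (`nemo`, `Cython`, `torch`) and the `torch` distribution name for the version lookup. The helper names are hypothetical.

```python
import importlib.util
import re
from importlib import metadata

# Display name -> top-level import name (assumed, see the lead-in).
REQUIRED_MODULES = {"NVIDIA NeMo": "nemo", "Cython": "Cython", "PyTorch": "torch"}


def missing_modules(required=None):
    """Return the display names of required modules that cannot be found."""
    required = REQUIRED_MODULES if required is None else required
    return [label for label, module_name in required.items()
            if importlib.util.find_spec(module_name) is None]


def version_tuple(version):
    """Parse the leading numeric parts of a version, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 2:  # pad so that '2' compares like '2.0'
        parts.append(0)
    return tuple(parts)


def pytorch_is_recent_enough(minimum=(2, 0)):
    """Return True if an installed PyTorch meets the minimum version."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= minimum


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    elif not pytorch_is_recent_enough():
        print("PyTorch is installed but older than 2.0")
    else:
        print("All listed dependencies were found")
```

Running the script before the first model load turns a confusing import error into a short message that names the missing package.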

Related Concepts

Speech Recognition

Machine Translation

Generative AI

NVIDIA Riva