New GAN Can Lipread and Synthesize Speech

Researchers from Samsung and Imperial College in London developed a deep learning solution that uses computer vision for visual speech recognition.

Nefi Alarcon
2 min readbeginner
--
View Original

Overview

Researchers from Samsung and Imperial College London have developed a deep learning model that utilizes Generative Adversarial Networks (GANs) for lipreading and synthesizing speech from video. This innovative approach addresses the limitations of traditional audio speech recognition models in noisy environments, producing intelligible speech synchronized with video.

What You'll Learn

1

How to implement a GAN-based model for visual speech recognition

2

Why lipreading technology is beneficial for communication in noisy environments

3

How to leverage NVIDIA GPUs for deep learning model training and inference

Prerequisites & Requirements

  • Understanding of deep learning and GANs
  • Familiarity with PyTorch and cuDNN
  • Experience with training deep learning models(optional)

Key Questions Answered

What is the main innovation of the new GAN model developed by researchers?
The new GAN model innovatively maps video directly to raw audio, enabling it to produce intelligible speech synchronized with the video, even from previously unseen speakers. This is a significant advancement in visual speech recognition technology.
How does the model handle noisy environments for speech recognition?
The model utilizes computer vision for visual speech recognition, which allows it to lipread and synthesize audio from video, making it effective in noisy environments where traditional audio recognition fails.
What hardware was used for training and inference of the model?
The training of the model was conducted using an NVIDIA GeForce 1080 TI GPU, while inference was performed on an NVIDIA TITAN V GPU, allowing the model to generate audio for a three-second video in just 60 milliseconds.

Key Statistics & Figures

Inference time for audio generation
60ms
The model can generate audio for a three-second video in this time frame.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Hardware
Nvidia Geforce 1080 Ti
Used for training the deep learning model.
Hardware
Nvidia Titan V
Used for inference of the model.
Software
Cudnn
Accelerated the training process within the PyTorch framework.
Software
Pytorch
Deep learning framework used for developing the model.

Key Actionable Insights

1
Implementing GANs for visual speech recognition can significantly enhance communication in challenging environments.
This technology is particularly useful in settings like video conferencing where background noise can hinder audio clarity, allowing for clearer communication.
2
Utilizing NVIDIA GPUs can drastically reduce training and inference times for deep learning models.
By leveraging powerful GPUs like the GeForce 1080 TI and TITAN V, developers can achieve faster model performance, making it feasible to deploy complex models in real-time applications.

Common Pitfalls

1
Assuming that traditional audio recognition models will perform adequately in all environments.
Many audio speech recognition systems struggle in noisy settings, which can lead to misunderstandings and communication failures. It's crucial to explore alternative methods, such as visual speech recognition, in these scenarios.