Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN

Gogate, Mandar; Dashtipour, Kia; Hussain, Amir

Abstract

The human auditory cortex contextually integrates audio-visual (AV) cues to better understand speech in a cocktail party situation. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in low signal-to-noise ratio (SNR < −5 dB) environments compared to audio-only (A-only) SE models. However, despite substantial research in the area of AV SE, the development of real-time processing models that can generalise across various types of visual and acoustic noise remains a formidable technical challenge. This paper introduces a novel framework for low-latency, speaker-independent AV SE. The proposed framework is designed to generalise to the visual and acoustic noise encountered in real-world settings. In particular, a generative adversarial network (GAN) is proposed to address visual speech noise, including poor lighting, in real noisy environments. In addition, a novel real-time AV SE model based on a deep neural network (DNN) is proposed. The model leverages the enhanced visual speech from the GAN to deliver robust SE. The effectiveness of the proposed framework is evaluated on synthetic AV datasets using objective speech quality and intelligibility metrics. Furthermore, subjective listening tests are conducted using real noisy AV corpora. The results demonstrate that the proposed real-time AV SE framework improves the mean opinion score by 20% compared to state-of-the-art SE approaches, including recent DNN-based AV SE models.

Citation

Gogate, M., Dashtipour, K., & Hussain, A. (in press). Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN. IEEE Transactions on Artificial Intelligence. https://doi.org/10.1109/tai.2024.3366141

Journal Article Type: Article
Acceptance Date: Feb 1, 2024
Online Publication Date: Feb 15, 2024
Deposit Date: Apr 19, 2024
Publicly Available Date: Apr 22, 2024
Publisher: Institute of Electrical and Electronics Engineers
Peer Reviewed: Peer Reviewed
DOI: https://doi.org/10.1109/tai.2024.3366141
Keywords: audio-visual, speech enhancement, generative adversarial network
Public URL: http://researchrepository.napier.ac.uk/Output/3597066

Files

Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN (accepted manuscript) (7.1 Mb)
PDF




