Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN

Gogate, Mandar; Dashtipour, Kia; Hussain, Amir

doi:10.1109/tai.2024.3366141

Authors

Dr Mandar Gogate M.Gogate@napier.ac.uk
Principal Research Fellow

Dr Kia Dashtipour K.Dashtipour@napier.ac.uk
Lecturer

Prof Amir Hussain A.Hussain@napier.ac.uk / hussain.doctor@gmail.com
Professor

Abstract

The human auditory cortex contextually integrates audio-visual (AV) cues to better understand speech in a cocktail party situation. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in low signal-to-noise ratio (SNR < −5 dB) environments compared to audio-only (A-only) SE models. However, despite substantial research in the area of AV SE, the development of real-time processing models that can generalise across various types of visual and acoustic noise remains a formidable technical challenge. This paper introduces a novel framework for low-latency, speaker-independent AV SE. The proposed framework is designed to generalise to the visual and acoustic noise encountered in real-world settings. In particular, a generative adversarial network (GAN) is proposed to address the issue of visual speech noise, including poor lighting, in real noisy environments. In addition, a novel real-time AV SE model based on a deep neural network (DNN) is proposed. The model leverages the enhanced visual speech from the GAN to deliver robust SE. The effectiveness of the proposed framework is evaluated on synthetic AV datasets using objective speech quality and intelligibility metrics. Furthermore, subjective listening tests are conducted using real noisy AV corpora. The results demonstrate that the proposed real-time AV SE framework improves the mean opinion score by 20% compared to state-of-the-art SE approaches, including recent DNN-based AV SE models.

Citation

Gogate, M., Dashtipour, K., & Hussain, A. (online). Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN. IEEE Transactions on Artificial Intelligence, https://doi.org/10.1109/tai.2024.3366141

Journal Article Type	Article
Acceptance Date	Feb 1, 2024
Online Publication Date	Feb 15, 2024
Deposit Date	Apr 19, 2024
Publicly Available Date	Apr 22, 2024
Publisher	Institute of Electrical and Electronics Engineers
Peer Reviewed	Peer Reviewed
DOI	https://doi.org/10.1109/tai.2024.3366141
Keywords	audio-visual, speech enhancement, generative adversarial network
Public URL	http://researchrepository.napier.ac.uk/Output/3597066