Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN
Gogate, Mandar; Dashtipour, Kia; Hussain, Amir
Authors
Dr Mandar Gogate M.Gogate@napier.ac.uk
Principal Research Fellow
Dr Kia Dashtipour K.Dashtipour@napier.ac.uk
Lecturer
Prof Amir Hussain A.Hussain@napier.ac.uk
Professor
Abstract
The human auditory cortex contextually integrates audio-visual (AV) cues to better understand speech in a cocktail party situation. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in low signal-to-noise ratio (SNR < −5 dB) environments compared to audio-only (A-only) SE models. However, despite substantial research in the area of AV SE, the development of real-time processing models that can generalise across various types of visual and acoustic noise remains a formidable technical challenge. This paper introduces a novel framework for low-latency, speaker-independent AV SE. The proposed framework is designed to generalise to the visual and acoustic noise encountered in real-world settings. In particular, a generative adversarial network (GAN) is proposed to address the issue of visual speech noise, including poor lighting, in real noisy environments. In addition, a novel real-time AV SE model based on a deep neural network (DNN) is proposed. The model leverages the enhanced visual speech from the GAN to deliver robust SE. The effectiveness of the proposed framework is evaluated on synthetic AV datasets using objective speech quality and intelligibility metrics. Furthermore, subjective listening tests are conducted using real noisy AV corpora. The results demonstrate that the proposed real-time AV SE framework improves the mean opinion score by 20% compared to state-of-the-art SE approaches, including recent DNN-based AV SE models.
Citation
Gogate, M., Dashtipour, K., & Hussain, A. (in press). Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN. IEEE Transactions on Artificial Intelligence, https://doi.org/10.1109/tai.2024.3366141
Journal Article Type | Article |
---|---|
Acceptance Date | Feb 1, 2024 |
Online Publication Date | Feb 15, 2024 |
Deposit Date | Apr 19, 2024 |
Publicly Available Date | Apr 22, 2024 |
Publisher | Institute of Electrical and Electronics Engineers |
Peer Reviewed | Peer Reviewed |
DOI | https://doi.org/10.1109/tai.2024.3366141 |
Keywords | audio-visual, speech enhancement, generative adversarial network |
Public URL | http://researchrepository.napier.ac.uk/Output/3597066 |
Files
Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN (accepted manuscript) (PDF, 7.1 MB)