CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement

Gogate, Mandar; Dashtipour, Kia; Adeel, Ahsan; Hussain, Amir

Authors

Dr Mandar Gogate M.Gogate@napier.ac.uk
Senior Research Fellow

Dr Kia Dashtipour K.Dashtipour@napier.ac.uk
Lecturer

Ahsan Adeel

Prof Amir Hussain A.Hussain@napier.ac.uk
Professor
Abstract
Noisy situations pose significant problems for the hearing-impaired, as hearing aids often make speech more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress background noise and focus on the target speaker. In this paper, we present a novel language-, noise- and speaker-independent AV deep neural network (DNN) architecture, termed CochleaNet, for causal or real-time speech enhancement (SE). The model jointly exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. The proposed SE framework is evaluated using a first-of-its-kind AV binaural speech corpus, ASPIRE, recorded in real noisy environments, including cafeteria and restaurant settings. We demonstrate superior performance of our approach, in terms of both objective measures and subjective listening tests, over state-of-the-art SE approaches, including recent DNN-based SE models. In addition, our work challenges the popular belief that the scarcity of a multi-lingual, large-vocabulary AV corpus covering a wide variety of noises is a major bottleneck to building robust language-, speaker- and noise-independent SE systems. We show that a model trained on a synthetic mixture of the benchmark GRID corpus (with 33 speakers and a small English vocabulary) and CHiME-3 noises (comprising bus, pedestrian, cafeteria, and street noises) can generalise well, not only to large-vocabulary corpora with a wide variety of speakers and noises, but also to completely unrelated languages such as Mandarin.
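The abstract describes a causal model that fuses noisy acoustic cues with lip-derived visual cues to produce enhanced speech. The sketch below is not the published CochleaNet architecture; it is a minimal, hypothetical illustration (in PyTorch, with assumed tensor shapes, layer sizes and feature inputs) of the general idea: estimating a time-frequency mask from fused audio and visual streams using only past context, so the model remains causal.

```python
# Minimal illustrative sketch (NOT the published CochleaNet architecture):
# an ideal-ratio-mask estimator that fuses a noisy-speech spectrogram stream
# with a lip-region video stream. All shapes and layer sizes are assumptions.
import torch
import torch.nn as nn


class AVMaskEstimator(nn.Module):
    def __init__(self, n_freq_bins=257, lip_feat_dim=512, hidden=256):
        super().__init__()
        # Audio branch: frame-wise embedding of noisy STFT magnitude frames.
        self.audio_net = nn.Sequential(
            nn.Linear(n_freq_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Visual branch: embedding of precomputed lip-region features,
        # assumed to be upsampled to the audio frame rate beforehand.
        self.visual_net = nn.Sequential(
            nn.Linear(lip_feat_dim, hidden), nn.ReLU(),
        )
        # Fusion: concatenate per-frame audio and visual embeddings; a
        # unidirectional LSTM keeps the model causal (no future context).
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # Output: a [0, 1] time-frequency mask applied to the noisy magnitude.
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (batch, time, n_freq_bins) noisy STFT magnitudes
        # lip_feats: (batch, time, lip_feat_dim) visual features per audio frame
        a = self.audio_net(noisy_mag)
        v = self.visual_net(lip_feats)
        fused, _ = self.rnn(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)
        return mask * noisy_mag  # enhanced magnitude estimate


if __name__ == "__main__":
    model = AVMaskEstimator()
    noisy = torch.randn(2, 100, 257).abs()  # dummy noisy magnitudes
    lips = torch.randn(2, 100, 512)         # dummy lip features
    print(model(noisy, lips).shape)         # torch.Size([2, 100, 257])
```

Training such a sketch would typically pair synthetic noisy mixtures (clean speech plus noise at chosen SNRs, as the abstract describes for GRID plus CHiME-3) with their clean magnitudes as targets.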
| Journal Article Type | Article |
| --- | --- |
| Acceptance Date | Apr 11, 2020 |
| Online Publication Date | Apr 21, 2020 |
| Publication Date | Nov 2020 |
| Deposit Date | Oct 12, 2020 |
| Journal | Information Fusion |
| Print ISSN | 1566-2535 |
| Publisher | Elsevier |
| Peer Reviewed | Peer Reviewed |
| Volume | 63 |
| Pages | 273-285 |
| DOI | https://doi.org/10.1016/j.inffus.2020.04.001 |
| Keywords | Audio-Visual, Speech enhancement, Speech separation, Deep learning, Real noisy audio-visual corpus, Speaker-independent, Noise-independent, Language-independent, Multi-modal hearing aids |
| Public URL | http://researchrepository.napier.ac.uk/Output/2692701 |