CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement

Gogate, Mandar; Dashtipour, Kia; Adeel, Ahsan; Hussain, Amir

Authors

Dr Mandar Gogate M.Gogate@napier.ac.uk
Senior Research Fellow

Dr Kia Dashtipour K.Dashtipour@napier.ac.uk
Lecturer

Ahsan Adeel

Prof Amir Hussain A.Hussain@napier.ac.uk
Professor
Abstract
Noisy situations pose significant problems for the hearing-impaired, as hearing aids often make speech more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress background noise and focus on the target speaker. In this paper, we present a novel language-, noise- and speaker-independent AV deep neural network (DNN) architecture, termed CochleaNet, for causal or real-time speech enhancement (SE). The model jointly exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. The proposed SE framework is evaluated using a first-of-its-kind AV binaural speech corpus, ASPIRE, recorded in real noisy environments, including cafeteria and restaurant settings. We demonstrate superior performance of our approach, in terms of both objective measures and subjective listening tests, over state-of-the-art SE approaches, including recent DNN-based SE models. In addition, our work challenges the popular belief that the scarcity of a multi-lingual, large-vocabulary AV corpus covering a wide variety of noises is a major bottleneck to building robust language-, speaker- and noise-independent SE systems. We show that a model trained on a synthetic mixture of the benchmark GRID corpus (with 33 speakers and a small English vocabulary) and CHiME-3 noises (comprising bus, pedestrian, cafeteria, and street noises) can generalise well, not only to large-vocabulary corpora with a wide variety of speakers and noises, but also to completely unrelated languages such as Mandarin.
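The abstract describes a causal model that fuses noisy acoustic cues with lip-derived visual cues to produce enhanced speech. The sketch below is not the published CochleaNet architecture; it is a minimal, hypothetical illustration (in PyTorch, with assumed tensor shapes, layer sizes and feature inputs) of the general idea: estimating a time-frequency mask from fused audio and visual streams using only past context, so the model remains causal.

```python
# Minimal illustrative sketch (NOT the published CochleaNet architecture):
# an ideal-ratio-mask estimator that fuses a noisy-speech spectrogram stream
# with a lip-region video stream. All shapes and layer sizes are assumptions.
import torch
import torch.nn as nn


class AVMaskEstimator(nn.Module):
    def __init__(self, n_freq_bins=257, lip_feat_dim=512, hidden=256):
        super().__init__()
        # Audio branch: frame-wise embedding of noisy STFT magnitude frames.
        self.audio_net = nn.Sequential(
            nn.Linear(n_freq_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Visual branch: embedding of precomputed lip-region features,
        # assumed to be upsampled to the audio frame rate beforehand.
        self.visual_net = nn.Sequential(
            nn.Linear(lip_feat_dim, hidden), nn.ReLU(),
        )
        # Fusion: concatenate per-frame audio and visual embeddings; a
        # unidirectional LSTM keeps the model causal (no future context).
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        # Output: a [0, 1] time-frequency mask applied to the noisy magnitude.
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag, lip_feats):
        # noisy_mag: (batch, time, n_freq_bins) noisy STFT magnitudes
        # lip_feats: (batch, time, lip_feat_dim) visual features per audio frame
        a = self.audio_net(noisy_mag)
        v = self.visual_net(lip_feats)
        fused, _ = self.rnn(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)
        return mask * noisy_mag  # enhanced magnitude estimate


if __name__ == "__main__":
    model = AVMaskEstimator()
    noisy = torch.randn(2, 100, 257).abs()  # dummy noisy magnitudes
    lips = torch.randn(2, 100, 512)         # dummy lip features
    print(model(noisy, lips).shape)         # torch.Size([2, 100, 257])
```

Training such a sketch would typically pair synthetic noisy mixtures (clean speech plus noise at chosen SNRs, as the abstract describes for GRID plus CHiME-3) with their clean magnitudes as targets.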
| Journal Article Type | Article |
| --- | --- |
| Acceptance Date | Apr 11, 2020 |
| Online Publication Date | Apr 21, 2020 |
| Publication Date | Nov 2020 |
| Deposit Date | Oct 12, 2020 |
| Journal | Information Fusion |
| Print ISSN | 1566-2535 |
| Publisher | Elsevier |
| Peer Reviewed | Peer Reviewed |
| Volume | 63 |
| Pages | 273-285 |
| DOI | https://doi.org/10.1016/j.inffus.2020.04.001 |
| Keywords | Audio-Visual, Speech enhancement, Speech separation, Deep learning, Real noisy audio-visual corpus, Speaker-independent, Noise-independent, Language-independent, Multi-modal hearing aids |
| Public URL | http://researchrepository.napier.ac.uk/Output/2692701 |