Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System.

Gogate, Mandar; Dashtipour, Kia; Hussain, Amir



In this paper, we present VIsual Speech In real nOisy eNvironments (VISION), a first-of-its-kind audio-visual (AV) corpus comprising 2500 utterances from 209 speakers, recorded in real noisy environments including social gatherings, streets, cafeterias and restaurants. Although a number of speech enhancement frameworks that exploit AV cues have been proposed in the literature, there are no visual speech corpora recorded in real environments with a sufficient variety of speakers to enable evaluation of how well AV frameworks generalise across a wide range of background visual and acoustic noises. The main purpose of our AV corpus is to foster research in AV signal processing and to provide a benchmark corpus for reliable evaluation of AV speech enhancement systems in everyday noisy settings. In addition, we present a baseline deep neural network (DNN) based spectral mask estimation model for speech enhancement. Comparative simulation results, together with subjective listening tests, demonstrate a significant performance improvement of the baseline DNN over state-of-the-art speech enhancement approaches.

Presentation Conference Type: Conference Paper (Published)
Conference Name: Interspeech 2020
Start Date: Oct 25, 2020
End Date: Oct 29, 2020
Online Publication Date: Oct 25, 2020
Publication Date: 2020
Deposit Date: Apr 26, 2022
Pages: 4521-4525
Book Title: Proc. Interspeech 2020
