Context-Aware Audio-Visual Speech Enhancement Based on Neuro-Fuzzy Modelling and User Preference Learning
Chen, Song; Kirton-Wingate, Jasper; Doctor, Faiyaz; Arshad, Usama; Dashtipour, Kia; Gogate, Mandar; Halim, Zahid; Al-Dubai, Ahmed; Arslan, Tughrul; Hussain, Amir
Authors
Jasper Kirton-Wingate J.Kirton-wingate@napier.ac.uk
Student Experience
Faiyaz Doctor
Usama Arshad
Dr Kia Dashtipour K.Dashtipour@napier.ac.uk
Lecturer
Dr Mandar Gogate M.Gogate@napier.ac.uk
Principal Research Fellow
Zahid Halim
Prof Ahmed Al-Dubai A.Al-Dubai@napier.ac.uk
Professor
Tughrul Arslan
Prof Amir Hussain A.Hussain@napier.ac.uk
Professor
Abstract
It is estimated that by 2050 approximately one in ten individuals globally will experience disabling hearing impairment. In the presence of everyday reverberant noise, a substantial proportion of individuals encounter challenges in speech comprehension. This study introduces a novel application of neuro-fuzzy modelling that fuses audio-visual speech enhancement (AV SE) with an initial user-preference-learning framework. Specifically, our approach uniquely integrates multimodal AV speech data with innovative SE methods and fuzzy inferencing techniques. This integration is further enriched by a user-preference learning model that adapts to environmental and user-specific contexts, including signal-to-noise ratio, sound power, and the quality of visual information. The proposed framework facilitates the incorporation of clinical measures, such as user cognitive load (or listening effort), together with real-world uncertainty, to steer the system outputs. We employ an adaptive fuzzy neural network to derive the most effective Sugeno fuzzy inference model, using particle swarm optimization to ensure optimal SE by considering sound power, ambient noise levels, and visual quality. Experimental results on our new benchmark AV multi-talker Challenge dataset demonstrate the superiority of our user-preference-informed, context-aware AV SE approach in enhancing speech intelligibility and quality in challenging noisy conditions, marking a significant advancement over conventional methods while reducing energy consumption. Our conclusions support the ecological scalability of the approach and its potential for real-world applications, setting a new benchmark in AV SE research and paving the way for future assistive hearing and communication technologies.
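The core inference step the abstract describes — a Sugeno (TSK) fuzzy model mapping context inputs (SNR, sound power, visual quality) to an enhancement control output via a weighted average of rule consequents — can be sketched minimally as below. The rule centres, spreads, and consequent coefficients are illustrative placeholders, not the paper's PSO-optimised parameters, and the two-rule base is purely for demonstration.

```python
import math

def gauss(x, c, s):
    """Gaussian membership of input x for centre c and spread s."""
    return math.exp(-((x - c) ** 2) / (2 * s ** 2))

# Hypothetical rule base over the three context inputs named in the abstract:
# SNR (dB), sound power (normalised 0-1), visual quality (0-1).
# Each rule pairs per-input membership params with linear consequent coefficients.
RULES = [
    # "low SNR, noisy, poor visuals" -> stronger enhancement output
    {"mf": [(-5.0, 5.0), (0.7, 0.3), (0.2, 0.3)],
     "coef": (-0.01, 0.2, 0.1, 0.9)},
    # "high SNR, quiet, good visuals" -> lighter enhancement output
    {"mf": [(15.0, 5.0), (0.3, 0.3), (0.9, 0.3)],
     "coef": (-0.02, 0.1, -0.2, 0.3)},
]

def sugeno_infer(snr, power, vq):
    """First-order Sugeno inference: firing-strength-weighted average
    of each rule's linear consequent."""
    num = den = 0.0
    for rule in RULES:
        # Firing strength: product t-norm of the three memberships.
        w = 1.0
        for x, (c, s) in zip((snr, power, vq), rule["mf"]):
            w *= gauss(x, c, s)
        a, b, c_vq, d = rule["coef"]
        y = a * snr + b * power + c_vq * vq + d  # rule output, linear in inputs
        num += w * y
        den += w
    return num / den if den > 0 else 0.0

print(round(sugeno_infer(-5.0, 0.7, 0.2), 3))  # near rule 1's centre -> ~1.11
```

In the actual system, an adaptive fuzzy neural network would learn these parameters (with particle swarm optimization tuning them), rather than the hand-set values shown here.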
Citation
Chen, S., Kirton-Wingate, J., Doctor, F., Arshad, U., Dashtipour, K., Gogate, M., Halim, Z., Al-Dubai, A., Arslan, T., & Hussain, A. (2024). Context-Aware Audio-Visual Speech Enhancement Based on Neuro-Fuzzy Modelling and User Preference Learning. IEEE Transactions on Fuzzy Systems, 32(10), 5400-5412. https://doi.org/10.1109/tfuzz.2024.3435050
| Journal Article Type | Article |
| --- | --- |
| Acceptance Date | Jul 15, 2024 |
| Online Publication Date | Aug 30, 2024 |
| Publication Date | 2024-10 |
| Deposit Date | Oct 3, 2024 |
| Publicly Available Date | Oct 3, 2024 |
| Journal | IEEE Transactions on Fuzzy Systems |
| Print ISSN | 1063-6706 |
| Publisher | Institute of Electrical and Electronics Engineers |
| Peer Reviewed | Peer Reviewed |
| Volume | 32 |
| Issue | 10 |
| Pages | 5400-5412 |
| DOI | https://doi.org/10.1109/tfuzz.2024.3435050 |
Files
Context-Aware Audio-Visual Speech Enhancement Based On Neuro-Fuzzy Modelling And User Preference Learning (accepted version)
(971 KB)
PDF