Audio-visual speech enhancement and separation by utilizing multi-modal self-supervised embeddings

Chern, I-Chun; Hung, Kuo-Hsuan; Chen, Yi-Ting; Hussain, Tassadaq; Gogate, Mandar; Hussain, Amir; Tsao, Yu; Hou, Jen-Cheng


Abstract

AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained from multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations generalize to real-world audio-visual regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by a speech enhancement (SE) module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of the proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
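The pipeline the abstract describes (pre-trained AV-HuBERT embeddings fused with the noisy input and passed to an SE module that predicts an enhancement mask) can be sketched as follows. This is a minimal NumPy illustration only: the frame/frequency dimensions, the random stand-in embeddings, and the single linear sigmoid-mask predictor are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not the paper's configuration).
T, F = 50, 257    # time frames, frequency bins of the noisy spectrogram
D_av = 768        # embedding size of the AV-HuBERT base model

noisy_mag = np.abs(rng.standard_normal((T, F)))  # |STFT| of noisy speech
av_emb = rng.standard_normal((T, D_av))          # stand-in for AV-HuBERT output

# A single linear layer as a stand-in SE module: map the fused features
# to a sigmoid mask in (0, 1) applied to the noisy magnitude spectrogram.
fused = np.concatenate([noisy_mag, av_emb], axis=-1)  # (T, F + D_av)
W = rng.standard_normal((F + D_av, F)) * 0.01
mask = 1.0 / (1.0 + np.exp(-(fused @ W)))             # sigmoid
enhanced_mag = mask * noisy_mag                       # enhanced spectrogram
```

In a real system the SE module would be a trained network and `av_emb` would come from a forward pass of the (frozen or fine-tuned) AV-HuBERT encoder over synchronized audio and lip video; the masking step shown here is one common SE formulation.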

Presentation Conference Type: Conference Paper (published)
Conference Name: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Start Date: Jun 4, 2023
End Date: Jun 10, 2023
Online Publication Date: Aug 2, 2023
Publication Date: 2023
Deposit Date: May 21, 2024
Peer Reviewed: Yes
Book Title: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
DOI: https://doi.org/10.1109/ICASSPW59220.2023.10193049