Audio-visual speech enhancement and separation by utilizing multi-modal self-supervised embeddings

Chern, I-Chun; Hung, Kuo-Hsuan; Chen, Yi-Ting; Hussain, Tassadaq; Gogate, Mandar; Hussain, Amir; Tsao, Yu; Hou, Jen-Cheng


Abstract

AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained from multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations generalize to real-world audio-visual regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by a speech enhancement (SE) module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of the proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
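The pipeline the abstract describes (pre-trained AV-HuBERT embeddings fused with the noisy input and passed to an SE module that predicts an enhancement mask) can be sketched as follows. This is a minimal NumPy illustration only: the frame/frequency dimensions, the random stand-in embeddings, and the single linear sigmoid-mask predictor are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not the paper's configuration).
T, F = 50, 257    # time frames, frequency bins of the noisy spectrogram
D_av = 768        # embedding size of the AV-HuBERT base model

noisy_mag = np.abs(rng.standard_normal((T, F)))  # |STFT| of noisy speech
av_emb = rng.standard_normal((T, D_av))          # stand-in for AV-HuBERT output

# A single linear layer as a stand-in SE module: map the fused features
# to a sigmoid mask in (0, 1) applied to the noisy magnitude spectrogram.
fused = np.concatenate([noisy_mag, av_emb], axis=-1)  # (T, F + D_av)
W = rng.standard_normal((F + D_av, F)) * 0.01
mask = 1.0 / (1.0 + np.exp(-(fused @ W)))             # sigmoid
enhanced_mag = mask * noisy_mag                       # enhanced spectrogram
```

In a real system the SE module would be a trained network and `av_emb` would come from a forward pass of the (frozen or fine-tuned) AV-HuBERT encoder over synchronized audio and lip video; the masking step shown here is one common SE formulation.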

Presentation Conference Type: Conference Paper (published)
Conference Name: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
Start Date: Jun 4, 2023
End Date: Jun 10, 2023
Online Publication Date: Aug 2, 2023
Publication Date: 2023
Deposit Date: May 21, 2024
Peer Reviewed: Yes
Book Title: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)
DOI: https://doi.org/10.1109/ICASSPW59220.2023.10193049