
MA-Net: Resource-efficient multi-attentional network for end-to-end speech enhancement

Wahab, Fazal E; Ye, Zhongfu; Saleem, Nasir; Ullah, Rizwan; Hussain, Amir


Abstract

Deep Neural Networks (DNNs) have transformed speech enhancement (SE) by modeling the complex relationships within speech signals through multi-layered hierarchical representations. However, their computational demands remain a challenging problem. Self-attention has emerged as a key technique for capturing long-range dependencies in speech signals by measuring attention between vectors through scaled dot products. Despite its widespread utility across various domains, self-attention encounters limitations when applied to SE. Specifically, its effectiveness degrades in low signal-to-noise ratio (SNR) conditions because it is sensitive to the scale of the input vectors, which is affected by factors such as low SNRs. To address these challenges, we propose a resource-efficient Multi-Attention Network (MA-Net) speech enhancement model that effectively captures local and long-range dependencies in speech signals while maintaining a low computational footprint. MA-Net integrates two fundamental modules: Spectral Temporal Hybrid Attention (STHA) and Dynamic Feedback Shuffle Attention (DFSA). The STHA module models long-range dependencies in spectral and temporal features using hybrid self-attention (HSA), which computes attention weights between query (Q) and key (K) vectors from both dot-product and cosine-similarity scores to mitigate the impact of scale variations in the input vectors, enabling more consistent and reliable attention. The DFSA module iteratively applies channel and spatial attention to dynamically refine feature representations, adjusting the weight of each iteration's output based on the input spectral features. Evaluations on two benchmark datasets (WSJ0-SI84 and VCTK+DEMAND) show that MA-Net outperforms recent models in SE performance at considerably reduced computational complexity, with 0.92 M parameters, an RTF of 0.09, and 1.32 G MACs/s. On the WSJ0-SI84 dataset, MA-Net improves PESQ, STOI, and SI-SDR by 1.26, 20.3%, and 9.76 dB over noisy mixtures, highlighting the usefulness of MA-Net in real-world SE conditions.
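The abstract's description of hybrid self-attention (HSA), blending scaled dot-product scores with cosine-similarity scores to reduce sensitivity to input scale, can be illustrated with a minimal PyTorch sketch. The mixing weight `alpha` and the additive fusion rule are assumptions for illustration only; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def hybrid_self_attention(q, k, v, alpha=0.5):
    """Minimal sketch of hybrid self-attention (HSA).

    q, k, v: (batch, seq_len, dim) tensors.
    alpha: hypothetical mixing weight between the two score types.
    """
    d = q.size(-1)
    # Scaled dot-product scores: sensitive to the magnitude of q and k.
    dot_scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5
    # Cosine-similarity scores: invariant to vector magnitude, so they
    # stay stable when low SNR changes the scale of the inputs.
    q_n = F.normalize(q, dim=-1)
    k_n = F.normalize(k, dim=-1)
    cos_scores = torch.matmul(q_n, k_n.transpose(-2, -1))
    # Blend the two score maps before the softmax (assumed fusion rule).
    scores = alpha * dot_scores + (1 - alpha) * cos_scores
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)
```

Similarly, the DFSA module's iterative channel-and-spatial refinement with input-dependent iteration weights can be sketched as below. The squeeze-and-excitation-style channel gate, the 1x1-conv spatial gate, and the softmax gating head are assumed stand-ins rather than the paper's architecture, and the channel-shuffle step is omitted.

```python
import torch
import torch.nn as nn

class DFSASketch(nn.Module):
    """Hypothetical sketch of dynamic feedback shuffle attention:
    a few rounds of channel then spatial attention, with each round's
    output weighted by a gate computed from the input features."""

    def __init__(self, channels, iterations=3):
        super().__init__()
        self.iterations = iterations
        # Channel attention: squeeze-and-excitation-style gating.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid(),
        )
        # Spatial attention: 1x1 conv producing a per-position mask.
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=1)
        # Dynamic gate: per-iteration weights derived from the input.
        self.gate = nn.Linear(channels, iterations)

    def forward(self, x):  # x: (batch, channels, freq, time)
        b, c, _, _ = x.shape
        weights = torch.softmax(self.gate(x.mean(dim=(2, 3))), dim=-1)
        out, feat = torch.zeros_like(x), x
        for i in range(self.iterations):
            ca = self.channel_fc(feat.mean(dim=(2, 3))).view(b, c, 1, 1)
            feat = feat * ca                          # channel refinement
            sa = torch.sigmoid(self.spatial_conv(feat))
            feat = feat * sa                          # spatial refinement
            out = out + weights[:, i].view(b, 1, 1, 1) * feat
        return out
```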

Citation

Wahab, F. E., Ye, Z., Saleem, N., Ullah, R., & Hussain, A. (2025). MA-Net: Resource-efficient multi-attentional network for end-to-end speech enhancement. Neurocomputing, 619, Article 129150. https://doi.org/10.1016/j.neucom.2024.129150

Journal Article Type: Article
Acceptance Date: Dec 5, 2024
Online Publication Date: Dec 12, 2024
Publication Date: Feb 2025
Deposit Date: Jan 21, 2025
Journal: Neurocomputing
Print ISSN: 0925-2312
Publisher: Elsevier
Peer Reviewed: Yes
Volume: 619
Article Number: 129150
DOI: https://doi.org/10.1016/j.neucom.2024.129150