MA-Net: Resource-efficient multi-attentional network for end-to-end speech enhancement
Wahab, Fazal E; Ye, Zhongfu; Saleem, Nasir; Ullah, Rizwan; Hussain, Amir
Abstract
Deep Neural Networks (DNNs) have transformed speech enhancement (SE) by modeling the complex relationships within speech signals through multi-layered hierarchical representations. However, their computational demands remain a challenging problem. Self-attention has emerged as a key technique for capturing long-range dependencies in speech signals by measuring attention between vectors through scaled dot products. Despite its widespread utility across various domains, self-attention encounters limitations when applied to SE. Specifically, its effectiveness degrades in low signal-to-noise ratio (SNR) conditions because the attention weights are sensitive to the scale of the input vectors. To address these challenges, we propose a resource-efficient Multi-Attention Network (MA-Net) for speech enhancement that effectively captures local and long-range dependencies in speech signals while maintaining a low computational footprint. MA-Net integrates two fundamental modules: Spectral Temporal Hybrid Attention (STHA) and Dynamic Feedback Shuffle Attention (DFSA). The STHA module models long-range dependencies in spectral and temporal features using hybrid self-attention (HSA), which computes attention weights between query (Q) and key (K) vectors from both dot-product and cosine-similarity scores, mitigating the impact of scale variations in the input vectors and yielding more consistent and reliable attention. The DFSA module iteratively applies channel and spatial attention to dynamically refine feature representations, adjusting the weight of each iteration's output based on the input spectral features. Evaluations on two benchmark datasets (WSJ0-SI84 and VCTK+DEMAND) show that MA-Net outperforms recent models in SE performance at considerably reduced computational complexity, with 0.92M parameters, an RTF of 0.09, and 1.32 GMACs/s. On the WSJ0-SI84 dataset, MA-Net improves PESQ, STOI, and SI-SDR by 1.26, 20.3%, and 9.76 dB over the noisy mixtures, highlighting the usefulness of MA-Net in real-world SE conditions.
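The hybrid self-attention idea in the abstract can be illustrated with a short sketch: attention scores blend a scaled dot product (magnitude-sensitive) with cosine similarity (magnitude-invariant), so the softmax weights vary less with the scale of the input vectors. The blending weight `alpha`, layer sizes, and module structure below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridSelfAttention(nn.Module):
    """Minimal sketch of hybrid self-attention (HSA): attention scores
    combine scaled dot-product and cosine-similarity terms, reducing
    sensitivity to the scale of the input vectors (e.g. under low SNR).
    All hyperparameters here are assumed for illustration."""

    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.alpha = alpha          # assumed blend weight between the two scores
        self.scale = dim ** -0.5    # standard dot-product scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim), e.g. spectral or temporal features
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Scaled dot-product scores: sensitive to vector magnitude.
        dot_scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale

        # Cosine-similarity scores: invariant to vector magnitude.
        q_n = F.normalize(q, dim=-1)
        k_n = F.normalize(k, dim=-1)
        cos_scores = torch.matmul(q_n, k_n.transpose(-2, -1))

        # Blend the two score matrices before the softmax.
        scores = self.alpha * dot_scores + (1.0 - self.alpha) * cos_scores
        attn = torch.softmax(scores, dim=-1)
        return torch.matmul(attn, v)

# Example: a batch of 4 utterances, 100 frames, 64-dim features.
x = torch.randn(4, 100, 64)
y = HybridSelfAttention(dim=64)(x)
print(y.shape)  # torch.Size([4, 100, 64])
```

Under this sketch, when low-SNR inputs shrink or inflate feature magnitudes, the cosine term keeps the score matrix in a stable range, which is the consistency property the abstract attributes to HSA.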
Citation
Wahab, F. E., Ye, Z., Saleem, N., Ullah, R., & Hussain, A. (2025). MA-Net: Resource-efficient multi-attentional network for end-to-end speech enhancement. Neurocomputing, 619, Article 129150. https://doi.org/10.1016/j.neucom.2024.129150
| Journal Article Type | Article |
| --- | --- |
| Acceptance Date | Dec 5, 2024 |
| Online Publication Date | Dec 12, 2024 |
| Publication Date | Feb 2025 |
| Deposit Date | Jan 21, 2025 |
| Journal | Neurocomputing |
| Print ISSN | 0925-2312 |
| Publisher | Elsevier |
| Peer Reviewed | Peer Reviewed |
| Volume | 619 |
| Article Number | 129150 |
| DOI | https://doi.org/10.1016/j.neucom.2024.129150 |