Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform
Loweimi, Erfan; Yue, Zhengjun; Bell, Peter; Renals, Steve; Cvetkovic, Zoran
Authors
Erfan Loweimi
Zhengjun Yue
Peter Bell
Steve Renals
Zoran Cvetkovic
Abstract
In this paper, we investigate multi-stream acoustic modelling using the raw real and imaginary parts of the Fourier transform of speech signals. Using the raw magnitude spectrum, or features derived from it, as a proxy for the real and imaginary parts leads to irreversible information loss and suboptimal information fusion. We discuss and quantify the importance of such information in terms of speech quality and intelligibility. In the proposed framework, the real and imaginary parts are treated as two streams of information, pre-processed via separate convolutional networks, and then combined at an optimal level of abstraction, followed by further post-processing via recurrent and fully-connected layers. We analyse the optimal level of information fusion in various architectures; the training dynamics in terms of cross-entropy loss, frame classification accuracy and WER; and the shape and properties of the filters learned in the first convolutional layer of single- and multi-stream models. We investigate the effectiveness of the proposed systems on various tasks: TIMIT/NTIMIT (phone recognition), Aurora-4 (noise robustness), WSJ (read speech), AMI (meeting) and TORGO (dysarthric speech). Across all tasks we achieved competitive performance. On Aurora-4, the average WER is as low as 4.6%; on WSJ, WERs are as low as 4.6% and 6.2% for Eval-92 and Eval-93, respectively; on AMI-IHM, as low as 23.3%/23.8% for the Dev/Eval sets; and on AMI-SDM, as low as 43.7%/47.6%. On TORGO, WERs are as low as 31.7% for dysarthric speech and 10.2% for typical speech.
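The abstract's point about information loss can be illustrated with a minimal NumPy sketch. It is not the authors' feature pipeline: the 25 ms frame, 10 ms hop, Hamming window, 512-point FFT and the helper names `frame_signal` and `real_imag_streams` are assumptions chosen for illustration, and random noise stands in for speech. The sketch extracts the real and imaginary parts as two streams, shows that together they reconstruct each windowed frame exactly, and shows that two different frames can share the same magnitude spectrum.

```python
import numpy as np

# Assumed analysis settings (not taken from the paper): 16 kHz audio,
# 25 ms frames, 10 ms hop, 512-point FFT.
FRAME_LEN, HOP, N_FFT = 400, 160, 512


def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Slice a 1-D signal into overlapping frames (no padding at the end)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def real_imag_streams(x, frame_len=FRAME_LEN, hop=HOP, n_fft=N_FFT):
    """Return the (real, imag) streams, each of shape (n_frames, n_fft // 2 + 1)."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    spec = np.fft.rfft(frames, n=n_fft, axis=-1)
    return spec.real, spec.imag


rng = np.random.default_rng(0)
x = rng.standard_normal(16000)  # stand-in for 1 s of speech at 16 kHz

real, imag = real_imag_streams(x)

# The two streams together are invertible: the windowed frames come back.
windowed = frame_signal(x) * np.hamming(FRAME_LEN)
frames_back = np.fft.irfft(real + 1j * imag, n=N_FFT, axis=-1)[:, :FRAME_LEN]
lossless = np.allclose(frames_back, windowed)

# The magnitude alone is not invertible: a zero-padded frame and its circular
# time reversal x[(-n) mod N] have conjugate spectra, so the magnitudes match
# while the imaginary parts differ in sign.
padded = np.zeros(N_FFT)
padded[:FRAME_LEN] = windowed[0]
reversed_frame = np.roll(padded[::-1], 1)
spec_a = np.fft.rfft(padded)
spec_b = np.fft.rfft(reversed_frame)
same_magnitude = np.allclose(np.abs(spec_a), np.abs(spec_b))
same_imag = np.allclose(spec_a.imag, spec_b.imag)

print(real.shape, imag.shape, lossless, same_magnitude, same_imag)
```

In a multi-stream model of the kind the abstract describes, `real` and `imag` would each feed a separate convolutional front end before fusion; the sketch stops at feature extraction and does not attempt to reproduce the networks themselves.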
Journal Article Type | Article |
---|---|
Online Publication Date | Jan 26, 2023 |
Publication Date | 2023 |
Deposit Date | Apr 3, 2024 |
Print ISSN | 2329-9290 |
Electronic ISSN | 2329-9304 |
Publisher | Institute of Electrical and Electronics Engineers |
Peer Reviewed | Peer Reviewed |
Volume | 31 |
Pages | 876-890 |
DOI | https://doi.org/10.1109/taslp.2023.3237167 |
Keywords | Raw signal representation, Fourier transform, automatic speech recognition, multi-stream acoustic modelling |
Public URL | http://researchrepository.napier.ac.uk/Output/3585794 |