
Research Repository


All Outputs (29)

Phonetic Error Analysis Beyond Phone Error Rate (2023)
Journal Article
Loweimi, E., Carmantini, A., Bell, P., Renals, S., & Cvetkovic, Z. (2023). Phonetic Error Analysis Beyond Phone Error Rate. IEEE/ACM Transactions on Audio, Speech and Language Processing, 31, 3346-3361. https://doi.org/10.1109/taslp.2023.3313417

In this article, we analyse the performance of TIMIT-based phone recognition systems beyond the overall phone error rate (PER) metric. We consider three broad phonetic classes (BPCs): {affricate, diphthong, fricative, nasal, plosive, semi-vowel, ...
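The per-class analysis the abstract describes can be sketched as follows. This is a hypothetical illustration, not the paper's method: the phone-to-BPC mapping below is a small assumed subset of the full TIMIT mapping, and real PER computation would use an edit-distance alignment rather than pre-aligned pairs.

```python
# Hypothetical illustration: a per-class error rate over broad phonetic
# classes (BPCs). The mapping is an assumed subset, not the full TIMIT one.
PHONE_TO_BPC = {
    "jh": "affricate", "ch": "affricate",
    "ay": "diphthong", "oy": "diphthong",
    "s": "fricative", "f": "fricative",
    "m": "nasal", "n": "nasal",
    "p": "plosive", "t": "plosive",
    "l": "semi-vowel", "w": "semi-vowel",
}

def per_bpc_error_rate(references, hypotheses):
    """Count substitution errors per BPC from pre-aligned (ref, hyp) pairs."""
    errors, totals = {}, {}
    for ref, hyp in zip(references, hypotheses):
        bpc = PHONE_TO_BPC.get(ref, "other")
        totals[bpc] = totals.get(bpc, 0) + 1
        if ref != hyp:
            errors[bpc] = errors.get(bpc, 0) + 1
    return {c: errors.get(c, 0) / totals[c] for c in totals}

refs = ["s", "t", "m", "ay", "s"]
hyps = ["s", "d", "m", "oy", "f"]
print(per_bpc_error_rate(refs, hyps))
```

Breaking errors down this way shows, for example, whether plosives are confused more often than nasals even when the overall PER is identical.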

Dysarthric Speech Recognition, Detection and Classification using Raw Phase and Magnitude Spectra (2023)
Conference Proceeding
Yue, Z., Loweimi, E., & Cvetkovic, Z. (2023). Dysarthric Speech Recognition, Detection and Classification using Raw Phase and Magnitude Spectra. In Proc. INTERSPEECH 2023 (1533-1537). https://doi.org/10.21437/interspeech.2023-222

In this paper, we explore the effectiveness of deploying the raw phase and magnitude spectra for dysarthric speech recognition, detection and classification. In particular, we scrutinise the usefulness of various raw phase-based representations along...

Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform (2023)
Journal Article
Loweimi, E., Yue, Z., Bell, P., Renals, S., & Cvetkovic, Z. (2023). Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform. IEEE/ACM Transactions on Audio, Speech and Language Processing, 31, 876-890. https://doi.org/10.1109/taslp.2023.3237167

In this paper, we investigate multi-stream acoustic modelling using the raw real and imaginary parts of the Fourier transform of speech signals. Using the raw magnitude spectrum, or features derived from it, as a proxy for the real and imaginary part...
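The two raw streams the abstract refers to can be extracted in a few lines. This is a minimal sketch under assumed framing parameters (25 ms frames, 10 ms hop at 16 kHz), not the paper's exact pipeline:

```python
import numpy as np

# Sketch: the raw real and imaginary parts of the short-time Fourier
# transform as two parallel feature streams, one per network branch.
def real_imag_streams(signal, frame_len=400, hop=160):
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames * np.hamming(frame_len), axis=1)
    return spec.real, spec.imag

x = np.random.randn(16000)     # 1 s of dummy audio at 16 kHz
re, im = real_imag_streams(x)
print(re.shape, im.shape)      # (98, 201) each
```

Unlike the magnitude spectrum, the pair (real, imaginary) retains the phase, so the original frames are recoverable from the two streams together.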

Acoustic Modelling From Raw Source and Filter Components for Dysarthric Speech Recognition (2022)
Journal Article
Yue, Z., Loweimi, E., Christensen, H., Barker, J., & Cvetkovic, Z. (2022). Acoustic Modelling From Raw Source and Filter Components for Dysarthric Speech Recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, 30, 2968-2980. https://doi.org/10.1109/taslp.2022.3205766

Acoustic modelling for automatic dysarthric speech recognition (ADSR) is a challenging task. Data deficiency is a major problem, and substantial differences between typical and dysarthric speech complicate transfer learning. In this paper, we aim...

Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs (2022)
Presentation / Conference
Yue, Z., Loweimi, E., Christensen, H., Barker, J., & Cvetkovic, Z. (2022, September). Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs. Paper presented at Interspeech 2022, Incheon, Korea

Raw waveform acoustic modelling has recently received increasing attention. Compared with task-blind hand-crafted features, which may discard useful information, representations directly learned from the raw waveform are task-specific and potentia...

RCT: Random consistency training for semi-supervised sound event detection (2022)
Presentation / Conference
Shao, N., Loweimi, E., & Li, X. (2022, September). RCT: Random consistency training for semi-supervised sound event detection. Paper presented at Interspeech 2022, Incheon, Korea

Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem. This paper investigates several core mod...

Raw Source and Filter Modelling for Dysarthric Speech Recognition (2022)
Conference Proceeding
Yue, Z., Loweimi, E., & Cvetkovic, Z. (2022). Raw Source and Filter Modelling for Dysarthric Speech Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp43922.2022.9746553

Acoustic modelling for automatic dysarthric speech recognition (ADSR) is a challenging task. Data deficiency is a major problem, and substantial differences between typical and dysarthric speech complicate transfer learning. In this paper, we bui...

Multi-Modal Acoustic-Articulatory Feature Fusion For Dysarthric Speech Recognition (2022)
Conference Proceeding
Yue, Z., Loweimi, E., Cvetkovic, Z., Christensen, H., & Barker, J. (2022). Multi-Modal Acoustic-Articulatory Feature Fusion For Dysarthric Speech Recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp43922.2022.9746855

Building automatic speech recognition (ASR) systems for speakers with dysarthria is a very challenging task. Although multi-modal ASR has received increasing attention recently, incorporating real articulatory data with acoustic features has not been...

Speech Acoustic Modelling Using Raw Source and Filter Components (2021)
Conference Proceeding
Loweimi, E., Cvetkovic, Z., Bell, P., & Renals, S. (2021). Speech Acoustic Modelling Using Raw Source and Filter Components. In Proc. Interspeech 2021 (276-280). https://doi.org/10.21437/interspeech.2021-53

Source-filter modelling is among the fundamental techniques in speech processing with a wide range of applications. In acoustic modelling, features such as MFCC and PLP which parametrise the filter component are widely employed. In this paper, we inv...

Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models (2021)
Conference Proceeding
Zhang, S., Loweimi, E., Bell, P., & Renals, S. (2021). Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models. In Proc. Interspeech 2021 (2541-2545). https://doi.org/10.21437/interspeech.2021-280

Recently, Transformer based models have shown competitive automatic speech recognition (ASR) performance. One key factor in the success of these models is the multi-head attention mechanism. However, for trained models, we have previously observed th...
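The idea of stochastically removing attention heads during training can be sketched as below. This is an illustrative toy only: `head_outputs` stands for the stacked per-head outputs of one multi-head attention layer, and the names, keep probability, and dropout-style rescaling are assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch: randomly zero whole attention heads during
# training and rescale the survivors, dropout-style.
def stochastic_head_removal(head_outputs, p_keep=0.75, rng=None):
    rng = rng or np.random.default_rng()
    n_heads = head_outputs.shape[0]
    mask = rng.random(n_heads) < p_keep
    if not mask.any():                       # always keep at least one head
        mask[rng.integers(n_heads)] = True
    scale = n_heads / mask.sum()
    return head_outputs * mask[:, None, None] * scale

heads = np.ones((8, 10, 64))                 # (heads, time, dim) dummy tensor
out = stochastic_head_removal(heads, p_keep=0.5,
                              rng=np.random.default_rng(0))
print(out.shape)                             # (8, 10, 64)
```

At test time all heads would be kept, so the network must learn not to rely on any single head.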

Speech Acoustic Modelling from Raw Phase Spectrum (2021)
Conference Proceeding
Loweimi, E., Cvetkovic, Z., Bell, P., & Renals, S. (2021). Speech Acoustic Modelling from Raw Phase Spectrum. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp39728.2021.9413727

Magnitude spectrum-based features are the most widely employed front-ends for acoustic modelling in automatic speech recognition (ASR) systems. In this paper, we investigate the possibility and efficacy of acoustic modelling using the raw short-time...
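For a single frame, the raw phase spectrum the abstract contrasts with magnitude-based front-ends is simply the angle of the complex STFT coefficients. A toy sketch, with an assumed 400-sample Hamming-windowed frame:

```python
import numpy as np

# Toy sketch: raw short-time phase spectrum of one windowed frame.
frame = np.random.randn(400) * np.hamming(400)
spectrum = np.fft.rfft(frame)
magnitude = np.abs(spectrum)     # conventional front-ends start here
phase = np.angle(spectrum)       # raw phase, in radians within (-pi, pi]
print(phase.shape)               # (201,)
```

The phase is wrapped to (-pi, pi], which is one reason modelling it directly is harder than modelling the smooth magnitude envelope.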

Train Your Classifier First: Cascade Neural Networks Training from Upper Layers to Lower Layers (2021)
Conference Proceeding
Zhang, S., Do, C., Doddipatla, R., Loweimi, E., Bell, P., & Renals, S. (2021). Train Your Classifier First: Cascade Neural Networks Training from Upper Layers to Lower Layers. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp39728.2021.9413565

Although the lower layers of a deep neural network learn features which are transferable across datasets, these layers are not transferable within the same dataset. That is, in general, freezing the trained feature extractor (the lower layers) and re...

On The Usefulness of Self-Attention for Automatic Speech Recognition with Transformers (2021)
Conference Proceeding
Zhang, S., Loweimi, E., Bell, P., & Renals, S. (2021). On The Usefulness of Self-Attention for Automatic Speech Recognition with Transformers. In 2021 IEEE Spoken Language Technology Workshop (SLT). https://doi.org/10.1109/slt48900.2021.9383521

Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases...

On the Robustness and Training Dynamics of Raw Waveform Models (2020)
Conference Proceeding
Loweimi, E., Bell, P., & Renals, S. (2020). On the Robustness and Training Dynamics of Raw Waveform Models. In Proc. Interspeech 2020 (1001-1005). https://doi.org/10.21437/interspeech.2020-17

We investigate the robustness and training dynamics of raw waveform acoustic models for automatic speech recognition (ASR). It is known that the first layer of such models learns a set of filters, performing a form of time-frequency analysis. This lay...

Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling (2020)
Conference Proceeding
Loweimi, E., Bell, P., & Renals, S. (2020). Raw Sign and Magnitude Spectra for Multi-Head Acoustic Modelling. In Proc. Interspeech 2020 (1644-1648). https://doi.org/10.21437/interspeech.2020-18

In this paper we investigate the usefulness of the sign spectrum and its combination with the raw magnitude spectrum in acoustic modelling for automatic speech recognition (ASR). The sign spectrum is a sequence of ±1s, capturing one bit of the phase...
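One way such a ±1 sequence can arise is sketched below, under a loud assumption: any real-valued spectral representation S factors as S = sign(S) · |S|, where sign(S) carries one bit of phase per bin. The choice of the real part of the DFT as that representation is this sketch's assumption, not necessarily the paper's front-end.

```python
import numpy as np

# Sketch under an assumed real-valued spectrum: decompose it into a
# +-1 sign spectrum and a non-negative magnitude spectrum.
frame = np.random.randn(400) * np.hamming(400)
S = np.fft.rfft(frame).real              # assumed real-valued spectrum
sign_spectrum = np.where(S >= 0, 1.0, -1.0)
magnitude_spectrum = np.abs(S)
# The two streams reconstruct S exactly:
assert np.allclose(sign_spectrum * magnitude_spectrum, S)
print(sign_spectrum[:5])
```

The magnitude stream is smooth and well suited to conventional modelling, while the sign stream isolates the single bit of phase that the decomposition preserves.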

Acoustic Model Adaptation from Raw Waveforms with Sincnet (2019)
Conference Proceeding
Fainberg, J., Klejch, O., Loweimi, E., Bell, P., & Renals, S. (2019). Acoustic Model Adaptation from Raw Waveforms with Sincnet. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). https://doi.org/10.1109/asru46091.2019.9003974

Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction, and the potential for finding better representations for a given scenario than hand-crafted features. SincNet has been proposed...

On Learning Interpretable CNNs with Parametric Modulated Kernel-Based Filters (2019)
Conference Proceeding
Loweimi, E., Bell, P., & Renals, S. (2019). On Learning Interpretable CNNs with Parametric Modulated Kernel-Based Filters. In Proc. Interspeech 2019 (3480-3484). https://doi.org/10.21437/interspeech.2019-1257

We investigate the problem of direct waveform modelling using parametric kernel-based filters in a convolutional neural network (CNN) framework, building on SincNet, a CNN employing the cardinal sine (sinc) function to implement learnable bandpass fi...
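A SincNet-style bandpass kernel is the difference of two low-pass sinc functions, so only the two cutoff frequencies are learned per filter. A sketch with illustrative choices (kernel length, Hamming window, cutoffs in cycles per sample):

```python
import numpy as np

# Sketch of a SincNet-style parametric bandpass kernel: the difference
# of two windowed ideal low-pass sinc filters. Only f_low and f_high
# would be learnable; everything else is fixed structure.
def sinc_bandpass(f_low, f_high, length=129):
    n = np.arange(length) - (length - 1) / 2
    lowpass = lambda f: 2 * f * np.sinc(2 * f * n)   # ideal low-pass at f
    return (lowpass(f_high) - lowpass(f_low)) * np.hamming(length)

kernel = sinc_bandpass(0.05, 0.15)   # ~800-2400 Hz pass band at 16 kHz
print(kernel.shape)                  # (129,)
```

Because the kernel shape is fully determined by two interpretable parameters, the learned filterbank can be read off directly, unlike a free-form CNN layer.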

Trainable Dynamic Subsampling for End-to-End Speech Recognition (2019)
Conference Proceeding
Zhang, S., Loweimi, E., Xu, Y., Bell, P., & Renals, S. (2019). Trainable Dynamic Subsampling for End-to-End Speech Recognition. In Proc. Interspeech 2019 (1413-1417). https://doi.org/10.21437/interspeech.2019-2778

Jointly optimised attention-based encoder-decoder models have yielded impressive speech recognition results. The recurrent neural network (RNN) encoder is a key component in such models — it learns the hidden representations of the inputs. However, i...

Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition (2019)
Conference Proceeding
Jalal, M. A., Loweimi, E., Moore, R. K., & Hain, T. (2019). Learning Temporal Clusters Using Capsule Routing for Speech Emotion Recognition. In Proc. Interspeech 2019 (1701-1705). https://doi.org/10.21437/interspeech.2019-3068

Emotion recognition from speech plays a significant role in adding emotional intelligence to machines and making human-machine interaction more natural. One of the key challenges from a machine learning standpoint is to extract patterns which bear maxi...

On the Usefulness of Statistical Normalisation of Bottleneck Features for Speech Recognition (2019)
Conference Proceeding
Loweimi, E., Bell, P., & Renals, S. (2019). On the Usefulness of Statistical Normalisation of Bottleneck Features for Speech Recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp.2019.8683330

DNNs play a major role in state-of-the-art ASR systems. They can be used for extracting features and building probabilistic models for acoustic and language modelling. Despite their huge practical success, the level of theoretical understanding h...
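The simplest instance of the statistical normalisation this entry studies is per-dimension mean-variance normalisation of the bottleneck features; the paper investigates normalisation more broadly, so the sketch below is only an assumed minimal example:

```python
import numpy as np

# Minimal example of statistical normalisation: per-dimension
# mean-variance normalisation of bottleneck features.
def mvn(features, eps=1e-8):
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# 200 frames of 40-dimensional dummy bottleneck features.
bn = np.random.default_rng(0).normal(3.0, 2.0, size=(200, 40))
norm = mvn(bn)
print(norm.mean(axis=0)[:3], norm.std(axis=0)[:3])
```

After the transform each feature dimension has zero mean and unit variance, giving the downstream model inputs on a consistent scale.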