Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network

Published: 2023-05-11 Issue: 5 Volume: 18 Page: e0285629
ISSN: 1932-6203
Container-title: PLOS ONE
Language: en
Short-container-title: PLoS ONE

Author:

Peracha Fahad Khalil, Khattak Muhammad Irfan, Salem Nema, Saleem Nasir

Abstract

Speech enhancement (SE) reduces background noise in target speech and is applied at the front end of many real-world applications, including robust automatic speech recognition (ASR) and real-time mobile phone communication. SE systems are commonly integrated into mobile phones to improve speech quality and intelligibility, so they must operate with low latency; they also need efficient optimization. This research focuses on single-microphone SE for real-time systems with improved optimization. We propose a causal data-driven model that uses an attention encoder-decoder long short-term memory (LSTM) network with a causal attention mechanism to estimate a time-frequency mask from noisy speech and recover clean speech, targeting real-time applications that require low-latency causal processing. Furthermore, a dynamical-weighted (DW) loss function is proposed to improve model learning by varying the loss weights. Experiments demonstrate that the proposed model consistently improves voice quality, intelligibility, and noise suppression. In the causal processing mode, the LSTM-estimated time-frequency suppression mask outperforms the baseline model for unseen noise types. The proposed SE improved short-time objective intelligibility (STOI) by 2.64% over the baseline LSTM-IRM, 6.6% over LSTM-KF, 4.18% over DeepXi-KF, and 3.58% over DeepResGRU-KF. In addition, we examine word error rates (WERs) using Google’s ASR. The ASR results show that error rates decreased from 46.33% (noisy signals) to 13.11% (proposed), 15.73% (LSTM), and 14.97% (LSTM-KF).
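The two mechanisms named in the abstract, causal attention and time-frequency masking, can be illustrated with a minimal NumPy sketch. This is not the authors' model: the function names, the use of scaled dot-product attention, and the choice of the ideal ratio mask (IRM) as an example target are all assumptions made for illustration. The abstract names an LSTM-IRM baseline but does not specify the attention form or the exact mask used by the proposed system.

```python
import numpy as np

def causal_attention(queries, keys, values):
    """Scaled dot-product attention in which frame t sees only frames <= t.

    Hypothetical helper: the abstract does not state the attention form.
    All inputs have shape (n_frames, dim).
    """
    n_frames, dim = queries.shape
    scores = queries @ keys.T / np.sqrt(dim)              # (n_frames, n_frames)
    # Mark strictly-future positions (above the diagonal) and block them.
    future = np.triu(np.ones((n_frames, n_frames), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax; blocked entries become exactly 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """IRM computed from clean and noise magnitudes (a common training target)."""
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

def apply_mask(noisy_mag, mask):
    """Scale each time-frequency bin of the noisy magnitude by a mask in [0, 1]."""
    return np.clip(mask, 0.0, 1.0) * noisy_mag
```

Because the attention weights above the diagonal are zero, the output for frame t never depends on later frames, which is the property that allows frame-by-frame, low-latency processing. In a full system the mask would be predicted by the network from noisy features, and the enhanced magnitude would be combined with the noisy phase before inverting the transform; those steps are omitted here.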

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

Cited by 4 articles.

1. Speech Enhancement with Background Noise Suppression in Various Data Corpus Using Bi-LSTM Algorithm;International Journal of Electrical and Electronics Research;2024-03-28

2. A ChannelWise weighting technique of slice-based Temporal Convolutional Network for noisy speech enhancement;Computer Speech & Language;2024-03

3. Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network;PLOS ONE;2024-01-03

4. Speech Enhancement Using Dynamic Learning in Knowledge Distillation via Reinforcement Learning;IEEE Access;2023