Using Data Augmentation and Time-Scale Modification to Improve ASR of Children’s Speech in Noisy Environments

Published: 2021-09-10 Issue: 18 Volume: 11 Page: 8420
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Kathania Hemant Kumar, Kadiri Sudarsana Reddy, Alku Paavo, Kurimo Mikko

Abstract

Current ASR systems perform poorly in recognizing children’s speech in noisy environments because recognizers are typically trained on clean adult speech, which creates two mismatches between the training and testing phases (clean speech in training vs. noisy speech in testing, and adult speech in training vs. child speech in testing). This article studies methods to tackle the effects of these two mismatches in the recognition of noisy children’s speech by investigating two techniques: data augmentation and time-scale modification. In the former, clean training data of adult speakers are corrupted with additive noise in order to obtain training data that better correspond to the noisy testing conditions. In the latter, the fundamental frequency (F0) and speaking rate of children’s speech are modified in the testing phase in order to reduce differences in the prosodic characteristics between the testing data of child speakers and the training data of adult speakers. A standard ASR system based on DNN–HMM was built, and the effects of data augmentation, F0 modification, and speaking rate modification on word error rate (WER) were evaluated first separately and then by combining all three techniques. The experiments were conducted using children’s speech corrupted with additive noise of four different noise types in four different signal-to-noise ratio (SNR) categories. The results show that the combination of all three techniques yielded the best ASR performance. As an example, the WER value averaged over all four noise types in the SNR category of 5 dB dropped from 32.30% to 12.09% when the baseline system, in which no data augmentation or time-scale modification was used, was replaced with a recognizer that was built using a combination of all three techniques.
In summary, in recognizing noisy children’s speech with ASR systems trained with clean adult speech, considerable improvements in the recognition performance can be achieved by combining data augmentation based on noise addition in the system training phase and time-scale modification based on modifying F0 and speaking rate of children’s speech in the testing phase.
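
The noise-based data augmentation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the noise was scaled or mixed, so the function names (`power`, `mix_at_snr`, `relative_reduction`), the synthetic test signals, and the 16 kHz sample rate below are assumptions for illustration only, not the authors' procedure. The sketch scales a noise signal so that adding it to clean speech gives a requested SNR, and computes a relative WER reduction from two WER values.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has the requested SNR in dB.

    Assumes both inputs are equal-length sequences of floats and that the
    noise is not all zeros (hypothetical helper, not from the paper).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must contain non-zero samples")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(power(speech) / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for real recordings: a 200 Hz tone and Gaussian noise,
    # one second each at an assumed 16 kHz sample rate.
    speech = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(16000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
    noisy = mix_at_snr(speech, noise, snr_db=5.0)

    # Check the achieved SNR by recovering the added noise component.
    added = [y - s for y, s in zip(noisy, speech)]
    print(round(10 * math.log10(power(speech) / power(added)), 2))

    # WER values quoted in the abstract for the 5 dB SNR category.
    print(round(relative_reduction(32.30, 12.09), 1))
```

With the two WER values quoted in the abstract, the drop from 32.30% to 12.09% corresponds to a relative reduction of roughly 62.6%. A real pipeline would read recorded speech and noise files and would typically loop or crop the noise to match the utterance length; those steps are left out here.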

Funder

Academy of Finland

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/11/18/8420/pdf

References: 53 articles.

1. Your Word is my Command: Google Search by Voice: A Case Study;Schalkwyk,2010

2. An Overview of Noise-Robust Automatic Speech Recognition

3. Global noise score indicator for classroom evaluation of acoustic performances in LIFE GIOCONDA project

4. Noise Exposure in Preterm Infants Treated with Respiratory Support Using Neonatal Helmets

5. Annoyance Judgment and Measurements of Environmental Noise: A Focus on Italian Secondary Schools

Cited by 6 articles.

1. Digit Classification System for Normal and Pathological Speech;2024 International Conference on Smart Systems for applications in Electrical Sciences (ICSSES);2024-05-03

2. ChildAugment: Data augmentation methods for zero-resource children's speaker verification;The Journal of the Acoustical Society of America;2024-03-01

3. Deep Learning-Based Automatic Speaker Recognition Using Self-Organized Feature Mapping;Lecture Notes in Electrical Engineering;2023-12-02

4. Data Augmentation and Deep Learning Methods in Sound Classification: A Systematic Review;Electronics;2022-11-18

5. Audio Augmentation for Non-Native Children’s Speech Recognition through Discriminative Learning;Entropy;2022-10-19