Affiliation:
1. Department of Artificial Intelligence Convergence, Chuncheon 24252, Republic of Korea
Abstract
Speech emotion recognition (SER) extracts emotion-related features from speech signals, analyzes the resulting characteristic parameters, and infers the speaker's emotional state. SER is currently an important topic in artificial psychology and artificial intelligence, with wide application in human–computer interfaces and in the medical and entertainment fields. In this work, six transforms, namely, the synchrosqueezing transform, fractional Stockwell transform (FST), K-sine transform-dependent integrated system (KSTDIS), flexible analytic wavelet transform (FAWT), chirplet transform, and superlet transform, are first applied to the speech emotion signals. Once the transforms have been applied and the features extracted, the essential features are selected using three techniques: the Overlapping Information Feature Selection (OIFS) technique and two biomimetic, intelligence-based optimization techniques, namely, Harris Hawks Optimization (HHO) and the Chameleon Swarm Algorithm (CSA). The selected features are then classified with ten standard machine learning classifiers, with particular emphasis on the extreme learning machine (ELM) and twin extreme learning machine (TELM). Experiments are conducted on four publicly available datasets: EMOVO, RAVDESS, SAVEE, and Berlin Emo-DB. The best results are as follows: Chirplet + CSA + TELM achieves a classification accuracy of 80.63% on EMOVO, FAWT + HHO + TELM achieves 85.76% on RAVDESS, Chirplet + OIFS + TELM achieves 83.94% on SAVEE, and KSTDIS + CSA + TELM achieves 89.77% on Berlin Emo-DB.
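As a rough illustration of the final classification stage, the following is a minimal NumPy sketch of a plain extreme learning machine: the hidden-layer weights are drawn at random and fixed, and only the output weights are solved in closed form via a pseudoinverse. The feature matrix, labels, hidden-layer width, and seeds are illustrative stand-ins for the transform-derived, selection-filtered features used in the paper, not the authors' actual configuration; the TELM variant, which instead learns two non-parallel separating hyperplanes, is not shown.

```python
import numpy as np

class ELM:
    """Basic extreme learning machine (single hidden layer).

    Hidden-layer weights are random and never trained; only the
    output weights (beta) are fitted, in closed form, with the
    Moore-Penrose pseudoinverse of the hidden-layer output matrix.
    """

    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the fixed random projection.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_features = X.shape[1]
        n_classes = int(y.max()) + 1
        # Random input weights and biases, fixed for the model's lifetime.
        self.W = self.rng.standard_normal((n_features, self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)                # hidden-layer output matrix
        T = np.eye(n_classes)[y]           # one-hot targets
        self.beta = np.linalg.pinv(H) @ T  # closed-form least-squares fit
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# Toy usage: 120 samples of 40-dimensional features, 4 emotion classes.
# (Random placeholders for real transform-plus-selection features.)
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 40))
y = rng.integers(0, 4, size=120)
clf = ELM(n_hidden=100).fit(X, y)
print("training accuracy on toy data:", (clf.predict(X) == y).mean())
```

Because fitting reduces to one matrix pseudoinverse rather than iterative gradient descent, training is fast, which is one common reason ELM-family classifiers are paired with heavier feature-extraction front ends like the transforms above.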
Funder
National Research Foundation of Korea
Bio&Medical Technology Development Program