Affiliation:
1. Department of Artificial Intelligence Convergence, Chuncheon 24252, Republic of Korea
Abstract
Speech emotion recognition (SER) extracts emotion-related features from speech signals, analyzes the resulting characteristic parameters, and infers the speaker's emotional state. SER is currently an important topic in artificial psychology and artificial intelligence, with wide application in human–computer interfaces and in the medical and entertainment fields. In this work, six transforms, namely, the synchrosqueezing transform, fractional Stockwell transform (FST), K-sine transform-dependent integrated system (KSTDIS), flexible analytic wavelet transform (FAWT), chirplet transform, and superlet transform, are first applied to the speech emotion signals. Once the transforms have been applied and the features extracted, the essential features are selected using three techniques: the Overlapping Information Feature Selection (OIFS) technique and two biomimetic, intelligence-based optimization techniques, namely, Harris Hawks Optimization (HHO) and the Chameleon Swarm Algorithm (CSA). The selected features are then classified with ten standard machine learning classifiers, with particular emphasis on the extreme learning machine (ELM) and twin extreme learning machine (TELM). Experiments are conducted on four publicly available datasets: EMOVO, RAVDESS, SAVEE, and Berlin Emo-DB. The best results are as follows: Chirplet + CSA + TELM achieves a classification accuracy of 80.63% on EMOVO, FAWT + HHO + TELM achieves 85.76% on RAVDESS, Chirplet + OIFS + TELM achieves 83.94% on SAVEE, and KSTDIS + CSA + TELM achieves 89.77% on Berlin Emo-DB.
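As a rough illustration of the final classification stage, the following is a minimal NumPy sketch of a plain extreme learning machine: the hidden-layer weights are drawn at random and fixed, and only the output weights are solved in closed form via a pseudoinverse. The feature matrix, labels, hidden-layer width, and seeds are illustrative stand-ins for the transform-derived, selection-filtered features used in the paper, not the authors' actual configuration; the TELM variant, which instead learns two non-parallel separating hyperplanes, is not shown.

```python
import numpy as np

class ELM:
    """Basic extreme learning machine (single hidden layer).

    Hidden-layer weights are random and never trained; only the
    output weights (beta) are fitted, in closed form, with the
    Moore-Penrose pseudoinverse of the hidden-layer output matrix.
    """

    def __init__(self, n_hidden=200, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the fixed random projection.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_features = X.shape[1]
        n_classes = int(y.max()) + 1
        # Random input weights and biases, fixed for the model's lifetime.
        self.W = self.rng.standard_normal((n_features, self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)                # hidden-layer output matrix
        T = np.eye(n_classes)[y]           # one-hot targets
        self.beta = np.linalg.pinv(H) @ T  # closed-form least-squares fit
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# Toy usage: 120 samples of 40-dimensional features, 4 emotion classes.
# (Random placeholders for real transform-plus-selection features.)
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 40))
y = rng.integers(0, 4, size=120)
clf = ELM(n_hidden=100).fit(X, y)
print("training accuracy on toy data:", (clf.predict(X) == y).mean())
```

Because fitting reduces to one matrix pseudoinverse rather than iterative gradient descent, training is fast, which is one common reason ELM-family classifiers are paired with heavier feature-extraction front ends like the transforms above.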
Funder
National Research Foundation of Korea
Bio&Medical Technology Development Program