Affiliation:
1. Centre for Data Analytics and Cognition, La Trobe Business School, La Trobe University, Melbourne, VIC 3083, Australia
2. Centre of Research Excellence in Aphasia Recovery and Rehabilitation, La Trobe University, Melbourne, VIC 3083, Australia
Abstract
In today’s fast-paced and interconnected world, where human–computer interaction is an integral component of daily life, the ability to recognize and understand human emotions has emerged as a crucial facet of technological advancement. However, human emotion, a complex interplay of physiological, psychological, and social factors, poses a formidable challenge even for other humans to comprehend accurately. With the emergence of voice assistants and other speech-based applications, it has become essential to improve audio-based emotion recognition. However, current emotion annotation practice lacks specificity and agreement, as evidenced by conflicting labels assigned to the same speech segments in many human-annotated emotional datasets. Previous studies have had to filter out these conflicts, and a large portion of the collected data has therefore been treated as unusable. In this study, we aimed to improve the accuracy of computational prediction for uncertain emotion labels by utilizing high-confidence emotion-labelled speech segments from the IEMOCAP emotion dataset. We implemented an audio-based emotion recognition model that uses bag-of-audio-words (BoAW) encoding to represent the acoustic aspects of emotion in speech, combined with state-of-the-art recurrent neural network models. Our approach improved on the state of the art in audio-based emotion recognition with a 61.09% accuracy rate, an improvement of 1.02% over the BiDialogueRNN model and 1.72% over the EmoCaps multi-modal emotion recognition model. In comparison to human annotation, our approach achieved similar results in identifying positive and negative emotions. Furthermore, it proved effective in accurately recognizing the sentiment of uncertain emotion segments that previous studies had considered unusable. Improvements in audio emotion recognition could have implications for voice-based assistants, healthcare, and other industrial applications that benefit from automated communication.
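To make the described pipeline concrete, the following is a minimal sketch of bag-of-audio-words (BoAW) encoding feeding a recurrent classifier. It assumes frame-level acoustic features (e.g., MFCCs) have already been extracted per utterance; the codebook size, feature dimensionality, and the GRU architecture are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: BoAW encoding + recurrent emotion classifier.
# All sizes and the model layout below are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans
import torch
import torch.nn as nn

CODEBOOK_SIZE = 64   # assumed number of "audio words"
FEAT_DIM = 13        # assumed frame-level feature (e.g., MFCC) dimensionality

def learn_codebook(frame_features: np.ndarray) -> KMeans:
    """Cluster pooled frame-level features into a codebook of audio words."""
    return KMeans(n_clusters=CODEBOOK_SIZE, n_init=10, random_state=0).fit(frame_features)

def boaw_histogram(utterance_frames: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Quantize each frame to its nearest codeword and build a normalized histogram."""
    words = codebook.predict(utterance_frames)
    hist = np.bincount(words, minlength=CODEBOOK_SIZE).astype(np.float32)
    return hist / max(hist.sum(), 1.0)

class UtteranceEmotionGRU(nn.Module):
    """Toy bidirectional GRU over a sequence of per-utterance BoAW vectors."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.rnn = nn.GRU(CODEBOOK_SIZE, 128, batch_first=True, bidirectional=True)
        self.out = nn.Linear(256, n_classes)

    def forward(self, boaw_seq: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(boaw_seq)   # (batch, utterances, 256)
        return self.out(h)          # per-utterance emotion logits

# Usage with random stand-in data (real inputs would be feature frames from IEMOCAP):
pooled = np.random.rand(5000, FEAT_DIM)          # frames pooled across the training set
codebook = learn_codebook(pooled)
utterance = np.random.rand(300, FEAT_DIM)        # one utterance's frames
vec = boaw_histogram(utterance, codebook)
logits = UtteranceEmotionGRU()(torch.tensor(vec).view(1, 1, -1))
```

The design choice illustrated here is that BoAW compresses a variable-length utterance into a fixed-length histogram, which a recurrent model can then consume across the utterances of a dialogue.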
Funder
National Health and Medical Research Council Ideas