Abstract:
Speech emotion recognition (SER) has gained pivotal attention for various applications in human-computer interaction and affective computing. In recent years, there has been growing interest in developing robust and accurate systems for identifying emotions from speech utterances. In this work, a novel approach based on the Wav2Vec2 architecture is used to demonstrate SER system performance. The Wav2Vec2 model, which implements a contrastive learning objective during its pre-training stage, extracts speech features from utterances; these features are fed to a feed-forward network to identify emotions on two datasets, namely the Toronto Emotional Speech Set (TESS) and the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D). On CREMA-D, the system achieved an accuracy of 76%, with a weighted F1 score, precision, and recall of 0.76, 0.77, and 0.77, respectively. On TESS, it achieved an accuracy of 99%, with a weighted F1 score, precision, and recall all equal to 0.99.
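The pipeline described above (Wav2Vec2 as a feature extractor feeding a feed-forward classifier head) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `EmotionClassifier`, the hidden size of 256, mean-pooling over time, and the six-class label count (matching CREMA-D's anger, disgust, fear, happy, neutral, sad) are all assumptions, and a real system would load pretrained Wav2Vec2 weights rather than a randomly initialized encoder.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Config, Wav2Vec2Model

NUM_EMOTIONS = 6  # assumed: CREMA-D's six emotion categories


class EmotionClassifier(nn.Module):
    """Hypothetical sketch: Wav2Vec2 encoder + feed-forward emotion head."""

    def __init__(self, num_emotions: int = NUM_EMOTIONS):
        super().__init__()
        # Randomly initialized for a self-contained example; in practice one
        # would use Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base"),
        # whose contrastive pre-training objective is what the paper leverages.
        self.encoder = Wav2Vec2Model(Wav2Vec2Config())
        self.head = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, 256),  # 768 -> 256
            nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: raw waveform batch, shape (batch, samples) at 16 kHz
        hidden = self.encoder(input_values).last_hidden_state  # (B, T, 768)
        pooled = hidden.mean(dim=1)  # mean-pool frame features over time
        return self.head(pooled)     # emotion logits, shape (B, num_emotions)


model = EmotionClassifier()
waveform = torch.randn(1, 16000)  # one second of dummy audio at 16 kHz
logits = model(waveform)
```

Mean-pooling the frame-level encoder outputs into one utterance-level vector is one common design choice; the paper does not specify its pooling strategy here.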
Date of Conference: 04-06 December 2023
Date Added to IEEE Xplore: 02 April 2024