ABSTRACT
At present, machine understanding of speech focuses largely on semantics, yet speech also carries emotion, which can reinforce or even change the semantic content of an utterance. This paper presents SeeSpeech, a method for classifying emotion in speech. SeeSpeech uses mel-cepstral (MCEP) features as the speech emotion representation and feeds them into a CNN and a Transformer in parallel. To obtain richer features, the CNN branch applies batch normalization while the Transformer branch applies layer normalization; the outputs of the two branches are then combined, and a SoftMax layer produces the emotion class. SeeSpeech achieves a peak classification accuracy of 97% on the RAVDESS dataset and 85% in a real-world test on an edge gateway. These results suggest that SeeSpeech performs encouragingly on speech emotion classification and has broad application prospects in human-computer interaction.
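The abstract describes a two-branch architecture: MCEP features processed in parallel by a CNN (with batch normalization) and a Transformer (with layer normalization), with the branch outputs fused and classified via SoftMax. The sketch below illustrates one plausible reading of that design in PyTorch; all layer sizes, the number of layers, the time-pooling strategy, and the fusion-by-concatenation step are illustrative assumptions not specified in the abstract.

```python
# A minimal sketch of the two-branch SeeSpeech-style architecture, assuming
# concatenation for fusion and mean/average pooling over time. Hyperparameters
# (d_model, kernel sizes, head counts, layer counts) are hypothetical.
import torch
import torch.nn as nn

class SeeSpeechSketch(nn.Module):
    def __init__(self, n_mcep=40, n_emotions=8, d_model=128):
        super().__init__()
        # CNN branch: 1-D convolutions over time with batch normalization.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mcep, d_model, kernel_size=5, padding=2),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.BatchNorm1d(d_model),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> (batch, d_model, 1)
        )
        # Transformer branch: encoder layers use layer normalization internally.
        self.proj = nn.Linear(n_mcep, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Fusion by concatenation, then a linear classifier; SoftMax is applied
        # implicitly by nn.CrossEntropyLoss in training, explicitly at inference.
        self.classifier = nn.Linear(2 * d_model, n_emotions)

    def forward(self, x):
        # x: (batch, frames, n_mcep) MCEP feature sequence
        cnn_out = self.cnn(x.transpose(1, 2)).squeeze(-1)        # (batch, d_model)
        trans_out = self.transformer(self.proj(x)).mean(dim=1)   # (batch, d_model)
        fused = torch.cat([cnn_out, trans_out], dim=-1)
        return self.classifier(fused)  # logits over emotion classes

# Usage example: a batch of 16 utterances, 200 frames, 40 MCEP coefficients.
model = SeeSpeechSketch()
logits = model(torch.randn(16, 200, 40))
probs = torch.softmax(logits, dim=-1)  # per-emotion probabilities
```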