Abstract
The human voice, especially through nonspeech vocalizations, inherently conveys emotion. However, such emotional expressions have long been overlooked by existing work. Motivated by this, we propose a Dual-Channel Recurrent Neural Network with Xgboost (DCRNNX) for emotion recognition from nonspeech vocalizations. DCRNNX combines two backbone models: the first is a dual-channel neural network, with one channel built on a Deep Neural Network (DNN) and the other on a Channel Recurrent Neural Network (CRNN); the second is an Xgboost classifier. Additionally, we employ a smoothing mechanism that integrates the outputs of the two classifiers. Compared with the baselines, DCRNNX combines not only multiple features but also multiple models, which strengthens its generalization. Experimental results show that the two backbone models achieve 45% and 42% UAR (Unweighted Average Recall), respectively, on the development set. After model fusion, DCRNNX achieves 46.89% UAR on the development set and 37.0% UAR on the test set; its performance on the development set is nearly 6% better than the baselines. Notably, there is a considerable gap between DCRNNX's performance on the development and test sets, which may stem from differences in the emotional characteristics of male and female voices.
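The two ideas the abstract leans on can be made concrete with a small sketch. The paper does not specify its smoothing mechanism, so the `fuse` function below simply takes a convex combination of the two classifiers' class-probability outputs with an assumed weight `alpha`; the `uar` function implements Unweighted Average Recall as the mean of per-class recalls (macro recall), which is its standard definition. All names and the toy probabilities are illustrative, not from the paper.

```python
import numpy as np

def uar(y_true, y_pred, n_classes):
    """Unweighted Average Recall: mean of per-class recalls (macro recall)."""
    recalls = []
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.sum() == 0:      # skip classes absent from y_true
            continue
        recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))

def fuse(p_net, p_xgb, alpha=0.5):
    """Convex combination of two classifiers' class-probability outputs.

    `alpha` is an assumed fusion weight; the paper's actual smoothing
    mechanism is not described in the abstract.
    """
    return alpha * p_net + (1.0 - alpha) * p_xgb

# Toy example: 3 emotion classes, 4 samples.
p_net = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.1, 0.2, 0.7],
                  [0.2, 0.5, 0.3]])
p_xgb = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.3, 0.3, 0.4],
                  [0.4, 0.4, 0.2]])
y_true = np.array([0, 1, 2, 1])
y_pred = fuse(p_net, p_xgb).argmax(axis=1)
print(uar(y_true, y_pred, n_classes=3))
```

Because UAR weights every class equally regardless of its sample count, it is the usual choice for imbalanced paralinguistic datasets such as this one.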
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liang, X., Zou, Y., Xie, T., Zhou, Q. (2022). DCRNNX: Dual-Channel Recurrent Neural Network with Xgboost for Emotion Identification Using Nonspeech Vocalizations. In: Pan, X., Jin, T., Zhang, LJ. (eds) Artificial Intelligence and Mobile Services – AIMS 2022. AIMS 2022. Lecture Notes in Computer Science, vol 13729. Springer, Cham. https://doi.org/10.1007/978-3-031-23504-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23503-0
Online ISBN: 978-3-031-23504-7
eBook Packages: Computer Science (R0)