
DCRNNX: Dual-Channel Recurrent Neural Network with Xgboost for Emotion Identification Using Nonspeech Vocalizations

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13729)

Abstract

The human voice, and especially its nonspeech vocalizations, inherently conveys emotion, yet existing work has long ignored such emotional expressions. Motivated by this, we propose a Dual-Channel Recurrent Neural Network with Xgboost (DCRNNX) for emotion recognition from nonspeech vocalizations. DCRNNX combines two backbone models: the first is a two-channel neural network whose channels are built on a Deep Neural Network (DNN) and a Convolutional Recurrent Neural Network (CRNN), and the second is an Xgboost classifier. Additionally, we employ a smoothing mechanism that integrates the outputs of the two classifiers to further improve DCRNNX. Compared with the baselines, DCRNNX combines not only multiple features but also multiple models, which strengthens its generalization. Experimental results show that the two backbone models achieve 45% and 42% UAR (Unweighted Average Recall) on the development set. After model fusion, DCRNNX achieves 46.89% UAR on the development set and 37.0% UAR on the test set; its performance on the development set is nearly 6% better than the baselines. Notably, there is a considerable gap between the performance of DCRNNX on the development and test sets, which may be due to differences in the emotional characteristics of male and female voices.
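The abstract does not include an implementation, but the late-fusion idea it describes (a smoothed combination of two classifiers' outputs, scored with UAR) can be illustrated. The sketch below is hypothetical: the function names, the weight alpha, the feature shapes, the XGBClassifier parameters, and the placeholder network outputs are all illustrative choices, not the actual DCRNNX configuration. The only fixed fact is the metric: UAR is the unweighted (macro) average of per-class recall.

```python
# Hypothetical late-fusion sketch: combine a neural channel and an
# Xgboost channel via a weighted average of class probabilities,
# then score the fused prediction with UAR (macro-averaged recall).
# Weights, shapes, and model parameters are NOT from the paper.
import numpy as np
from sklearn.metrics import recall_score
from xgboost import XGBClassifier

def fuse_probabilities(p_nn, p_xgb, alpha=0.5):
    """Smoothed (weighted-average) fusion of two classifiers' outputs."""
    return alpha * p_nn + (1.0 - alpha) * p_xgb

def uar(y_true, y_pred):
    """Unweighted Average Recall = macro-averaged per-class recall."""
    return recall_score(y_true, y_pred, average="macro")

# Synthetic stand-ins: X_dev holds acoustic features, y_dev emotion labels.
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(200, 64))
y_dev = rng.integers(0, 6, size=200)

# Xgboost channel: trained on utterance-level features.
xgb = XGBClassifier(n_estimators=100, max_depth=4, eval_metric="mlogloss")
xgb.fit(X_dev, y_dev)
p_xgb = xgb.predict_proba(X_dev)

# Placeholder for the dual-channel network's class probabilities.
p_nn = rng.dirichlet(np.ones(6), size=200)

p_fused = fuse_probabilities(p_nn, p_xgb, alpha=0.5)
y_pred = p_fused.argmax(axis=1)
print(f"dev UAR: {uar(y_dev, y_pred):.4f}")
```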


Notes

1. https://github.com/openXBOW/openXBOW.
2. https://github.com/audeering/opensmile.
3. https://github.com/dmlc/xgboost.
4. https://github.com/librosa/librosa.
5. https://github.com/YannickJadoul/Parselmouth.
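The footnotes above point to the feature-extraction and modelling toolkits referenced in the paper (openXBOW, openSMILE, xgboost, librosa, Parselmouth). As a minimal sketch of one such pipeline step, assuming a log-mel feature representation that the page itself does not spell out, the snippet below uses librosa to turn a vocalization clip into a fixed-size utterance-level vector of the kind that could feed the Xgboost channel.

```python
# Minimal librosa feature-extraction sketch (hypothetical parameters):
# load a clip, compute a log-mel spectrogram, and pool it over time
# into a fixed-size vector suitable for a classifier such as Xgboost.
import numpy as np
import librosa

def logmel_features(path, sr=16000, n_mels=64):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)          # shape: (n_mels, n_frames)
    # Mean and std over time yield a 2 * n_mels utterance-level vector.
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

# feats = logmel_features("vocalization.wav")  # shape: (128,)
```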


Author information

Corresponding author

Correspondence to Xingwei Liang.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liang, X., Zou, Y., Xie, T., Zhou, Q. (2022). DCRNNX: Dual-Channel Recurrent Neural Network with Xgboost for Emotion Identification Using Nonspeech Vocalizations. In: Pan, X., Jin, T., Zhang, LJ. (eds) Artificial Intelligence and Mobile Services – AIMS 2022. AIMS 2022. Lecture Notes in Computer Science, vol 13729. Springer, Cham. https://doi.org/10.1007/978-3-031-23504-7_2


  • DOI: https://doi.org/10.1007/978-3-031-23504-7_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23503-0

  • Online ISBN: 978-3-031-23504-7

  • eBook Packages: Computer Science, Computer Science (R0)
