
An optimal model with a lower bound of recall for imbalanced speech emotion recognition

Multimedia Tools and Applications

Abstract

In an early complaint warning system, we encounter a common problem: angry-emotion samples are scarce, so the data for training classification models are imbalanced. Moreover, recognizing angry emotion matters more than recognizing non-angry emotion. The main purpose of this paper is therefore to train an optimal model that keeps recall above a lower bound while maximizing the F1 score. The work has three parts: 1) a variant of the F1 score (the TF1 score) that takes both a recall lower bound and the F1 score into consideration; 2) a Single Emotion Deep Neural Network (SEDNN) and a training process designed to find the model with the maximum TF1 score; and 3) a performance comparison of different methods on the IEMOCAP and Emo-DB databases. Extensive experiments show that with either a binary cross-entropy (BCE) loss function or a focal loss function, the training process can find a model whose recall exceeds a high threshold and whose F1 score is maximal. In particular, SEDNN with the focal loss function performs better than SEDNN with the BCE loss function.
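The abstract does not give the exact TF1 formula, so the sketch below assumes one plausible reading: TF1 equals the ordinary F1 score of the angry (positive) class when recall reaches the lower bound, and 0 otherwise. Model selection then keeps the training checkpoint with the highest TF1. The block also shows the standard binary focal loss of Lin et al. (2017) next to BCE, since those are the two losses being compared. The function names `tf1_score` and `select_best_epoch`, the per-epoch prediction lists, and the defaults `alpha=0.25, gamma=2.0` are illustrative assumptions, not the authors' SEDNN implementation.

```python
import math
from typing import Optional, Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive (angry) class; labels are 0 or 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def tf1_score(y_true: Sequence[int], y_pred: Sequence[int], recall_lower_bound: float) -> float:
    """ASSUMED TF1 (hypothetical): the F1 score if recall >= the lower bound, else 0."""
    _, recall, f1 = precision_recall_f1(y_true, y_pred)
    return f1 if recall >= recall_lower_bound else 0.0


def select_best_epoch(
    y_true: Sequence[int],
    predictions_per_epoch: Sequence[Sequence[int]],
    recall_lower_bound: float,
) -> Tuple[Optional[int], float]:
    """Return (epoch index, TF1) of the checkpoint with the highest TF1.

    Returns (None, 0.0) if no epoch meets the recall bound with a non-zero F1.
    """
    best_epoch: Optional[int] = None
    best_score = 0.0
    for epoch, y_pred in enumerate(predictions_per_epoch):
        score = tf1_score(y_true, y_pred, recall_lower_bound)
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


def bce_loss(y: int, p: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one sample; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def binary_focal_loss(y: int, p: float, alpha: float = 0.25, gamma: float = 2.0, eps: float = 1e-7) -> float:
    """Focal loss -alpha_t * (1 - p_t)**gamma * log(p_t) (Lin et al., 2017).

    The (1 - p_t)**gamma factor down-weights well-classified samples;
    with gamma = 0 it reduces to an alpha_t-weighted BCE.
    """
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

As a toy illustration, take `y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]` and three epochs whose predictions give (recall, F1) of (0.5, 0.667), (1.0, 0.8) and (0.75, 0.857). Plain F1 would pick the third epoch, but with a recall bound of 0.8 `select_best_epoch` picks the second: the lower bound trades some F1 for not missing angry utterances.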


Figs. 1–12



Acknowledgements

This research was partially supported by the National Natural Science Foundation of China under grant Nos. 61472267 and 61702351, the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under grant No. 17KJB520036, the Foundation of Key Laboratory in Science and Technology Development Project of Suzhou under grant No. SZS201609, and the Suzhou Science and Technology Plan Project under grant No. SYG201903.

Funding

This study was funded by the National Natural Science Foundation of China (grant numbers: 61472267, 61702351), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (grant number: 17KJB520036), the Foundation of Key Laboratory in Science and Technology Development Project of Suzhou (grant number: SZS201609), and the Suzhou Science and Technology Plan Project (grant number: SYG201903).

Author information


Correspondence to Xusheng Ai or Victor S. Sheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.


About this article


Cite this article

Ai, X., Sheng, V.S., Fang, W. et al. An optimal model with a lower bound of recall for imbalanced speech emotion recognition. Multimed Tools Appl 79, 24281–24301 (2020). https://doi.org/10.1007/s11042-020-09155-3


  • DOI: https://doi.org/10.1007/s11042-020-09155-3
