
Video multimodal emotion recognition based on Bi-GRU and attention fusion

Abstract

This paper proposes a video multimodal emotion recognition method based on a bidirectional gated recurrent unit (Bi-GRU) and attention fusion. The Bi-GRU is applied to improve the accuracy of emotion recognition over temporal contexts, and a new network initialization method is proposed and applied to the model, further improving the accuracy of time-contextual learning for video emotion recognition. To overcome the problem that every modality is given the same weight in multimodal fusion, a video multimodal emotion recognition method based on an attention fusion network is proposed. The attention fusion network computes the attention distribution over the modalities at each moment in real time, so the model can learn multimodal contextual information as it arrives. Experimental results show that the proposed method improves recognition accuracy for each of the three individual modalities (text, visual, and audio) as well as for multimodal video emotion recognition, and that it outperforms existing state-of-the-art multimodal emotion recognition methods on both sentiment classification and sentiment regression.
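To make the fusion scheme concrete, the sketch below shows one way such a model could be assembled in PyTorch. It is not the authors' implementation: the feature dimensions, hidden size, scoring layer, and mean pooling over time are illustrative assumptions. Each modality is encoded by its own Bi-GRU, and at every time step a softmax over the modality axis yields the attention distribution used to fuse the hidden states.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    # Scores each modality's hidden state at every time step and fuses
    # the modalities with a softmax-weighted sum (attention over modalities).
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, states):
        # states: (batch, num_modalities, time, hidden_dim)
        weights = F.softmax(self.score(states), dim=1)   # per-time-step attention over modalities
        fused = (weights * states).sum(dim=1)            # (batch, time, hidden_dim)
        return fused, weights

class BiGRUAttentionFusion(nn.Module):
    # Per-modality Bi-GRU encoders, attention fusion, and an utterance-level classifier.
    def __init__(self, input_dims, hidden_dim=64, num_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.GRU(d, hidden_dim, batch_first=True, bidirectional=True) for d in input_dims]
        )
        self.fusion = AttentionFusion(2 * hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, inputs):
        # inputs: one (batch, time, feature_dim) tensor per modality, equal time lengths
        states = [enc(x)[0] for enc, x in zip(self.encoders, inputs)]
        states = torch.stack(states, dim=1)              # (batch, modalities, time, 2*hidden)
        fused, _ = self.fusion(states)
        return self.classifier(fused.mean(dim=1))        # logits per video segment

# Hypothetical feature sizes: 300-d text, 35-d visual, 74-d audio, 20 time steps
model = BiGRUAttentionFusion(input_dims=[300, 35, 74])
text, visual, audio = torch.randn(8, 20, 300), torch.randn(8, 20, 35), torch.randn(8, 20, 74)
print(model([text, visual, audio]).shape)                # torch.Size([8, 2])

Because the softmax is taken over the modality axis rather than the time axis, the weights form a per-time-step distribution across text, visual, and audio, matching the behaviour the abstract describes; a regression head for sentiment intensity would follow the same pattern.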

Availability of data and material

Data and material are fully available without restriction.

Acknowledgements

This work was supported by the Zhejiang Provincial Natural Science Foundation of China [grant number LY19F020032] and the National Natural Science Foundation of China [grant numbers U1909203 and 61872322].

Funding

This study was funded by the Zhejiang Provincial Natural Science Foundation of China (grant number LY19F020032) and the National Natural Science Foundation of China (grant numbers U1909203 and 61872322).

Author information

Corresponding author

Correspondence to Ruo-Hong Huan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

Custom code is available without restriction.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Huan, RH., Shu, J., Bao, SL. et al. Video multimodal emotion recognition based on Bi-GRU and attention fusion. Multimed Tools Appl 80, 8213–8240 (2021). https://doi.org/10.1007/s11042-020-10030-4
