
Multimodal sentiment analysis with asymmetric window multi-attentions

  • 1182: Deep Processing of Multimedia Data
  • Published in: Multimedia Tools and Applications

Abstract

Multimodal sentiment analysis is an actively developing field of research. The central problem in this domain is modelling both intra-modality and inter-modality dynamics, yet most existing work handles these two aspects poorly. In this study, we introduce a novel model that addresses both. The key idea is to weight the contexts around a particular timestamp asymmetrically by means of asymmetric windows. Multiple separate attentions are then performed over these contexts, producing an updated representation of that timestamp. Each modality-specific representation is multiplied by a weight vector produced by a neural network, and the weighted results are merged by addition. Experiments on the MOSI dataset show that our model outperforms the compared methods.
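The abstract only sketches the architecture, so the following is a minimal, hypothetical PyTorch sketch of the ideas it describes: an asymmetric context window around each timestamp, several separate attentions over that window, and per-modality representations scaled by learned weight vectors and merged by addition. All class names, window sizes, gate designs, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the mechanism described in the abstract.
# Window sizes, module names, and dimensions are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricWindowMultiAttention(nn.Module):
    def __init__(self, dim, num_attentions=4, left=6, right=2):
        super().__init__()
        self.left, self.right = left, right          # asymmetric context window
        # one scoring head per attention, each producing its own context summary
        self.scorers = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_attentions))
        self.merge = nn.Linear(num_attentions * dim, dim)

    def forward(self, x):
        # x: (batch, time, dim) features of a single modality
        B, T, D = x.shape
        out = []
        for t in range(T):
            lo, hi = max(0, t - self.left), min(T, t + self.right + 1)
            ctx = x[:, lo:hi]                         # asymmetric context around t
            heads = []
            for scorer in self.scorers:               # multiple separate attentions
                w = F.softmax(scorer(ctx), dim=1)     # (B, window, 1) attention weights
                heads.append((w * ctx).sum(dim=1))    # weighted context summary
            out.append(self.merge(torch.cat(heads, dim=-1)))
        return torch.stack(out, dim=1)                # updated per-timestamp features


class GatedAdditiveFusion(nn.Module):
    """Each modality representation is multiplied by a weight vector produced
    by a small network, and the weighted results are merged by addition."""
    def __init__(self, dim, num_modalities=3):
        super().__init__()
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))

    def forward(self, reps):
        # reps: list of (batch, time, dim) tensors, one per modality
        fused = 0
        for rep, gate in zip(reps, self.gates):
            fused = fused + torch.sigmoid(gate(rep)) * rep
        return fused


if __name__ == "__main__":
    # toy inputs standing in for text, audio, and visual feature sequences
    text, audio, video = (torch.randn(2, 20, 64) for _ in range(3))
    attn = AsymmetricWindowMultiAttention(dim=64)
    fusion = GatedAdditiveFusion(dim=64)
    fused = fusion([attn(m) for m in (text, audio, video)])
    print(fused.shape)  # torch.Size([2, 20, 64])
```

In this sketch the asymmetry is simply a longer look-back than look-ahead around each timestamp; the paper's actual weighting scheme, attention count, and fusion network may differ.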


Notes

  1. https://github.com/A2Zadeh


Acknowledgements

This research was supported in part by the Science and Technology Program of Guangzhou (202102020878), the National Natural Science Foundation of China (62006053), the Special Innovation Project of the Guangdong Education Department (2018KQNCX072), the Youth Innovative Talents Project in Guangdong Universities (2020KQNCX186), the Fourth College-Level Project of Guangdong Justice Police Vocational College (2020YB16), and the 13th Five-Year Plan of the Guangdong Institute of Higher Education Research on Higher Education of Young Teachers in Colleges and Universities in 2019 (19GGZ070). The authors thank Ziang Liu for revising the English grammar of the paper.

Author information

Corresponding author

Correspondence to Xueming Yan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Helang Lai and Xueming Yan contributed equally to this work.

About this article

Cite this article

Lai, H., Yan, X. Multimodal sentiment analysis with asymmetric window multi-attentions. Multimed Tools Appl 81, 19415–19428 (2022). https://doi.org/10.1007/s11042-021-11234-y

