
Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

Neural Processing Letters

Abstract

Social media allows users to express opinions through multiple modalities, such as text, images, and short videos. Multi-modal sentiment detection can predict the emotional tendencies expressed by users more effectively and has therefore received extensive attention in recent years. However, current works treat the utterances of a video as independent modalities, ignoring the effective interactions among the different modalities of a video. To tackle these challenges, we propose a transformer-based interactive multi-modal attention network that investigates multi-modal paired attention between modalities and utterances for video sentiment detection. Specifically, we first take a series of utterances as input and use three separate transformer encoders to capture the utterance-level features of each modality. Subsequently, we introduce a multi-modal paired attention mechanism to learn the cross-modality information between modalities and utterances. Finally, we inject the cross-modality information into a multi-head self-attention layer to make the final emotion and sentiment classification. Our solution outperforms baseline models on three multi-modal datasets.
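To make the pipeline described above concrete, the following is a minimal PyTorch sketch, not the authors' released code. It assumes utterance-level features for the textual, acoustic, and visual modalities have already been extracted into fixed-size vectors, illustrates paired attention only for the text–audio and text–visual pairs, and uses hypothetical module names, dimensions, and hyperparameters throughout.

```python
# Minimal sketch of the described architecture (assumptions noted above): three
# per-modality transformer encoders, paired cross-modal attention, and a final
# multi-head self-attention layer before classification.
import torch
import torch.nn as nn

class PairedAttention(nn.Module):
    """Cross-modal paired attention: queries from one modality, keys/values from another."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, context_mod):
        out, _ = self.attn(query_mod, context_mod, context_mod)
        return out

class InteractiveMultiModalNet(nn.Module):
    def __init__(self, dim=128, heads=4, num_classes=2):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # One transformer encoder per modality (text, audio, visual).
        self.text_enc, self.audio_enc, self.visual_enc = make_encoder(), make_encoder(), make_encoder()
        # Paired attention between modalities (only two pairs shown for brevity).
        self.text_audio = PairedAttention(dim, heads)
        self.text_visual = PairedAttention(dim, heads)
        # Multi-head self-attention layer into which cross-modality information is injected.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio, visual):
        # Inputs: (batch, num_utterances, dim) utterance-level features per modality.
        t = self.text_enc(text)
        a = self.audio_enc(audio)
        v = self.visual_enc(visual)
        # Cross-modality information via paired attention.
        cross = self.text_audio(t, a) + self.text_visual(t, v)
        # Re-encode the fused representation with self-attention, then classify.
        fused, _ = self.fuse(cross, cross, cross)
        return self.classifier(fused)  # per-utterance sentiment logits

# Example with random features: 2 videos, 5 utterances each, 128-dim vectors per modality.
model = InteractiveMultiModalNet()
logits = model(torch.randn(2, 5, 128), torch.randn(2, 5, 128), torch.randn(2, 5, 128))
print(logits.shape)  # torch.Size([2, 5, 2])
```

In this sketch the paired-attention outputs are simply summed before the final self-attention layer; the paper's actual pairing scheme, fusion, and classification details may differ.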



Acknowledgements

This work was supported by the following grants: National Natural Science Foundation of China (No. 61772321); Shandong Natural Science Foundation (ZR202011020044); Natural Science Foundation of China (No. 81973981); Key Project of Research and Development in Shandong Province (No. 2019RKB14090); Project of Traditional Chinese Medicine and Technology Development Plan Program in Shandong Province (No. 2019-0018); Shandong Postgraduate Education Quality Improvement Plan (SDYKC19147).

Author information

Corresponding author

Correspondence to Fangai Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhuang, X., Liu, F., Hou, J. et al. Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection. Neural Process Lett 54, 1943–1960 (2022). https://doi.org/10.1007/s11063-021-10713-5
