
Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection

Neural Processing Letters

Abstract

Social media allows users to express opinions through multiple modalities, such as text, images, and short videos. Multi-modal sentiment detection can predict the emotional tendencies expressed by users more effectively and has therefore received extensive attention in recent years. However, current works treat the utterances of a video as independent modalities, ignoring the effective interactions among the different modalities of a video. To tackle these challenges, we propose a transformer-based interactive multi-modal attention network that investigates multi-modal paired attention between modalities and utterances for video sentiment detection. Specifically, we first take a series of utterances as input and use three separate transformer encoders to capture the utterance-level features of each modality. Subsequently, we introduce a multi-modal paired attention mechanism to learn the cross-modality information between modalities and utterances. Finally, we inject the cross-modality information into a multi-head self-attention layer to make the final emotion and sentiment classification. Our solution outperforms baseline models on three multi-modal datasets.
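To make the pipeline described above concrete, the following is a minimal PyTorch sketch, not the authors' released code. It assumes utterance-level features for the textual, acoustic, and visual modalities have already been extracted into fixed-size vectors, illustrates paired attention only for the text–audio and text–visual pairs, and uses hypothetical module names, dimensions, and hyperparameters throughout.

```python
# Minimal sketch of the described architecture (assumptions noted above): three
# per-modality transformer encoders, paired cross-modal attention, and a final
# multi-head self-attention layer before classification.
import torch
import torch.nn as nn

class PairedAttention(nn.Module):
    """Cross-modal paired attention: queries from one modality, keys/values from another."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, context_mod):
        out, _ = self.attn(query_mod, context_mod, context_mod)
        return out

class InteractiveMultiModalNet(nn.Module):
    def __init__(self, dim=128, heads=4, num_classes=2):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # One transformer encoder per modality (text, audio, visual).
        self.text_enc, self.audio_enc, self.visual_enc = make_encoder(), make_encoder(), make_encoder()
        # Paired attention between modalities (only two pairs shown for brevity).
        self.text_audio = PairedAttention(dim, heads)
        self.text_visual = PairedAttention(dim, heads)
        # Multi-head self-attention layer into which cross-modality information is injected.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio, visual):
        # Inputs: (batch, num_utterances, dim) utterance-level features per modality.
        t = self.text_enc(text)
        a = self.audio_enc(audio)
        v = self.visual_enc(visual)
        # Cross-modality information via paired attention.
        cross = self.text_audio(t, a) + self.text_visual(t, v)
        # Re-encode the fused representation with self-attention, then classify.
        fused, _ = self.fuse(cross, cross, cross)
        return self.classifier(fused)  # per-utterance sentiment logits

# Example with random features: 2 videos, 5 utterances each, 128-dim vectors per modality.
model = InteractiveMultiModalNet()
logits = model(torch.randn(2, 5, 128), torch.randn(2, 5, 128), torch.randn(2, 5, 128))
print(logits.shape)  # torch.Size([2, 5, 2])
```

In this sketch the paired-attention outputs are simply summed before the final self-attention layer; the paper's actual pairing scheme, fusion, and classification details may differ.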



Acknowledgements

This work was supported by the following grants: National Natural Science Foundation of China (No. 61772321); Shandong Natural Science Foundation (ZR202011020044); Natural Science Foundation of China (No. 81973981); Key Project of Research and Development in Shandong Province (No. 2019RKB14090); Project of Traditional Chinese Medicine and Technology Development Plan Program in Shandong Province (No. 2019-0018); Shandong Postgraduate Education Quality Improvement Plan (SDYKC19147).

Author information

Corresponding author

Correspondence to Fangai Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhuang, X., Liu, F., Hou, J. et al. Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection. Neural Process Lett 54, 1943–1960 (2022). https://doi.org/10.1007/s11063-021-10713-5
