Abstract
Multimodal sentiment analysis is an actively developing research field whose central problem is modeling both intra-modality and inter-modality dynamics. Most existing work, however, handles these two kinds of dynamics poorly. In this study, we introduce a novel model that addresses both. Its novelty lies in representing the asymmetric influence of the contexts around a given timestamp with asymmetric windows: multiple separate attentions are performed over these contexts, producing an updated representation for that timestamp. The representation for each modality is then multiplied by a weight vector produced by a neural network, and the weighted per-modality results are merged by addition. Experiments on the MOSI dataset show that our model outperforms the compared methods.
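The fusion pipeline sketched in the abstract (asymmetric context windows, per-timestamp attention, and gated additive merging across modalities) can be illustrated in miniature. This is a minimal sketch, not the authors' exact architecture: the window sizes, the dot-product attention scoring, and the sigmoid gating network (`W`) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def asymmetric_window_attention(seq, left=3, right=1):
    """For each timestamp t, attend over the asymmetric context window
    [t-left, t+right] and return the attention-updated sequence."""
    T, d = seq.shape
    out = np.zeros_like(seq)
    for t in range(T):
        lo, hi = max(0, t - left), min(T, t + right + 1)
        ctx = seq[lo:hi]                    # (w, d) context frames
        scores = ctx @ seq[t] / np.sqrt(d)  # similarity to the current frame
        alpha = softmax(scores)             # asymmetric context weights
        out[t] = alpha @ ctx                # weighted sum over the window
    return out

# Three modality sequences (e.g., text, audio, visual) of length T, dim d.
T, d = 8, 4
modalities = [rng.standard_normal((T, d)) for _ in range(3)]

# Hypothetical per-modality gating network: a linear map plus sigmoid
# producing a weight vector per timestamp (parameters are random here).
W = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]

fused = np.zeros((T, d))
for m, Wm in zip(modalities, W):
    h = asymmetric_window_attention(m, left=3, right=1)
    gate = 1.0 / (1.0 + np.exp(-(h @ Wm)))  # weight vector in (0, 1)
    fused += gate * h                        # multiply, then merge by addition

print(fused.shape)  # (8, 4): one fused representation per timestamp
```

Choosing `left > right` encodes the assumption that past context carries more weight than future context at each timestamp; swapping the two values reverses that asymmetry.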
Acknowledgements
This research was supported in part by the Science and Technology Program of Guangzhou (202102020878), the National Natural Science Foundation of China (62006053), the Special Innovation Project of the Guangdong Education Department (2018KQNCX072), the Youth Innovative Talents Project in Guangdong Universities (2020KQNCX186), the Fourth College-Level Project of Guangdong Justice Police Vocational College (2020YB16), and the 13th Five-Year Plan of the Guangdong Institute of Higher Education, Research on Higher Education of Young Teachers in Colleges and Universities in 2019 (19GGZ070). We thank Ziang Liu for revising the English grammar of the paper.
Helang Lai and Xueming Yan contributed equally to this work.
Cite this article
Lai, H., Yan, X. Multimodal sentiment analysis with asymmetric window multi-attentions. Multimed Tools Appl 81, 19415–19428 (2022). https://doi.org/10.1007/s11042-021-11234-y