Abstract
Multimodal sentiment analysis is an actively developing research field whose central problem is modeling both intra-modality and inter-modality dynamics. Most existing work, however, handles these two kinds of dynamics poorly. In this study, we introduce a novel model that addresses both. Its novelty lies in representing the asymmetric influence of the contexts around a given timestamp with asymmetric windows: multiple separate attentions are performed over these contexts, producing an updated representation for that timestamp. The representation for each modality is then multiplied by a weight vector produced by a neural network, and the weighted per-modality results are merged by addition. Experiments on the MOSI dataset show that our model outperforms the compared methods.
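The fusion pipeline sketched in the abstract (asymmetric context windows, per-timestamp attention, and gated additive merging across modalities) can be illustrated in miniature. This is a minimal sketch, not the authors' exact architecture: the window sizes, the dot-product attention scoring, and the sigmoid gating network (`W`) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def asymmetric_window_attention(seq, left=3, right=1):
    """For each timestamp t, attend over the asymmetric context window
    [t-left, t+right] and return the attention-updated sequence."""
    T, d = seq.shape
    out = np.zeros_like(seq)
    for t in range(T):
        lo, hi = max(0, t - left), min(T, t + right + 1)
        ctx = seq[lo:hi]                    # (w, d) context frames
        scores = ctx @ seq[t] / np.sqrt(d)  # similarity to the current frame
        alpha = softmax(scores)             # asymmetric context weights
        out[t] = alpha @ ctx                # weighted sum over the window
    return out

# Three modality sequences (e.g., text, audio, visual) of length T, dim d.
T, d = 8, 4
modalities = [rng.standard_normal((T, d)) for _ in range(3)]

# Hypothetical per-modality gating network: a linear map plus sigmoid
# producing a weight vector per timestamp (parameters are random here).
W = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]

fused = np.zeros((T, d))
for m, Wm in zip(modalities, W):
    h = asymmetric_window_attention(m, left=3, right=1)
    gate = 1.0 / (1.0 + np.exp(-(h @ Wm)))  # weight vector in (0, 1)
    fused += gate * h                        # multiply, then merge by addition

print(fused.shape)  # (8, 4): one fused representation per timestamp
```

Choosing `left > right` encodes the assumption that past context carries more weight than future context at each timestamp; swapping the two values reverses that asymmetry.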
Acknowledgements
This research was supported in part by the Science and Technology Program of Guangzhou (202102020878), the National Natural Science Foundation of China (62006053), the Special Innovation Project of the Guangdong Education Department (2018KQNCX072), the Youth Innovative Talents Project in Guangdong Universities (2020KQNCX186), the Fourth College-Level Project of Guangdong Justice Police Vocational College (2020YB16), and the 13th Five-Year Plan of the Guangdong Institute of Higher Education, Research on Higher Education of Young Teachers in Colleges and Universities in 2019 (19GGZ070). We thank Ziang Liu for revising the English grammar of the paper.
Helang Lai and Xueming Yan contributed equally to this work.
Cite this article
Lai, H., Yan, X. Multimodal sentiment analysis with asymmetric window multi-attentions. Multimed Tools Appl 81, 19415–19428 (2022). https://doi.org/10.1007/s11042-021-11234-y