Abstract
Multimodal sentiment analysis is an important and active research field. Most existing methods build fusion modules on top of unimodal representations produced by pretrained models; such designs lack deep interaction among modalities and underexploit the rich semantic and emotional information embedded in text. In addition, previous studies have focused on capturing cross-modal consistency while ignoring differentiated, modality-specific information. We propose a text-enhanced multi-interactive attention and multitask learning network (TEMM). First, syntactic dependency graphs and sentiment graphs of the text are constructed, and additional graph embeddings of the text are obtained with graph convolutional networks and graph attention networks. Then, self-attention and cross-modal attention are applied to explore intramodal and intermodal dynamic interactions, with text serving as the main cue. Finally, a multitask learning framework controls the information flow by monitoring the mutual information between unimodal and multimodal representations and by exploiting the classification properties of the individual modalities, yielding a more comprehensive focus on the information carried by each modality. Experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets show that the proposed model outperforms state-of-the-art models.
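As a rough illustration of the text-guided interaction described above, the sketch below shows one way self-attention and cross-modal attention with text as the main cue could be wired together in PyTorch. The module name, dimensions, and layer layout are assumptions made for illustration only; this is not the published TEMM implementation.

```python
# Minimal sketch (assumed dimensions, not the authors' code): text first attends to
# itself (intramodal dynamics), then text queries attend to audio and vision
# (intermodal dynamics), so text acts as the main cue of the fusion.
import torch
import torch.nn as nn

class TextGuidedCrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_ta = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_tv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio, vision):
        # intramodal interaction: text attends to itself
        t, _ = self.self_attn(text, text, text)
        # intermodal interaction: text queries attend to audio / vision keys and values
        t_a, _ = self.cross_attn_ta(t, audio, audio)
        t_v, _ = self.cross_attn_tv(t, vision, vision)
        # residual combination of the text-centred views
        return self.norm(t + t_a + t_v)

# usage with dummy sequences (batch = 2; sequence lengths differ per modality)
text = torch.randn(2, 30, 128)
audio = torch.randn(2, 50, 128)
vision = torch.randn(2, 40, 128)
fused = TextGuidedCrossModalAttention()(text, audio, vision)  # shape (2, 30, 128)
```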




Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 72071061).
Author information
Contributions
Bengong Yu was responsible for conceptualizing the experimental plan, revising the research content, and writing. Zhongyu Shi was responsible for proposing the research ideas, conducting the experimental analysis, and writing the draft.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, B., Shi, Z. TEMM: text-enhanced multi-interactive attention and multitask learning network for multimodal sentiment analysis. J Supercomput 80, 25563–25589 (2024). https://doi.org/10.1007/s11227-024-06422-0