Abstract
Multimodal sentiment analysis is an important and active research field. Most existing methods build fusion modules on top of unimodal representations produced by pretrained models; such designs lack deep interaction among modalities and underexploit the rich semantic and emotional information embedded in text. In addition, previous studies have focused on capturing cross-modal consistency while ignoring differentiated, modality-specific information. We propose a text-enhanced multi-interactive attention and multitask learning network (TEMM). First, syntactic dependency graphs and sentiment graphs of the text are constructed, and additional graph embeddings of the text are obtained with graph convolutional networks and graph attention networks. Then, self-attention and cross-modal attention are applied to explore intramodal and intermodal dynamic interactions, with text serving as the main cue. Finally, a multitask learning framework controls the information flow by monitoring the mutual information between unimodal and multimodal representations and by exploiting the classification properties of the individual modalities, yielding a more comprehensive focus on the information carried by each modality. Experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets show that the proposed model outperforms state-of-the-art models.
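As a rough illustration of the text-guided interaction described above, the sketch below shows one way self-attention and cross-modal attention with text as the main cue could be wired together in PyTorch. The module name, dimensions, and layer layout are assumptions made for illustration only; this is not the published TEMM implementation.

```python
# Minimal sketch (assumed dimensions, not the authors' code): text first attends to
# itself (intramodal dynamics), then text queries attend to audio and vision
# (intermodal dynamics), so text acts as the main cue of the fusion.
import torch
import torch.nn as nn

class TextGuidedCrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_ta = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_tv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio, vision):
        # intramodal interaction: text attends to itself
        t, _ = self.self_attn(text, text, text)
        # intermodal interaction: text queries attend to audio / vision keys and values
        t_a, _ = self.cross_attn_ta(t, audio, audio)
        t_v, _ = self.cross_attn_tv(t, vision, vision)
        # residual combination of the text-centred views
        return self.norm(t + t_a + t_v)

# usage with dummy sequences (batch = 2; sequence lengths differ per modality)
text = torch.randn(2, 30, 128)
audio = torch.randn(2, 50, 128)
vision = torch.randn(2, 40, 128)
fused = TextGuidedCrossModalAttention()(text, audio, vision)  # shape (2, 30, 128)
```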




Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 72071061).
Author information
Contributions
Bengong Yu was responsible for conceptualizing the experimental plan, revising the research content, and writing. Zhongyu Shi was responsible for proposing the research ideas, conducting the experimental analysis, and writing the draft.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yu, B., Shi, Z. TEMM: text-enhanced multi-interactive attention and multitask learning network for multimodal sentiment analysis. J Supercomput 80, 25563–25589 (2024). https://doi.org/10.1007/s11227-024-06422-0