TEMM: text-enhanced multi-interactive attention and multitask learning network for multimodal sentiment analysis

The Journal of Supercomputing

Abstract

Multimodal sentiment analysis is an important and active research field. Most existing methods build fusion modules on top of unimodal representations produced by pretrained models, which limits deep interaction among the modalities and underuses the rich semantic and emotional information embedded in text. In addition, previous studies have focused on capturing information that is consistent across modalities while ignoring modality-specific, differentiated information. We propose a text-enhanced multi-interactive attention and multitask learning network (TEMM). First, syntactic dependency graphs and sentiment graphs of the text are constructed, and additional graph embedding representations of the text are obtained with graph convolutional networks and graph attention networks. Then, self-attention and cross-modal attention are applied to model intramodal and intermodal dynamic interactions, with text serving as the main cue. Finally, a multitask learning framework controls the information flow by monitoring the mutual information between the unimodal and multimodal representations and by exploiting the classification properties of each unimodal representation, yielding a more comprehensive use of modal information. Experimental results on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets show that the proposed model outperforms state-of-the-art models.
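To make the text-as-main-cue interaction described above concrete, the following is a minimal, hypothetical PyTorch sketch of a cross-modal attention step in which the text sequence queries audio (or visual) features. It is not the authors' implementation; the class name TextQueryCrossAttention, the dimensions, and the random inputs are illustrative assumptions only.

    # Illustrative sketch (not the authors' code): text queries another modality.
    import torch
    import torch.nn as nn

    class TextQueryCrossAttention(nn.Module):
        def __init__(self, d_model: int = 128, n_heads: int = 4):
            super().__init__()
            # Text attends over audio or visual features of the same width.
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
            # text:  (batch, len_text, d_model)  -- queries (text as the main cue)
            # other: (batch, len_other, d_model) -- keys/values from audio or visual
            attended, _ = self.cross_attn(query=text, key=other, value=other)
            return self.norm(text + attended)  # residual connection + layer norm

    if __name__ == "__main__":
        # Random tensors stand in for projected unimodal encoder outputs.
        text = torch.randn(2, 50, 128)    # e.g. token features projected to d_model
        audio = torch.randn(2, 200, 128)  # e.g. acoustic frame features projected
        fused = TextQueryCrossAttention()(text, audio)
        print(fused.shape)  # torch.Size([2, 50, 128])

In the full model, the outputs of such text-queried attention blocks would be combined with the graph-based text representations and passed to the multitask fusion head; only the attention step is sketched here.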


Data availability

No datasets were generated or analysed during the current study.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 72071061).

Author information


Contributions

Bengong Yu was responsible for conceptualizing the experimental plan, revising the research content, and writing. Zhongyu Shi was responsible for proposing the research ideas, conducting the experimental analysis, and writing the draft.

Corresponding author

Correspondence to Bengong Yu.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yu, B., Shi, Z. TEMM: text-enhanced multi-interactive attention and multitask learning network for multimodal sentiment analysis. J Supercomput 80, 25563–25589 (2024). https://doi.org/10.1007/s11227-024-06422-0



  • DOI: https://doi.org/10.1007/s11227-024-06422-0
