Text-dominant multimodal perception network for sentiment analysis based on cross-modal semantic enhancements


Abstract

Multimodal sentiment analysis (MSA) aims to discern the emotional information that users express in the multimodal data they upload to social media platforms. Most previous studies treat the three modalities (audio, visual, and text) equally, overlooking the lower representation quality inherent in the audio and visual modalities. This oversight often yields inaccurate interaction information when the audio or visual modality serves as the primary input, which in turn degrades the model’s sentiment predictions. In this paper, we propose a text-dominant multimodal perception network with cross-modal transformer-based semantic enhancement. The network consists of two main components: a text-dominant multimodal perception (TDMP) module and a cross-modal transformer-based semantic enhancement (TSE) module. TDMP lets the text modality dominate intermodal interactions, extracting highly correlated and discriminative features from each modality and thereby obtaining more accurate representations for each modality. The TSE module uses a transformer architecture to translate the audio and visual modality features into text features. By applying KL-divergence constraints, it ensures that the translated modality representations capture as much emotional information as possible while remaining highly similar to the original text modality representations. This enhances the semantics of the original text modality while mitigating the negative impact of low-quality audio and visual inputs. Extensive experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate the effectiveness of the proposed model.

Graphical abstract

Overview of the proposed text-dominant multimodal perception network
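To make the TSE idea described in the abstract more concrete, the sketch below shows one way the cross-modal translation and KL-divergence constraint could be wired up in PyTorch. This is an illustrative assumption, not the authors' implementation: the class and function names (CrossModalTranslator, kl_alignment_loss), the feature dimensions, and the softmax-based KL formulation are placeholders chosen for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalTranslator(nn.Module):
    """Translates a source-modality sequence (e.g., audio) into the text feature space."""

    def __init__(self, src_dim: int, txt_dim: int, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(src_dim, txt_dim)  # align source width with text width
        layer = nn.TransformerDecoderLayer(d_model=txt_dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, src_seq: torch.Tensor, txt_seq: torch.Tensor) -> torch.Tensor:
        # Text tokens act as queries; the projected source modality supplies keys/values.
        return self.decoder(tgt=txt_seq, memory=self.proj(src_seq))


def kl_alignment_loss(translated: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """KL divergence pulling the translated representation toward the text representation."""
    log_p = F.log_softmax(translated, dim=-1)  # translated features as log-probabilities
    q = F.softmax(text, dim=-1)                # text features as the reference distribution
    return F.kl_div(log_p, q, reduction="batchmean")


# Toy usage with assumed sizes: batch 4, sequence length 20, audio dim 74, text dim 768.
audio = torch.randn(4, 20, 74)
text = torch.randn(4, 20, 768)
translator = CrossModalTranslator(src_dim=74, txt_dim=768)
audio_as_text = translator(audio, text)        # audio translated into the text space
loss = kl_alignment_loss(audio_as_text, text)  # constraint used alongside the task loss
```

In this sketch the text sequence supplies the decoder queries, so the translated audio representation is forced into the text feature space before the KL term pulls its softmax-normalised distribution toward the text representation; the same translator could be instantiated a second time for the visual modality.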


Data availability and access

The data will be made available upon reasonable request.


Acknowledgements

The authors would like to express their gratitude for the assistance provided by their host institutions, whose support has been instrumental in making this collaborative work successful. We also extend our heartfelt thanks to the friends and colleagues who provided valuable feedback on this work.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 61702462 and 61906175; the Natural Science Foundation of Henan under Grants 242300421220 and XJTLU RDF-21-02-008; Jiangsu Double-Innovation Plan No. JSSCBS20230474; the Henan Provincial Science and Technology Research Project under Grants 232102211006, 232102210044, 232102211017, 232102210055 and 222102210214; the Science and Technology Innovation Project of Zhengzhou University of Light Industry under Grant 23XNKJTD0205; the Undergraduate Universities Smart Teaching Special Research Project of Henan Province under Grant Jiao Gao [2021] No. 489-29; and the Doctor Natural Science Foundation of Zhengzhou University of Light Industry under Grants 2021BSJJ025 and 2022BSJJZK13.

Author information


Contributions

  • Zuhe Li: Methodology design and development; creation of models; preparation, creation and/or presentation of the published work, specifically writing the initial draft (including substantive translation).
  • Panbo Liu: Conducting the research and investigation process, specifically performing the experiments and data/evidence collection; programming, software development, designing computer programs, implementing computer code and supporting algorithms, and testing existing code components.
  • Yushan Pan: Formulation or evolution of overarching research goals and aims; development or design of methodology and creation of models; oversight and leadership responsibility for research activity planning and execution, including mentorship external to the core team.
  • Jun Yu: Verification, whether as part of the activity or separately, of the overall replication/reproducibility of the results/experiments and other research outputs.
  • Weihua Liu: Preparation, creation and/or presentation of the published work, specifically writing the initial draft (including substantive translation).
  • Haoran Chen: Preparation, creation and/or presentation of the published work, specifically writing the initial draft (including substantive translation).
  • Yiming Luo: Management activities to annotate (produce metadata), scrub data, and maintain research data (including software code where it is necessary to interpret the data) for initial use and later reuse.
  • Hao Wang: Preparation, creation and/or presentation of the published work by those from the original research group, specifically critical review, commentary, or revision, including the pre- or post-publication stages.

Corresponding author

Correspondence to Yushan Pan.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent for data used

The authors confirm that there were no human research participants involved in this study and that the research received approval from the university’s ethical committee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Z., Liu, P., Pan, Y. et al. Text-dominant multimodal perception network for sentiment analysis based on cross-modal semantic enhancements. Appl Intell 55, 188 (2025). https://doi.org/10.1007/s10489-024-06150-1
