TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis

Abstract

Multimodal sentiment analysis is a popular but challenging research topic in natural language processing, in which the individual modalities of a video can contribute very differently to the final sentiment prediction. Along the temporal dimension, natural language sentiment is influenced by nonnatural language (acoustic and visual) sentiment, which may strengthen or weaken the sentiment expressed by the current natural language. Moreover, nonnatural language features are often of poor quality, which fundamentally limits the effect of multimodal fusion. To address these issues, we propose TEDT, a transformer-based multimodal encoding–decoding translation network that adopts a joint encoding–decoding scheme in which text serves as the primary information and sound and image serve as secondary information. To reduce the negative impact of nonnatural language data on natural language data, we propose a modality reinforcement cross-attention module that translates nonnatural language features into the natural language feature space, improving their quality and enabling better integration of multimodal features. In addition, a dynamic filtering mechanism removes erroneous information generated during cross-modal interaction to further improve the final output. We evaluated the proposed method on two multimodal sentiment analysis benchmark datasets (MOSI and MOSEI), on which it achieved accuracies of 89.3% and 85.9%, respectively, outperforming current state-of-the-art methods. Our model can substantially improve multimodal fusion and analyze human sentiment more accurately.
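
To make the two components described in the abstract concrete, the sketch below shows one way a modality reinforcement cross-attention block with a dynamic filter could look: text provides the attention queries, an audio or visual sequence provides the keys and values, and a sigmoid gate decides how much of the translated signal to keep. This is a minimal illustrative sketch in PyTorch; the module names, dimensions, and gating formulation are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a text-queried cross-attention block that maps
# audio/visual (nonnatural language) features into the text feature space,
# followed by a sigmoid gate standing in for a "dynamic filter".
# Module names, dimensions, and the gating formulation are assumptions,
# not the TEDT authors' implementation.
import torch
import torch.nn as nn


class ModalityReinforcementCrossAttention(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Text supplies the queries; the nonnatural language modality supplies
        # keys and values, so the attended output lives in the text space.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Gate deciding, per time step and feature, how much of the translated
        # nonnatural language signal to let through (the "filtering" step).
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # text:  (batch, T_text, d_model)   natural language features
        # other: (batch, T_other, d_model)  audio or visual features
        translated, _ = self.cross_attn(query=text, key=other, value=other)
        g = self.gate(torch.cat([text, translated], dim=-1))
        return self.norm(text + g * translated)  # text reinforced by the other modality


if __name__ == "__main__":
    block = ModalityReinforcementCrossAttention()
    text = torch.randn(2, 20, 128)   # toy text sequence
    audio = torch.randn(2, 50, 128)  # toy audio sequence, already projected to d_model
    print(block(text, audio).shape)  # torch.Size([2, 20, 128])
```

Because the queries come from the text stream, the output stays aligned with the text sequence and can be added back to it as a gated residual, which mirrors the idea of treating text as the primary modality and the other modalities as secondary information.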

Data Availability

The data that support the findings of this study are available in public repositories: MOSI at https://www.jianguoyun.com/p/DV1YEgUQvP35Bhim9d8D and MOSEI at https://www.jianguoyun.com/p/DVw4E7EQvP35Bhir9d8D.

Funding

This work was supported by the National Natural Science Foundation of China under grants 61962057 and U2003208, and by the Autonomous Region Key R&D Project under grant 2021B01002.

Author information

Corresponding author

Correspondence to Shengwei Tian.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, F., Tian, S., Yu, L. et al. TEDT: Transformer-Based Encoding–Decoding Translation Network for Multimodal Sentiment Analysis. Cogn Comput 15, 289–303 (2023). https://doi.org/10.1007/s12559-022-10073-9
