A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

Abstract

Feature fusion for multimodal sentiment analysis is a challenging but worthwhile research topic. As sequences extend along the time dimension, multimodal signals interact with one another, and a lack of control over the target modal representations during fusion leads to erroneous shifts of vectors in the feature space. Moreover, ignoring how the target modal features are represented under different fusion orders may leave complementary information insufficiently fused. To address these issues, this paper proposes a transformer-encoder-based multimodal multi-attention fusion network. The model constructs a multi-attention fusion transformer encoder to learn inter-modal consistent features and enhance the representation of the target modal features. For each target modality, the model also constructs multi-attention fusion transformer encoders with different fusion orders to capture complementary features among sequences fused in different orders. The three target modal representations, which contain both consistent and complementary features, are then fused with the initial features through residual connections to guide the final sentiment analysis. We conduct extensive experiments on three public multimodal datasets. The results show that our approach outperforms the compared multimodal sentiment analysis methods on most metrics and can explain the contributions of inter- and intra-modal interactions across modalities.
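
The abstract only outlines the architecture, and the full model is not reproduced on this page. As a minimal, illustrative sketch of the general idea only (not the authors' implementation; the class and function names below, such as CrossModalFusionBlock and fuse_in_order, are hypothetical), a target modality can attend to the other modalities through cross-modal multi-head attention, the same sources can be fused in different orders to obtain complementary views, and the results can be combined with the initial features through a residual connection:

```python
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Illustrative only: the target modality supplies the queries and a source
    modality supplies the keys/values, followed by a feed-forward sublayer, in
    the style of a standard transformer encoder layer."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        h, _ = self.attn(target, source, source)      # cross-modal attention
        target = self.norm1(target + h)               # residual + layer norm
        return self.norm2(target + self.ff(target))   # feed-forward + residual


def fuse_in_order(target, sources, blocks):
    """Fuse the target with the other modalities in one particular order;
    a different order over the same sources gives a complementary view."""
    out = target
    for source, block in zip(sources, blocks):
        out = block(out, source)
    return out


# Hypothetical usage for a text target with audio/vision sources,
# each of shape (batch, seq_len, d_model) after unimodal encoding:
# blocks_av = nn.ModuleList([CrossModalFusionBlock(128), CrossModalFusionBlock(128)])
# blocks_va = nn.ModuleList([CrossModalFusionBlock(128), CrossModalFusionBlock(128)])
# fused_text = text + fuse_in_order(text, [audio, vision], blocks_av) \
#                   + fuse_in_order(text, [vision, audio], blocks_va)
```

Summing the outputs of the two fusion orders with the initial representation mirrors, at a high level, the residual fusion of consistent and complementary features described in the abstract.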

Data Availability

The CMU-MOSI and CMU-MOSEI datasets that support the findings of this study are available in the data repository: https://github.com/thuiar/Self-MM. The IEMOCAP dataset that supports the findings of this study is available in the data repository: https://github.com/yaohungt/Multimodal-Transformer.

Code Availability

Not applicable.

References

  1. Yadollahi A, Shahraki AG, Zaiane OR (2017) Current state of text sentiment analysis from opinion to emotion mining. ACM Comput Surv 50(2):1–33. https://doi.org/10.1145/3057270

  2. Hu J, Peng J, Zhang W, Qi L, Hu M, Zhang H (2021) An intention multiple-representation model with expanded information. Comput Speech & Lang 68:101196. https://doi.org/10.1016/j.csl.2021.101196

  3. Huang B, Zhang J, Ju J, Guo R, Fujita H, Liu J (2023) CRF-GCN: An effective syntactic dependency model for aspect-level sentiment analysis. Knowl-Based Syst 260:110125. https://doi.org/10.1016/j.knosys.2022.110125

  4. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019:4171–4186

  5. Stöckli S, Schulte-Mecklenbeck M, Borer S, Samson AC (2018) Facial expression analysis with affdex and facet: A validation study. Behav Res Methods 50:1446–1460. https://doi.org/10.3758/s13428-017-0996-1

  6. Degottex G, Kane J, Drugman T, Raitio T, Scherer S (2014) COVAREP-A collaborative voice analysis repository for speech technologies. ICASSP 2014:960–964. https://doi.org/10.1109/ICASSP.2014.6853739

  7. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

  8. SravyaPranati B, Suma D, ManjuLatha C, Putheti S (2014) Large-scale video classification with convolutional neural networks. CVPR 2014:1725–1732

  9. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:5999–6009

  10. Wang F, Tian S, Yu L, Liu J, Wang J, Li K, Wang Y (2023) TEDT: transformer-based encoding-decoding translation network for multimodal sentiment analysis. Cognit Comput 15(1):289–303. https://doi.org/10.1007/s12559-022-10073-9

  11. Zhang F, Li XC, Lim CP, Hua Q, Dong CR, Zhai JH (2022) Deep emotional arousal network for multimodal sentiment analysis and emotion recognition. Inf Fusion 88:296–304. https://doi.org/10.1016/j.inffus.2022.07.006

  12. Zhu L, Zhu Z, Zhang C, Xu Y, Kong X (2023) Multimodal sentiment analysis based on fusion methods: A survey. Inf Fusion 95:306–325. https://doi.org/10.1016/j.inffus.2023.02.028

  13. Zeng Y, Li Z, Tang Z, Chen Z, Ma H (2023) Heterogeneous graph convolution based on In-domain Self-supervision for Multimodal Sentiment Analysis. Expert Syst Appl 213:119240. https://doi.org/10.1016/j.eswa.2022.119240

  14. Zadeh A, Liang PP, Mazumder N, Poria S, Cambria E, Morency LP (2018) Memory fusion network for multi-view sequential learning. AAAI 2018:5634–5641

  15. Gu Y, Yang K, Fu S, Chen S, Li X, Marsic I (2018) Multimodal affective analysis using hierarchical attention strategy with word-level alignment. ACL 2018:2225–2235

  16. Liang PP, Liu Z, Zadeh A, Morency LP (2018) Multimodal language analysis with recurrent multistage fusion. EMNLP 2018:150–161

  17. Tsai YHH, Bai S, Liang PP, Kolter JZ, Morency LP, Salakhutdinov R (2019) Multimodal transformer for unaligned multimodal language sequences. ACL 2019:6558–6569

  18. Wu T, Peng J, Zhang W, Zhang H, Tan S, Yi F, Ma C, Huang Y (2022) Video sentiment analysis with bimodal information-augmented multi-head attention. Knowl-Based Syst 235:107676. https://doi.org/10.1016/j.knosys.2021.107676

  19. Shi P, Hu M, Ren F, Shi X, Xu L (2022) Learning modality-fused representation based on transformer for emotion analysis. J Electron Imaging 31(6):063032–063032. https://doi.org/10.1117/1.JEI.31.6.063032

  20. Zeng Y, Li Z, Chen Z, Ma H (2024) A feature-based restoration dynamic interaction network for multimodal sentiment analysis. Eng Appl Artif Intell 127(B):107335. https://doi.org/10.1016/j.engappai.2023.107335

  21. Zadeh AB, Liang PP, Poria S, Cambria E, Morency LP (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. ACL 2018:2236–2246

  22. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) IEMOCAP: Interactive emotional dyadic motion capture database. Lang Resour Eval 42:335–359. https://doi.org/10.1007/s10579-008-9076-6

  23. Pandey A, Vishwakarma DK (2023) Progress, Achievements, and Challenges in Multimodal Sentiment Analysis Using Deep Learning: A Survey. Appl Soft Comput 152:111206. https://doi.org/10.1016/j.asoc.2023.111206

  24. Gandhi A, Adhvaryu K, Poria S, Cambria E, Hussain A (2023) Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf Fusion 91:424–444. https://doi.org/10.1016/j.inffus.2022.09.025

  25. Gkoumas D, Li Q, Lioma C, Yu Y, Song D (2021) What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis. Inf Fusion 66:184–197. https://doi.org/10.1016/j.inffus.2020.09.005

  26. Kossaifi J, Lipton ZC, Kolbeinsson A, Khanna A, Furlanello T, Anandkumar A (2020) Tensor regression networks. J Mach Learn Res 21(123):1–21

  27. Barezi EJ, Fung P (2019) Modality-based factorization for multimodal fusion. ACL 2019:260–269

  28. Zadeh A, Chen M, Poria S, Cambria E, Morency LP (2017) Tensor fusion network for multimodal sentiment analysis. In: EMNLP 2017, pp 1103–1114. https://doi.org/10.18653/v1/d17-1115

  29. Liu Z, Shen Y, Lakshminarasimhan VB, Liang PP, Zadeh A, Morency LP (2018) Efficient low-rank multimodal fusion with modality-specific factors. In: ACL 2018, pp 2247–2256. https://doi.org/10.18653/v1/p18-1209

  30. Kumar A, Vepa J (2020) Gated mechanism for attention based multi modal sentiment analysis. ICASSP 2020:4477–4481. https://doi.org/10.1109/ICASSP40776.2020.9053012

  31. Wu Y, Zhao Y, Yang H, Chen S, Qin B, Cao X, Zhao W (2022) Sentiment word aware multimodal refinement for multimodal sentiment analysis with asr errors. ACL 2022:1397–1406

  32. Mai S, Hu H, Xu J, Xing S (2022) Multi-fusion residual memory network for multimodal human sentiment comprehension. IEEE Trans Affect Comput 13(1):320–334. https://doi.org/10.1109/TAFFC.2020.3000510

  33. Wang Y, Shen Y, Liu Z, Liang PP, Zadeh A, Morency LP (2019) Words can shift: Dynamically adjusting word representations using nonverbal behaviors. AAAI 2019:7216–7223. https://doi.org/10.1609/aaai.v33i01.33017216

  34. Lin Z, Liang B, Long Y, Dang Y, Yang M, Zhang M, Xu R (2022) Modeling intra-and inter-modal relations: Hierarchical graph contrastive learning for multimodal sentiment analysis. COLING 2022:7124–7135

  35. Mai S, Zeng Y, Zheng S, Hu H (2023) Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans Affect Comput 14(3):2276–2289. https://doi.org/10.1109/TAFFC.2022.3172360

  36. Tsai YHH, Liang PP, Zadeh A, Morency LP, Salakhutdinov R (2019) Learning factorized multimodal representations. In: ICLR 2019

  37. Sun Z, Sarma P, Sethares W, Liang Y (2020) Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. AAAI 2020:8992–8999. https://doi.org/10.1609/aaai.v34i05.6431

  38. Hazarika D, Zimmermann R, Poria S (2020) MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In: MM 2020, pp 1122–1131. https://doi.org/10.1145/3394171.3413678

  39. Yu W, Xu H, Yuan Z, Wu J (2021) Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. AAAI 2021:10790–10797. https://doi.org/10.1609/aaai.v35i12.17289

  40. Peng J, Wu T, Zhang W, Cheng F, Tan S, Yi F, Huang Y (2023) A fine-grained modal label-based multi-stage network for multimodal sentiment analysis. Expert Syst Appl 221:119721. https://doi.org/10.1016/j.eswa.2023.119721

  41. He J, Mai S, Hu H (2021) A unimodal reinforced transformer with time squeeze fusion for multimodal sentiment analysis. IEEE Signal Process Lett 28:992–996. https://doi.org/10.1109/LSP.2021.3078074

  42. Rahman W, Hasan MK, Lee S, Zadeh A, Mao C, Morency LP, Hoque E (2020) Integrating multimodal information in large pretrained transformers. In: ACL 2020, p 2359

  43. Pham H, Liang PP, Manzini T, Morency LP, Póczos B (2019) Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019:6892–6899. https://doi.org/10.1609/aaai.v33i01.33016892

  44. Yu J, Jiang J, Xia R (2019) Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans Audio, Speech, and Lang Process 28:429–439. https://doi.org/10.1109/TASLP.2019.2957872

  45. Jiang D, Liu H, Wei R, Tu G (2023) CSAT-FTCN: a fuzzy-oriented model with contextual self-attention network for multimodal emotion recognition. Cognit Comput 15:1082–1091. https://doi.org/10.1007/s12559-023-10119-6

  46. Zeng J, Zhou J, Liu T (2022) Mitigating inconsistencies in multimodal sentiment analysis under uncertain missing modalities. EMNLP 2022:2924–2934

  47. Yang B, Shao B, Wu L, Lin X (2022) Multimodal sentiment analysis with unidirectional modality translation. Neurocomputing 467:130–137. https://doi.org/10.1016/j.neucom.2021.09.041

  48. He J, Hu H (2021) MF-BERT: Multimodal fusion in pre-trained BERT for sentiment analysis. IEEE Signal Process Lett 29:454–458. https://doi.org/10.1109/LSP.2021.3139856

  49. Wen H, You S, Fu Y (2021) Cross-modal context-gated convolution for multi-modal sentiment analysis. Pattern Recognit Lett 146:252–259. https://doi.org/10.1016/j.patrec.2021.03.025

  50. Zhang S, Yin C, Yin Z (2022) Multimodal sentiment recognition with multi-task learning. IEEE Trans Emerg Top Computat Intell 7(1):200–209. https://doi.org/10.1109/TETCI.2022.3224929

  51. Dhanith P, Surendiran B, Rohith G, Kanmani SR, Devi KV (2024) A sparse self-attention enhanced model for aspect-level sentiment classification. Neural Process Lett 56(2):1–21. https://doi.org/10.1007/s11063-024-11513-3

  52. Catelli R, Fujita H, De Pietro G, Esposito M (2022) Deceptive reviews and sentiment polarity: Effective link by exploiting BERT. Expert Syst Appl 209:118290. https://doi.org/10.1016/j.eswa.2022.118290

  53. Chen Q, Huang G, Wang Y (2022) The weighted cross-modal attention mechanism with sentiment prediction auxiliary task for multimodal sentiment analysis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30:2689–2695. https://doi.org/10.1109/TASLP.2022.3192728

  54. Zhao X, Chen Y, Liu S, Tang B (2022) Shared-private memory networks for multimodal sentiment analysis. IEEE Trans Affect Comput 14(4):2889–2900. https://doi.org/10.1109/TAFFC.2022.3222023

  55. Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. ICML 2006:369–376

  56. Wang D, Guo X, Tian Y, Liu J, He L, Luo X (2023) TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit 136:109259. https://doi.org/10.1016/j.patcog.2022.109259

Acknowledgements

This work is supported by the National Key Research and Development Program of China (Grant No. 2022YFC3301804), the Humanities and Social Sciences Youth Foundation, Ministry of Education of China (Grant No.20YJCZH172), the China Postdoctoral Science Foundation (Grant No. 2019M651262), the Heilongjiang Provincial Postdoctoral Science Foundation (Grant No. LBH-Z19015), and the National Natural Science Foundation of China (No. 61672179).

Author information

Authors and Affiliations

Authors

Contributions

Cong Liu: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Yong Wang: Conceptualization, Resources, Funding acquisition, Supervision, Writing - review & editing. Jing Yang: Supervision, Funding acquisition, Writing - review & editing.

Corresponding author

Correspondence to Yong Wang.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Positional embedding

Unlike temporally dependent models such as CNNs and RNNs, the transformer relies entirely on the multi-head attention mechanism and ignores the temporal order of a sequence; it therefore produces the same result for any ordering of the same temporal sequence. To address this issue, consistent with [17], we embed position information in the temporal sequence. Specifically, for the temporal sequence \({\textbf {X}}\in \mathbb {R}^{T\times {d}}\), we encode the position information with sine and cosine functions whose frequencies are determined by the feature index. The encoding process is shown in (A1) and (A2):

$$PE[pos,2j] = \sin\left(\frac{pos}{10000^{\frac{2j}{d}}}\right)$$
(A1)
$$PE[pos,2j+1] = \cos\left(\frac{pos}{10000^{\frac{2j}{d}}}\right)$$
(A2)

where \(pos=1,2,\ldots ,T\) is the position in the temporal sequence and \(j=0,1,\ldots ,\lfloor {\frac{d}{2}}\rfloor \) indexes the feature dimensions of d. The positional encoding generated by positional embedding is therefore sinusoidal. During the intra-modal feature extraction and interaction stage and the inter-modal multi-attention interaction stage, position information is added to the sequence by summing the encoded positions with the temporal sequence via (9) and (12).
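
As a small, self-contained sketch (using NumPy; the helper name positional_encoding is ours, not from the paper), the encoding in (A1)-(A2) can be computed and added to a sequence as follows:

```python
import numpy as np


def positional_encoding(T: int, d: int) -> np.ndarray:
    """Sinusoidal positional encoding per (A1)-(A2): even feature indices 2j
    use sin, odd indices 2j+1 use cos, with frequency 1 / 10000^(2j/d)."""
    pe = np.zeros((T, d))
    pos = np.arange(T)[:, None]                # positions (the paper counts 1..T)
    even = np.arange(0, d, 2)[None, :]         # even feature indices 2j
    angle = pos / np.power(10000.0, even / d)  # pos / 10000^(2j/d)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle[:, : d // 2])   # drop last column if d is odd
    return pe


# A sequence X of shape (T, d) then carries position information by simple
# addition, as done via (9) and (12) in the paper:
# X_pe = X + positional_encoding(T, d)
```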

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, C., Wang, Y. & Yang, J. A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis. Appl Intell 54, 8415–8441 (2024). https://doi.org/10.1007/s10489-024-05623-7
