Abstract
Feature fusion for multimodal sentiment analysis is a challenging but worthwhile research topic. As sequences extend along the time dimension, multimodal signals interact with one another, and a lack of control over the target modal representation during fusion causes erroneous shifts of the vectors in the feature space. Moreover, ignoring how target modal features are represented under different fusion orders may lead to insufficient fusion of complementary information. To address these issues, this paper proposes a transformer-encoder-based multimodal multi-attention fusion network. The model constructs a multi-attention fusion transformer-encoder to learn inter-modally consistent features and enhance the representation of the target modal features. Meanwhile, for each target modality, the model builds multi-attention fusion transformer-encoders with different fusion orders to capture the complementary features among sequences fused in those orders. The three target modal representations, which contain both consistent and complementary features, are then fused with the initial features through residual connections to guide the final sentiment prediction. We conduct extensive experiments on three public multimodal datasets. The results show that our approach outperforms the compared multimodal sentiment analysis methods on most metrics and can explain the contributions of inter- and intra-modal interactions across the modalities.
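To make the multi-attention fusion and fusion-order idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released implementation): a cross-modal transformer-encoder block whose queries come from the target modality is applied under two fusion orders, and the results are combined with the initial target features through a residual connection. All module names, dimensions, and the way the two orders are combined are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiAttentionFusionEncoder(nn.Module):
    """Hypothetical cross-modal transformer-encoder block: the target modality
    queries a source modality so its representation absorbs inter-modal features."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, target, source):
        h, _ = self.attn(query=target, key=source, value=source)
        target = self.norm1(target + h)                    # residual around attention
        return self.norm2(target + self.ffn(target))       # residual around feed-forward

class FusionOrderBlock(nn.Module):
    """Fuses a target modality with the two other modalities under both fusion
    orders and adds the initial target features via a residual connection."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.enc_ab = MultiAttentionFusionEncoder(d_model)
        self.enc_ba = MultiAttentionFusionEncoder(d_model)

    def forward(self, target, mod_a, mod_b):
        fused_ab = self.enc_ab(self.enc_ab(target, mod_a), mod_b)   # order: A then B
        fused_ba = self.enc_ba(self.enc_ba(target, mod_b), mod_a)   # order: B then A
        return target + fused_ab + fused_ba                         # residual with initial features

# Example: text as the target modality, audio and vision as sources (B=8, T=20, d=128)
text, audio, vision = (torch.randn(8, 20, 128) for _ in range(3))
text_out = FusionOrderBlock()(text, audio, vision)
```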






Data Availability
The CMU-MOSI and CMU-MOSEI datasets that support the findings of this study are available in the data repository: https://github.com/thuiar/Self-MM. The IEMOCAP dataset that supports the findings of this study is available in the data repository: https://github.com/yaohungt/Multimodal-Transformer.
Code Availability
Not applicable.
References
Yadollahi A, Shahraki AG, Zaiane OR (2017) Current state of text sentiment analysis from opinion to emotion mining. ACM Comput Surv 50(2):1–33. https://doi.org/10.1145/3057270
Hu J, Peng J, Zhang W, Qi L, Hu M, Zhang H (2021) An intention multiple-representation model with expanded information. Comput Speech & Lang 68:101196. https://doi.org/10.1016/j.csl.2021.101196
Huang B, Zhang J, Ju J, Guo R, Fujita H, Liu J (2023) CRF-GCN: An effective syntactic dependency model for aspect-level sentiment analysis. Knowl-Based Syst 260:110125. https://doi.org/10.1016/j.knosys.2022.110125
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019:4171–4186
Stöckli S, Schulte-Mecklenbeck M, Borer S, Samson AC (2018) Facial expression analysis with affdex and facet: A validation study. Behav Res Methods 50:1446–1460. https://doi.org/10.3758/s13428-017-0996-1
Degottex G, Kane J, Drugman T, Raitio T, Scherer S (2014) COVAREP-A collaborative voice analysis repository for speech technologies. ICASSP 2014:960–964. https://doi.org/10.1109/ICASSP.2014.6853739
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
SravyaPranati B, Suma D, ManjuLatha C, Putheti S (2014) Large-scale video classification with convolutional neural networks. CVPR 2014:1725–1732
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:5999–6009
Wang F, Tian S, Yu L, Liu J, Wang J, Li K, Wang Y (2023) TEDT: transformer-based encoding-decoding translation network for multimodal sentiment analysis. Cognit Comput 15(1):289–303. https://doi.org/10.1007/s12559-022-10073-9
Zhang F, Li XC, Lim CP, Hua Q, Dong CR, Zhai JH (2022) Deep emotional arousal network for multimodal sentiment analysis and emotion recognition. Inf Fusion 88:296–304. https://doi.org/10.1016/j.inffus.2022.07.006
Zhu L, Zhu Z, Zhang C, Xu Y, Kong X (2023) Multimodal sentiment analysis based on fusion methods: A survey. Inf Fusion 95:306–325. https://doi.org/10.1016/j.inffus.2023.02.028
Zeng Y, Li Z, Tang Z, Chen Z, Ma H (2023) Heterogeneous graph convolution based on In-domain Self-supervision for Multimodal Sentiment Analysis. Expert Syst Appl 213:119240. https://doi.org/10.1016/j.eswa.2022.119240
Zadeh A, Liang PP, Mazumder N, Poria S, Cambria E, Morency LP (2018) Memory fusion network for multi-view sequential learning. AAAI 2018:5634–5641
Gu Y, Yang K, Fu S, Chen S, Li X, Marsic I (2018) Multimodal affective analysis using hierarchical attention strategy with word-level alignment. ACL 2018:2225–2235
Liang PP, Liu Z, Zadeh A, Morency LP (2018) Multimodal language analysis with recurrent multistage fusion. EMNLP 2018:150–161
Tsai YHH, Bai S, Liang PP, Kolter JZ, Morency LP, Salakhutdinov R (2019) Multimodal transformer for unaligned multimodal language sequences. ACL 2019:6558–6569
Wu T, Peng J, Zhang W, Zhang H, Tan S, Yi F, Ma C, Huang Y (2022) Video sentiment analysis with bimodal information-augmented multi-head attention. Knowl-Based Syst 235:107676. https://doi.org/10.1016/j.knosys.2021.107676
Shi P, Hu M, Ren F, Shi X, Xu L (2022) Learning modality-fused representation based on transformer for emotion analysis. J Electron Imaging 31(6):063032–063032. https://doi.org/10.1117/1.JEI.31.6.063032
Zeng Y, Li Z, Chen Z, Ma H (2024) A feature-based restoration dynamic interaction network for multimodal sentiment analysis. Eng Appl Artif Intell 127(B):107335. https://doi.org/10.1016/j.engappai.2023.107335
Zadeh AB, Liang PP, Poria S, Cambria E, Morency LP (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. ACL 2018:2236–2246
Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) IEMOCAP: Interactive emotional dyadic motion capture database. Lang Resour Eval 42:335–359. https://doi.org/10.1007/s10579-008-9076-6
Pandey A, Vishwakarma DK (2023) Progress, Achievements, and Challenges in Multimodal Sentiment Analysis Using Deep Learning: A Survey. Appl Soft Comput 152:111206. https://doi.org/10.1016/j.asoc.2023.111206
Gandhi A, Adhvaryu K, Poria S, Cambria E, Hussain A (2023) Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf Fusion 91:424–444. https://doi.org/10.1016/j.inffus.2022.09.025
Gkoumas D, Li Q, Lioma C, Yu Y, Song D (2021) What makes the difference? An empirical comparison of fusion strategies for multimodal language analysis. Inf Fusion 66:184–197. https://doi.org/10.1016/j.inffus.2020.09.005
Kossaifi J, Lipton ZC, Kolbeinsson A, Khanna A, Furlanello T, Anandkumar A (2020) Tensor regression networks. J Mach Learn Res 21(123):1–21
Barezi EJ, Fung P (2019) Modality-based factorization for multimodal fusion. ACL 2019:260–269
Zadeh A, Chen M, Poria S, Cambria E, Morency LP (2017) Tensor fusion network for multimodal sentiment analysis. In: EMNLP 2017, pp 1103–1114. https://doi.org/10.18653/v1/d17-1115
Liu Z, Shen Y, Lakshminarasimhan VB, Liang PP, Zadeh A, Morency LP (2018) Efficient low-rank multimodal fusion with modality-specific factors. In: ACL 2018, pp 2247–2256. https://doi.org/10.18653/v1/p18-1209
Kumar A, Vepa J (2020) Gated mechanism for attention based multi modal sentiment analysis. ICASSP 2020:4477–4481. https://doi.org/10.1109/ICASSP40776.2020.9053012
Wu Y, Zhao Y, Yang H, Chen S, Qin B, Cao X, Zhao W (2022) Sentiment word aware multimodal refinement for multimodal sentiment analysis with asr errors. ACL 2022:1397–1406
Mai S, Hu H, Xu J, Xing S (2022) Multi-fusion residual memory network for multimodal human sentiment comprehension. IEEE Trans Affect Comput 13(1):320–334. https://doi.org/10.1109/TAFFC.2020.3000510
Wang Y, Shen Y, Liu Z, Liang PP, Zadeh A, Morency LP (2019) Words can shift: Dynamically adjusting word representations using nonverbal behaviors. AAAI 2019:7216–7223. https://doi.org/10.1609/aaai.v33i01.33017216
Lin Z, Liang B, Long Y, Dang Y, Yang M, Zhang M, Xu R (2022) Modeling intra-and inter-modal relations: Hierarchical graph contrastive learning for multimodal sentiment analysis. COLING 2022:7124–7135
Mai S, Zeng Y, Zheng S, Hu H (2023) Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans Affect Comput 14(3):2276–2289. https://doi.org/10.1109/TAFFC.2022.3172360
Tsai YHH, Liang PP, Zadeh A, Morency LP, Salakhutdinov R (2019) Learning factorized multimodal representations. In: ICLR 2019
Sun Z, Sarma P, Sethares W, Liang Y (2020) Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. AAAI 2020:8992–8999. https://doi.org/10.1609/aaai.v34i05.6431
Hazarika D, Zimmermann R, Poria S (2020) MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In: MM 2020, pp 1122–1131. https://doi.org/10.1145/3394171.3413678
Yu W, Xu H, Yuan Z, Wu J (2021) Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. AAAI 2021:10790–10797. https://doi.org/10.1609/aaai.v35i12.17289
Peng J, Wu T, Zhang W, Cheng F, Tan S, Yi F, Huang Y (2023) A fine-grained modal label-based multi-stage network for multimodal sentiment analysis. Expert Syst Appl 221:119721. https://doi.org/10.1016/j.eswa.2023.119721
He J, Mai S, Hu H (2021) A unimodal reinforced transformer with time squeeze fusion for multimodal sentiment analysis. IEEE Signal Process Lett 28:992–996. https://doi.org/10.1109/LSP.2021.3078074
Rahman W, Hasan MK, Lee S, Zadeh A, Mao C, Morency LP, Hoque E (2020) Integrating multimodal information in large pretrained transformers. In: ACL 2020, p 2359
Pham H, Liang PP, Manzini T, Morency LP, Póczos B (2019) Found in translation: Learning robust joint representations by cyclic translations between modalities. AAAI 2019:6892–6899. https://doi.org/10.1609/aaai.v33i01.33016892
Yu J, Jiang J, Xia R (2019) Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans Audio, Speech, and Lang Process 28:429–439. https://doi.org/10.1109/TASLP.2019.2957872
Jiang D, Liu H, Wei R, Tu G (2023) CSAT-FTCN: a fuzzy-oriented model with contextual self-attention network for multimodal emotion recognition. Cognit Comput 15:1082–1091. https://doi.org/10.1007/s12559-023-10119-6
Zeng J, Zhou J, Liu T (2022) Mitigating inconsistencies in multimodal sentiment analysis under uncertain missing modalities. EMNLP 2022:2924–2934
Yang B, Shao B, Wu L, Lin X (2022) Multimodal sentiment analysis with unidirectional modality translation. Neurocomputing 467:130–137. https://doi.org/10.1016/j.neucom.2021.09.041
He J, Hu H (2021) MF-BERT: Multimodal fusion in pre-trained BERT for sentiment analysis. IEEE Signal Process Lett 29:454–458. https://doi.org/10.1109/LSP.2021.3139856
Wen H, You S, Fu Y (2021) Cross-modal context-gated convolution for multi-modal sentiment analysis. Pattern Recognit Lett 146:252–259. https://doi.org/10.1016/j.patrec.2021.03.025
Zhang S, Yin C, Yin Z (2022) Multimodal sentiment recognition with multi-task learning. IEEE Trans Emerg Top Computat Intell 7(1):200–209. https://doi.org/10.1109/TETCI.2022.3224929
Dhanith P, Surendiran B, Rohith G, Kanmani SR, Devi KV (2024) A sparse self-attention enhanced model for aspect-level sentiment classification. Neural Process Lett 56(2):1–21. https://doi.org/10.1007/s11063-024-11513-3
Catelli R, Fujita H, De Pietro G, Esposito M (2022) Deceptive reviews and sentiment polarity: Effective link by exploiting BERT. Expert Syst Appl 209:118290. https://doi.org/10.1016/j.eswa.2022.118290
Chen Q, Huang G, Wang Y (2022) The weighted cross-modal attention mechanism with sentiment prediction auxiliary task for multimodal sentiment analysis. IEEE/ACM Trans Audio, Speech, and Lang Process 30:2689–2695. https://doi.org/10.1109/TASLP.2022.3192728
Zhao X, Chen Y, Liu S, Tang B (2022) Shared-private memory networks for multimodal sentiment analysis. IEEE Trans Affect Comput 14(4):2889–2900. https://doi.org/10.1109/TAFFC.2022.3222023
Graves A, Fernández S, Gomez F, Schmidhuber J (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. ICML 2006:369–376
Wang D, Guo X, Tian Y, Liu J, He L, Luo X (2023) TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit 136:109259. https://doi.org/10.1016/j.patcog.2022.109259
Acknowledgements
This work is supported by the National Key Research and Development Program of China (Grant No. 2022YFC3301804), the Humanities and Social Sciences Youth Foundation, Ministry of Education of China (Grant No.20YJCZH172), the China Postdoctoral Science Foundation (Grant No. 2019M651262), the Heilongjiang Provincial Postdoctoral Science Foundation (Grant No. LBH-Z19015), and the National Natural Science Foundation of China (No. 61672179).
Author information
Authors and Affiliations
Contributions
Cong Liu: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Yong Wang: Conceptualization, Resources, Funding acquisition, Supervision, Writing - review & editing. Jing Yang: Supervision, Funding acquisition, Writing - review & editing.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Positional embedding
Compared with temporally ordered models such as CNNs and RNNs, the transformer relies entirely on the multi-head attention mechanism and ignores the temporal order of the sequence. Consequently, the transformer produces the same result for different orderings of the same temporal sequence. To address this issue, consistent with [17], we embed position information into the temporal sequence. Specifically, for the temporal sequence \({\textbf {X}}\in \mathbb {R}^{T\times {d}}\), we encode the position information with sine and cosine functions whose frequencies are determined by the feature index. The encoding process is shown in (A1) and (A2):
\(PE_{(pos,2j)}=\sin \left( pos/10000^{2j/d}\right) \)   (A1)
\(PE_{(pos,2j+1)}=\cos \left( pos/10000^{2j/d}\right) \)   (A2)
where \(pos=1,2,\ldots ,T\) is the position in the temporal sequence and \(j=0,1,\ldots ,\lfloor {\frac{d}{2}}\rfloor \) indexes the feature dimension d. The positional encoding generated by positional embedding is therefore sinusoidal. During the intra-modal feature extraction and interaction and the inter-modal multi-attention interaction stages, positional embedding information is added to the sequence by summing the encoded features with the temporal sequence via (9) and (12).
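As an illustration, here is a minimal PyTorch sketch of this sinusoidal positional embedding (an assumption-based reconstruction, not the authors' code; it assumes 0-based positions and an even feature dimension d):

```python
import math
import torch

def sinusoidal_positional_embedding(T: int, d: int) -> torch.Tensor:
    """Sinusoidal positional encoding for a temporal sequence X in R^{T x d}."""
    pe = torch.zeros(T, d)
    position = torch.arange(0, T, dtype=torch.float).unsqueeze(1)      # positions 0..T-1
    # Frequencies decrease geometrically with the feature index j
    div_term = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
    pe[:, 0::2] = torch.sin(position * div_term)                       # even dims: sin, cf. (A1)
    pe[:, 1::2] = torch.cos(position * div_term)                       # odd dims: cos, cf. (A2)
    return pe

# The encoded positions are summed with the feature sequence before the intra- and
# inter-modal attention stages, e.g. X = X + sinusoidal_positional_embedding(T, d)
```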
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, C., Wang, Y. & Yang, J. A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis. Appl Intell 54, 8415–8441 (2024). https://doi.org/10.1007/s10489-024-05623-7
DOI: https://doi.org/10.1007/s10489-024-05623-7