
Multi-token Fusion Framework for Multimodal Sentiment Analysis

  • Conference paper
  • In: Web and Big Data (APWeb-WAIM 2023)

Abstract

In this paper, we design a multi-token fusion (MTF) framework that processes inter-modality and intra-modality information in parallel for multimodal sentiment analysis. Specifically, a tri-token transformer (TT) module is proposed to extract three tokens from each modality: one retains the unimodal feature, while the other two learn multimodal features from the other two modalities, respectively. Furthermore, a hierarchical element-wise self-attention (HESA) module processes the three tokens of each modality extracted by TT, so that the important elements of the tokens receive more attention. Finally, we conduct extensive experiments on two public datasets, which demonstrate the effectiveness and scalability of our framework.
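To make the described pipeline concrete, the PyTorch sketch below illustrates one plausible reading of the abstract: a tri-token block that pools one unimodal token and two cross-modal tokens per modality, followed by an element-wise attention that reweights token elements before fusion. All class names, dimensions, and the exact pooling and weighting schemes are assumptions made for illustration; they are not the authors' implementation.

```python
# Minimal sketch of the tri-token + element-wise attention idea (assumed design).
import torch
import torch.nn as nn


class TriTokenBlock(nn.Module):
    """For one modality, produce three tokens: one that retains the unimodal
    feature (self-attention pooling) and two that attend to the other two
    modalities (cross-attention). Names and structure are illustrative only."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable query used to pool each sequence into a single token.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, x, other_a, other_b):
        q = self.query.expand(x.size(0), -1, -1)              # (B, 1, d)
        uni, _ = self.self_attn(q, x, x)                       # unimodal token
        cross_a, _ = self.cross_attn_a(q, other_a, other_a)    # token informed by modality A
        cross_b, _ = self.cross_attn_b(q, other_b, other_b)    # token informed by modality B
        return torch.cat([uni, cross_a, cross_b], dim=1)       # (B, 3, d)


class ElementWiseAttention(nn.Module):
    """Give important elements of the three tokens more weight before fusing
    them (a guess at what element-wise self-attention could look like)."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.score = nn.Linear(d_model, d_model)

    def forward(self, tokens):                    # tokens: (B, 3, d)
        weights = torch.sigmoid(self.score(tokens))
        return (tokens * weights).sum(dim=1)      # (B, d) fused representation


if __name__ == "__main__":
    B, d = 2, 128
    text = torch.randn(B, 20, d)    # toy text sequence
    audio = torch.randn(B, 50, d)   # toy audio sequence
    video = torch.randn(B, 30, d)   # toy video sequence

    tt, fuse = TriTokenBlock(d), ElementWiseAttention(d)
    text_repr = fuse(tt(text, audio, video))      # text-centric fused feature
    print(text_repr.shape)                        # torch.Size([2, 128])
```

In this reading, the same block would be applied once per modality (text-, audio-, and video-centric), and the three fused vectors would feed a sentiment regression or classification head; how the paper actually combines them is not specified in the abstract.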


Notes

  1. iMotions 2017. https://imotions.com/


Acknowledgements

This work is supported by the Youth Talent Support Programme of Guangdong Provincial Association for Science and Technology (No. SKXRC202305) and the Huangpu International Sci & Tech Cooperation Foundation under Grant 2021GH12.

Author information


Corresponding author

Correspondence to Zhenguo Yang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Long, Z., Deng, H., Yang, Z., Liu, W. (2024). Multi-token Fusion Framework for Multimodal Sentiment Analysis. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14332. Springer, Singapore. https://doi.org/10.1007/978-981-97-2390-4_29


  • DOI: https://doi.org/10.1007/978-981-97-2390-4_29

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2389-8

  • Online ISBN: 978-981-97-2390-4

  • eBook Packages: Computer Science, Computer Science (R0)
