
Multi-token Fusion Framework for Multimodal Sentiment Analysis

  • Conference paper
  • In: Web and Big Data (APWeb-WAIM 2023)

Abstract

In this paper, we design a multi-token fusion (MTF) framework that processes inter-modality and intra-modality information in parallel for multimodal sentiment analysis. Specifically, a tri-token transformer (TT) module is proposed to extract three tokens from each modality: one retains the unimodal feature, while the other two learn multimodal features from the other two modalities, respectively. Furthermore, a hierarchical element-wise self-attention (HESA) module processes the three tokens of each modality extracted by TT, so that the important elements of the tokens receive more attention. Finally, we conduct extensive experiments on two public datasets, which demonstrate the effectiveness and scalability of our framework.
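To make the described pipeline concrete, the PyTorch sketch below illustrates one plausible reading of the abstract: a tri-token block that pools one unimodal token and two cross-modal tokens per modality, followed by an element-wise attention that reweights token elements before fusion. All class names, dimensions, and the exact pooling and weighting schemes are assumptions made for illustration; they are not the authors' implementation.

```python
# Minimal sketch of the tri-token + element-wise attention idea (assumed design).
import torch
import torch.nn as nn


class TriTokenBlock(nn.Module):
    """For one modality, produce three tokens: one that retains the unimodal
    feature (self-attention pooling) and two that attend to the other two
    modalities (cross-attention). Names and structure are illustrative only."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable query used to pool each sequence into a single token.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, x, other_a, other_b):
        q = self.query.expand(x.size(0), -1, -1)              # (B, 1, d)
        uni, _ = self.self_attn(q, x, x)                       # unimodal token
        cross_a, _ = self.cross_attn_a(q, other_a, other_a)    # token informed by modality A
        cross_b, _ = self.cross_attn_b(q, other_b, other_b)    # token informed by modality B
        return torch.cat([uni, cross_a, cross_b], dim=1)       # (B, 3, d)


class ElementWiseAttention(nn.Module):
    """Give important elements of the three tokens more weight before fusing
    them (a guess at what element-wise self-attention could look like)."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.score = nn.Linear(d_model, d_model)

    def forward(self, tokens):                    # tokens: (B, 3, d)
        weights = torch.sigmoid(self.score(tokens))
        return (tokens * weights).sum(dim=1)      # (B, d) fused representation


if __name__ == "__main__":
    B, d = 2, 128
    text = torch.randn(B, 20, d)    # toy text sequence
    audio = torch.randn(B, 50, d)   # toy audio sequence
    video = torch.randn(B, 30, d)   # toy video sequence

    tt, fuse = TriTokenBlock(d), ElementWiseAttention(d)
    text_repr = fuse(tt(text, audio, video))      # text-centric fused feature
    print(text_repr.shape)                        # torch.Size([2, 128])
```

In this reading, the same block would be applied once per modality (text-, audio-, and video-centric), and the three fused vectors would feed a sentiment regression or classification head; how the paper actually combines them is not specified in the abstract.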


Notes

  1. iMotions 2017. https://imotions.com/


Acknowledgements

This work is supported by the Youth Talent Support Programme of Guangdong Provincial Association for Science and Technology (No. SKXRC202305) and the Huangpu International Sci & Tech Cooperation Foundation under Grant 2021GH12.

Author information


Corresponding author

Correspondence to Zhenguo Yang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Long, Z., Deng, H., Yang, Z., Liu, W. (2024). Multi-token Fusion Framework for Multimodal Sentiment Analysis. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14332. Springer, Singapore. https://doi.org/10.1007/978-981-97-2390-4_29


  • DOI: https://doi.org/10.1007/978-981-97-2390-4_29

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2389-8

  • Online ISBN: 978-981-97-2390-4

  • eBook Packages: Computer Science, Computer Science (R0)
