Abstract
In this paper, we design a multi-token fusion (MTF) framework that processes inter-modality and intra-modality information in parallel for multimodal sentiment analysis. Specifically, we propose a tri-token transformer (TT) module that extracts three tokens from each modality: one retains the unimodal feature, while the other two learn multimodal features from the remaining two modalities, respectively. A module based on hierarchical element-wise self-attention (HESA) then processes the three tokens that TT extracts for each modality, so that the important elements of each token receive more attention. Finally, extensive experiments on two public datasets demonstrate the effectiveness and scalability of our network.
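Since only the abstract is available here, the following PyTorch sketch merely illustrates the tri-token idea as the abstract describes it: `TriTokenBlock`, `ElementWiseAttention`, and all shapes and hyperparameters are hypothetical stand-ins, not the authors' MTF, TT, or HESA implementations.

```python
# A minimal, hypothetical sketch of the tri-token idea from the abstract.
# One self-attended token keeps the unimodal feature; two cross-attended
# tokens each query one of the other two modalities.
import torch
import torch.nn as nn

class TriTokenBlock(nn.Module):
    """For one modality: one self-attended token plus two cross-attended
    tokens, one per remaining modality (illustrative, not the paper's TT)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(3, dim))  # learnable [uni, cross_a, cross_b]
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_b = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, other_a, other_b):
        # x, other_a, other_b: (batch, seq_len, dim) sequences of three modalities
        b = x.size(0)
        t = self.tokens.unsqueeze(0).expand(b, -1, -1)
        uni, _ = self.self_attn(t[:, :1], x, x)            # attends to its own modality
        ca, _ = self.cross_a(t[:, 1:2], other_a, other_a)  # attends to modality A
        cb, _ = self.cross_b(t[:, 2:3], other_b, other_b)  # attends to modality B
        return torch.cat([uni, ca, cb], dim=1)             # (batch, 3, dim)

class ElementWiseAttention(nn.Module):
    """Sigmoid gate that reweights individual token elements -- a simple
    stand-in for the paper's hierarchical element-wise self-attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, tokens):             # (batch, 3, dim)
        return tokens * self.gate(tokens)  # emphasise important elements

# Usage: one branch fusing text/audio/vision (dims are illustrative).
if __name__ == "__main__":
    dim = 64
    text, audio, vision = (torch.randn(2, 10, dim) for _ in range(3))
    tt, hesa = TriTokenBlock(dim), ElementWiseAttention(dim)
    fused = hesa(tt(text, audio, vision))  # (2, 3, 64) tokens for the text branch
    print(fused.shape)
```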
Acknowledgements
This work is supported by the Youth Talent Support Programme of the Guangdong Provincial Association for Science and Technology (No. SKXRC202305) and the Huangpu International Sci & Tech Cooperation Foundation under Grant 2021GH12.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Long, Z., Deng, H., Yang, Z., Liu, W. (2024). Multi-token Fusion Framework for Multimodal Sentiment Analysis. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14332. Springer, Singapore. https://doi.org/10.1007/978-981-97-2390-4_29
DOI: https://doi.org/10.1007/978-981-97-2390-4_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2389-8
Online ISBN: 978-981-97-2390-4