PMFNet: A Progressive Multichannel Fusion Network for Multimodal Sentiment Analysis

Li, Jiaming; Tao, Chuanqi; Guan, Donghai

doi:10.1007/978-981-99-8181-6_21

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1968))

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

The core of multimodal sentiment analysis is finding effective encoding and fusion methods that yield accurate predictions. However, previous works ignore the problems caused by the sampling heterogeneity of modalities, and existing visual-audio fusion does not progressively filter out noise and redundancy. Moreover, current deep learning approaches to multimodal fusion rely on single-channel fusion (the horizontal position channel or the vertical space channel), whereas models of the human brain highlight the importance of multichannel fusion. To overcome these problems, and inspired by the perceptual mechanisms of the human brain studied in neuroscience, we propose a novel framework named Progressive Multichannel Fusion Network (PMFNet). PMFNet meets the different processing needs of each modality and provides interaction and integration between modalities at different encoded representation densities, so that the modalities are better encoded in a progressive manner and fused over multiple channels. Extensive experiments on public datasets demonstrate that our method achieves results superior or comparable to those of state-of-the-art models.

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Jiaming Li, Chuanqi Tao & Donghai Guan
Key Laboratory of Ministry of Industry and Information Technology for Safety-Critical Software, Nanjing University of Aeronautics and Astronautics, Nanjing, China
Chuanqi Tao & Donghai Guan

Corresponding author

Correspondence to Chuanqi Tao.

Editor information

Editors and Affiliations

School of Automation, Central South University, Changsha, China
Biao Luo
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Long Cheng
Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China
Zheng-Guang Wu
School of Automation, Guangdong University of Technology, Guangzhou, China
Hongyi Li
School of Electrical Engineering and Telecommunications, UNSW Sydney, Sydney, NSW, Australia
Chaojie Li

About this paper

Cite this paper

Li, J., Tao, C., Guan, D. (2024). PMFNet: A Progressive Multichannel Fusion Network for Multimodal Sentiment Analysis. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1968. Springer, Singapore. https://doi.org/10.1007/978-981-99-8181-6_21

DOI: https://doi.org/10.1007/978-981-99-8181-6_21
Published: 27 November 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8180-9
Online ISBN: 978-981-99-8181-6
eBook Packages: Computer Science, Computer Science (R0)
