PMFNet: A Progressive Multichannel Fusion Network for Multimodal Sentiment Analysis

  • Conference paper
Neural Information Processing (ICONIP 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1968)

Abstract

The core of multimodal sentiment analysis is finding effective encoding and fusion methods that yield accurate predictions. However, previous works overlook the problems caused by the sampling heterogeneity of the modalities, and their visual-audio fusion does not filter out noise and redundancy progressively. Moreover, current deep learning approaches to multimodal fusion rely on a single fusion channel (either the horizontal position channel or the vertical feature-space channel), whereas models of the human brain highlight the importance of multichannel fusion. In this paper, inspired by the perceptual mechanisms of the human brain studied in neuroscience, we propose a novel framework named Progressive Multichannel Fusion Network (PMFNet) to overcome these problems. PMFNet meets the different processing needs of each modality and enables interaction and integration between modalities at different encoded representation densities, so that modalities are encoded progressively and fused over multiple channels. Extensive experiments on public datasets demonstrate that our method achieves results superior or comparable to state-of-the-art models.
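
To make the idea in the abstract concrete, the sketch below shows one hypothetical way to combine per-modality encoders of different depths with fusion over two channels: a horizontal position (token-mixing) channel and a vertical feature-space channel. It is a minimal illustration assuming aligned, equal-length sequences and toy feature dimensions; the module names, layer choices, and the simple averaging merge are assumptions for illustration, not the authors' PMFNet implementation.

```python
# A minimal, hypothetical sketch of the idea described above (not the authors'
# PMFNet code): each modality gets its own encoder whose depth can differ to
# reflect sampling heterogeneity, and the fused features are mixed over two
# channels -- a horizontal position (token-mixing) channel and a vertical
# feature-space channel. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one modality; num_layers can differ per modality."""

    def __init__(self, in_dim: int, hid_dim: int, num_layers: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        layer = nn.TransformerEncoderLayer(d_model=hid_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, in_dim)
        return self.encoder(self.proj(x))


class MultichannelFusion(nn.Module):
    """Merges modality features, then mixes them along two channels."""

    def __init__(self, hid_dim: int, seq_len: int):
        super().__init__()
        self.token_mix = nn.Linear(seq_len, seq_len)  # position channel
        self.feat_mix = nn.Linear(hid_dim, hid_dim)   # feature-space channel
        self.head = nn.Linear(hid_dim, 1)             # sentiment regression head

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # Simple averaging stands in for the paper's progressive fusion steps.
        x = torch.stack(feats, dim=0).mean(dim=0)                  # (batch, seq, hid)
        x = x + self.token_mix(x.transpose(1, 2)).transpose(1, 2)  # mix across positions
        x = x + self.feat_mix(x)                                   # mix across features
        return self.head(x.mean(dim=1))                            # (batch, 1) score


if __name__ == "__main__":
    # Toy aligned sequences; 768/74/35 echo typical BERT/COVAREP/facial feature sizes.
    text, audio, video = (torch.randn(2, 20, d) for d in (768, 74, 35))
    encoders = [ModalityEncoder(d, 64, n) for d, n in [(768, 1), (74, 3), (35, 3)]]
    fusion = MultichannelFusion(hid_dim=64, seq_len=20)
    score = fusion([enc(x) for enc, x in zip(encoders, (text, audio, video))])
    print(score.shape)  # torch.Size([2, 1])
```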

Author information

Corresponding author

Correspondence to Chuanqi Tao.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, J., Tao, C., Guan, D. (2024). PMFNet: A Progressive Multichannel Fusion Network for Multimodal Sentiment Analysis. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1968. Springer, Singapore. https://doi.org/10.1007/978-981-99-8181-6_21

  • DOI: https://doi.org/10.1007/978-981-99-8181-6_21

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8180-9

  • Online ISBN: 978-981-99-8181-6

  • eBook Packages: Computer Science, Computer Science (R0)
