Abstract
The core challenge of multimodal sentiment analysis is to find encoding and fusion methods that support accurate predictions. Previous works, however, ignore the problems caused by the sampling heterogeneity of the modalities, and their visual-audio fusion does not progressively filter out noise and redundancy. In addition, current deep learning approaches to multimodal fusion rely on single-channel fusion (horizontal position or vertical space channel), whereas models of the human brain highlight the importance of multichannel fusion. To overcome these problems, and inspired by the perceptual mechanisms of the human brain described in neuroscience, we propose a novel framework named Progressive Multichannel Fusion Network (PMFNet). PMFNet meets the different processing needs of each modality and provides interaction and integration between modalities at different encoded representation densities, so that the modalities are encoded progressively and fused over multiple channels. Extensive experiments on public datasets demonstrate that our method achieves results superior or comparable to those of state-of-the-art models.
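To make the notion of sampling heterogeneity concrete, the following minimal sketch shows three feature sequences for the same utterance that differ in length because each modality is sampled at a different rate, together with a naive average-pooling alignment. This is not PMFNet's method; the function name `average_pool_to_length`, the sequence lengths, and the one-dimensional toy features are all assumptions made purely for illustration.

```python
from typing import List

FeatureSeq = List[List[float]]  # time steps x feature dimensions


def average_pool_to_length(features: FeatureSeq, target_len: int) -> FeatureSeq:
    """Average-pool a feature sequence down to target_len time steps.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n = len(features)
    if target_len <= 0 or n < target_len:
        raise ValueError("target_len must be between 1 and len(features)")
    dim = len(features[0])
    pooled = []
    for i in range(target_len):
        # Integer bin edges that cover the whole sequence without gaps.
        start = i * n // target_len
        end = (i + 1) * n // target_len
        window = features[start:end]
        pooled.append(
            [sum(step[d] for step in window) / len(window) for d in range(dim)]
        )
    return pooled


# Assumed toy lengths for one utterance: 4 word tokens, 12 audio frames,
# 8 video frames. Each feature vector has a single dimension.
text = [[float(t)] for t in range(4)]
audio = [[float(t)] for t in range(12)]
visual = [[float(t)] for t in range(8)]

# Naive alignment: pool the denser modalities down to the text length.
audio_aligned = average_pool_to_length(audio, len(text))
visual_aligned = average_pool_to_length(visual, len(text))

print(len(text), len(audio_aligned), len(visual_aligned))
```

Uniform pooling of this kind treats every frame as equally informative, which is one way to see why noise and redundancy in the denser audio and visual streams can survive a simple alignment step.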
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, J., Tao, C., Guan, D. (2024). PMFNet: A Progressive Multichannel Fusion Network for Multimodal Sentiment Analysis. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1968. Springer, Singapore. https://doi.org/10.1007/978-981-99-8181-6_21
DOI: https://doi.org/10.1007/978-981-99-8181-6_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8180-9
Online ISBN: 978-981-99-8181-6
eBook Packages: Computer Science (R0)