Abstract
Multimodal recommendation systems aim to deliver precise and personalized recommendations by integrating diverse modalities such as text, images, and audio. Despite their potential, these systems often struggle with effective modality fusion strategies and comprehensive modeling of user preferences. To address these issues, we propose the Multifactorial Modality Fusion Network (MMFN). MMFN overcomes the limitations of previous models through three pivotal architectural components. First, it employs three Graph Neural Networks (GNNs) to meticulously extract foundational interactions and semantic information across modalities. Second, a Gated Multi-factor Semantic Sensor applies a series of stacked gating units, guided by interaction embeddings, to deeply extract features from the modal embeddings. Third, a User Preference-Oriented Modality Aligner leverages contrastive learning to synchronize user preferences with item features, enhancing the expressiveness of the embeddings and the overall quality of recommendations. We demonstrate the marked superiority of MMFN in both performance and efficiency compared to traditional collaborative filtering methods and contemporary deep multimodal recommendation systems. Through comprehensive evaluations on the Baby, Sports, and Clothing datasets, MMFN achieves significant gains in Recall@20, with improvements of 2.49%, 8.79%, and 24.51% over the next-best baseline models, respectively. MMFN also leads in training efficiency, outperforming most competing models. MMFN thus paves the way for future multimodal recommendation systems that leverage the full spectrum of deep learning technologies.
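The Gated Multi-factor Semantic Sensor described above can be pictured as a stack of gating units in which the interaction embedding controls how much of each modality's features passes into the fused representation. The following is a minimal NumPy sketch under the assumption of element-wise sigmoid gates; the function and parameter names (`gated_fusion`, `W_g`, `b_g`) are hypothetical illustrations, not the authors' exact formulation.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function, used as the gate activation."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(interaction_emb, modal_embs, gate_params):
    """Fuse modality embeddings through stacked gating units.

    Each unit projects the current fused state (initialized with the
    interaction embedding) to a gate in [0, 1], and that gate decides
    how much of the corresponding modal embedding is admitted.
    """
    fused = interaction_emb.copy()
    for modal, (W_g, b_g) in zip(modal_embs, gate_params):
        gate = sigmoid(fused @ W_g + b_g)   # interaction-guided gate
        fused = fused + gate * modal        # admit gated modal features
    return fused

# Tiny demo with zero-initialized gates (every gate equals 0.5):
d = 4
interaction = np.ones(d)
visual, textual = np.full(d, 2.0), np.full(d, 4.0)
params = [(np.zeros((d, d)), np.zeros(d))] * 2
fused = gated_fusion(interaction, [visual, textual], params)
```

With zero weights each gate is sigmoid(0) = 0.5, so the demo yields 1 + 0.5·2 + 0.5·4 = 4 in every dimension; trained gates would instead weight each modality according to the interaction signal.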
Data Availability
The dataset is derived from a publicly available dataset: https://drive.google.com/drive/folders/13cBy1EA_saTUuXxVllKgtfci2A09jyaG
Code availability
The codes for the baselines are provided by the corresponding authors, and the code for our model will be available upon request after the paper is accepted for publication.
Materials availability
Not applicable.
Funding
This work was supported by the Natural Science Foundation of Chongqing, China under Grant CSTB2023NSCQ-LMX0013.
Author information
Contributions
Yanke Chen designed the study, established the proposed model, and played a leading role in the design of experiments. Tianhao Sun contributed to the development of the algorithms and the execution of experiments. Yunhao Ma was responsible for drafting the manuscript and carrying out experiments. Huhai Zou prepared the figures and tables and contributed to the analysis and interpretation of the data. All authors discussed the results and implications at all stages and contributed to the revision of the manuscript. Each author has read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Conflict of interest
The authors declare that they have no conflicts of interest or financial interests in any organizations or entities with a direct financial interest in the subject matter or materials discussed in the manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, Y., Sun, T., Ma, Y. et al. Multifactorial modality fusion network for multimodal recommendation. Appl Intell 55, 139 (2025). https://doi.org/10.1007/s10489-024-06038-0