Modality Matches Modality: Pretraining Modality-Disentangled Item Representations for Recommendation

Published: 25 April 2022

Abstract

Recent works have shown the effectiveness of incorporating textual and visual information to tackle the sparsity problem in recommendation scenarios. To fuse this heterogeneous modality information, an essential prerequisite is to align it for modality-robust feature learning and semantic understanding. Unfortunately, existing works mainly focus on learning the knowledge common across modalities, while the characteristics specific to each modality are discarded, which may degrade recommendation performance.
To this end, we propose a pretraining framework PAMD, which stands for PretrAining Modality-Disentangled Representations Model. Specifically, PAMD utilizes pretrained VGG19 and GloVe to embed both the visual and textual modalities of items into a continuous embedding space. Based on these primitive heterogeneous representations, a disentangled encoder is devised to automatically extract their modality-common characteristics while preserving their modality-specific characteristics. On top of this, a contrastive learning objective is designed to guarantee both the consistency of and the gaps between the modality-disentangled representations. To the best of our knowledge, this is the first pretraining framework to learn modality-disentangled representations in recommendation scenarios. Extensive experiments on three public real-world datasets demonstrate the effectiveness of our pretraining solution against a series of state-of-the-art alternatives, yielding significant performance gains of 4.70%-17.44%.

References

[1]
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2 (2019), 423–443.
[2]
Yang Bao, Hui Fang, and Jie Zhang. 2014. TopicMF: Simultaneously Exploiting Ratings and Reviews for Recommendation. In AAAI 2014. AAAI Press, 2–8.
[3]
Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive Collaborative Filtering: Multimedia Recommendation with Item- and Component-Level Attention. In SIGIR 2017. ACM, 335–344.
[4]
Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: Towards Visually Explainable Recommendation. In SIGIR. ACM, 765–774.
[5]
Zhiyong Cheng, Ying Ding, Lei Zhu, and Mohan S. Kankanhalli. 2018. Aspect-Aware Latent Factor Model: Rating Prediction with Ratings and Reviews. In WWW 2018. ACM, 639–648.
[6]
Qiang Cui, Shu Wu, Qiang Liu, Wen Zhong, and Liang Wang. 2020. MV-RNN: A Multi-View Recurrent Neural Network for Sequential Recommendation. IEEE Trans. Knowl. Data Eng. 32, 2 (2020), 317–331.
[7]
George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. Speech Audio Process. 20, 1 (2012), 30–42.
[8]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1). Association for Computational Linguistics, 4171–4186.
[9]
Fangxiang Feng, Xiaojie Wang, and Ruifan Li. 2014. Cross-modal Retrieval with Correspondence Autoencoder. In ACM Multimedia. ACM, 7–16.
[10]
Guibing Guo, Shichang Ouyang, Xiaodong He, Fajie Yuan, and Xiaohua Liu. 2019. Dynamic Item Block and Prediction Enhancing Block for Sequential Recommendation. In IJCAI. 1373–1379.
[11]
Ruining He, Chunbin Lin, Jianguo Wang, and Julian J. McAuley. 2016. Sherlock: Sparse Hierarchical Embeddings for Visually-Aware One-Class Collaborative Filtering. In IJCAI. IJCAI/AAAI Press, 3740–3746.
[12]
Ruining He and Julian J. McAuley. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In WWW 2016. 507–517.
[13]
Ruining He and Julian J. McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI 2016. AAAI Press, 144–150.
[14]
Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In CIKM 2015. ACM, 1661–1670.
[15]
Xiangnan He and Tat-Seng Chua. 2017. Neural Factorization Machines for Sparse Predictive Analytics. In SIGIR. ACM, 355–364.
[16]
Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW. ACM, 173–182.
[17]
G. E. Hinton and R. R. Salakhutdinov. 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313(2006).
[18]
Xun Huang, Ming-Yu Liu, Serge J. Belongie, and Jan Kautz. 2018. Multimodal Unsupervised Image-to-Image Translation. In ECCV (3)(Lecture Notes in Computer Science, Vol. 11207). Springer, 179–196.
[19]
Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. 2018. Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 40, 2 (2018), 352–364.
[20]
Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian J. McAuley. 2017. Visually-Aware Fashion Recommendation and Design with Generative Image Models. In ICDM 2017. IEEE Computer Society, 207–216.
[21]
Wang-Cheng Kang and Julian J. McAuley. 2018. Self-Attentive Sequential Recommendation. In ICDM 2018. 197–206.
[22]
Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining Language and Vision with a Multimodal Skip-gram Model. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015. 153–163.
[23]
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL. Association for Computational Linguistics, 7871–7880.
[24]
Chenliang Li, Xichuan Niu, Xiangyang Luo, Zhenzhong Chen, and Cong Quan. 2019. A Review-Driven Neural Model for Sequential Recommendation. In IJCAI 2019. 2866–2872.
[25]
Hao Liu, Jindong Han, Yanjie Fu, Jingbo Zhou, Xinjiang Lu, and Hui Xiong. 2020. Multi-Modal Transportation Recommendation with Unified Route Representation Learning. Proc. VLDB Endow. 14, 3 (2020), 342–350.
[26]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692(2019).
[27]
Zhiwei Liu, Ziwei Fan, Yu Wang, and Philip S. Yu. 2021. Augmenting Sequential Recommendation with Pseudo-Prior Items via Reversely Pre-training Transformer. In SIGIR. ACM, 1608–1612.
[28]
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011. 689–696.
[29]
Aditya Pal, Chantat Eksombatchai, Yitong Zhou, Bo Zhao, Charles Rosenberg, and Jure Leskovec. 2020. PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. 2311–2320.
[30]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In EMNLP. ACL, 1532–1543.
[31]
Jinwei Qi and Yuxin Peng. 2018. Cross-modal Bidirectional Translation via Reinforcement Learning. In IJCAI. ijcai.org, 2630–2636.
[32]
Zhaopeng Qiu, Xian Wu, Jingyue Gao, and Wei Fan. 2021. U-BERT: Pre-training User Representations for Improved Recommendation. In AAAI. AAAI Press, 4320–4327.
[33]
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI. AUAI Press, 452–461.
[34]
Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized Markov chains for next-basket recommendation. In WWW 2010. ACM, 811–820.
[35]
Mert Bülent Sariyildiz, Julien Perez, and Diane Larlus. 2020. Learning Visual Representations with Caption Annotations. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VIII. 153–170.
[36]
Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
[37]
Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In CIKM. ACM, 1441–1450.
[38]
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 5099–5110.
[39]
Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In ACM Multimedia. ACM, 1437–1445.
[40]
Shuang Wu, Sravanthi Bondugula, Florian Luisier, Xiaodan Zhuang, and Pradeep Natarajan. 2014. Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts. In CVPR. IEEE Computer Society, 2665–2672.
[41]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML(JMLR Workshop and Conference Proceedings, Vol. 37). JMLR.org, 2048–2057.
[42]
Fei Yan and Krystian Mikolajczyk. 2015. Deep correlation for matching images and text. In CVPR. IEEE Computer Society, 3441–3450.
[43]
Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. 2021. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis. In AAAI. AAAI Press, 10790–10797.
[44]
Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. 2021. Multimodal Contrastive Training for Visual Representation Learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. 6995–7004.
[45]
Zheni Zeng, Chaojun Xiao, Yuan Yao, Ruobing Xie, Zhiyuan Liu, Fen Lin, Leyu Lin, and Maosong Sun. 2021. Knowledge Transfer via Pre-training for Recommendation: A Review and Prospect. Frontiers Big Data 4(2021), 602071.
[46]
Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In CIKM. ACM, 1893–1902.

Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022
3764 pages
ISBN:9781450390965
DOI:10.1145/3485447

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Contrastive Learning
  2. Disentangled Encoder
  3. Modality-Disentangled Representation
  4. Pretraining

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '22
WWW '22: The ACM Web Conference 2022
April 25 - 29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions (23%)

Article Metrics

  • Downloads (Last 12 months): 182
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 01 Mar 2025

Cited By

  • (2024) Multimodal Recommender Systems: A Survey. ACM Computing Surveys 57(2), 1-17. DOI: 10.1145/3695461. Online publication date: 10-Oct-2024.
  • (2024) Multimodal-aware Multi-intention Learning for Recommendation. Proceedings of the 32nd ACM International Conference on Multimedia, 5663-5672. DOI: 10.1145/3664647.3681412. Online publication date: 28-Oct-2024.
  • (2024) Unified Visual Preference Learning for User Intent Understanding. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 816-825. DOI: 10.1145/3616855.3635858. Online publication date: 4-Mar-2024.
  • (2024) Bi-Level Graph Structure Learning for Next POI Recommendation. IEEE Transactions on Knowledge and Data Engineering 36(11), 5695-5708. DOI: 10.1109/TKDE.2024.3397683. Online publication date: Nov-2024.
  • (2024) PESI. Knowledge-Based Systems 283(C). DOI: 10.1016/j.knosys.2023.111133. Online publication date: 4-Mar-2024.
  • (2024) Multi-level cross-modal contrastive learning for review-aware recommendation. Expert Systems with Applications: An International Journal 247(C). DOI: 10.1016/j.eswa.2024.123341. Online publication date: 1-Aug-2024.
  • (2023) Contrastive Self-supervised Learning in Recommender Systems: A Survey. ACM Transactions on Information Systems 42(2), 1-39. DOI: 10.1145/3627158. Online publication date: 8-Nov-2023.
  • (2023) Augmented Negative Sampling for Collaborative Filtering. Proceedings of the 17th ACM Conference on Recommender Systems, 256-266. DOI: 10.1145/3604915.3608811. Online publication date: 14-Sep-2023.
  • (2023) Multimodal Optimal Transport Knowledge Distillation for Cross-domain Recommendation. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2959-2968. DOI: 10.1145/3583780.3614983. Online publication date: 21-Oct-2023.
  • (2023) Modal-aware Bias Constrained Contrastive Learning for Multimodal Recommendation. Proceedings of the 31st ACM International Conference on Multimedia, 6369-6378. DOI: 10.1145/3581783.3612568. Online publication date: 26-Oct-2023.
