Multimodal Pre-Training with Self-Distillation for Product Understanding in E-Commerce

Published: 27 February 2023

Abstract

Product understanding refers to a series of product-centric tasks, such as classification, alignment, and attribute value prediction, which require fine-grained fusion of a product's various modalities. Strong product modeling ability enhances the user experience and benefits search and recommendation systems. In this paper, we propose MBSD, a pre-trained vision-and-language model that integrates the heterogeneous information of a product in a single-stream BERT-style architecture. Compared with current approaches, MBSD uses a lightweight convolutional neural network instead of a heavy feature extractor for image encoding, which lowers latency. In addition, we exploit user behavior data to design a two-stage pre-training task that understands products from different perspectives. Moreover, multimodal pre-training suffers from an underlying modality-imbalance problem that impairs downstream tasks. To this end, we propose a novel self-distillation strategy that transfers knowledge from the dominant modality to the weaker one, so that each modality is fully exploited during pre-training. Experimental results on several product understanding tasks demonstrate that MBSD outperforms competitive baselines.
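The paper's implementation details are not reproduced on this page, so the two sketches below are purely illustrative. They assume PyTorch; the module names, token counts, and loss form are our own hypothetical choices, not MBSD's actual design. The first sketch shows the general shape of a single-stream BERT-style encoder in which a lightweight convolutional stem, rather than a heavy region-feature extractor, turns the product image into tokens that share one Transformer with the text tokens:

```python
# Hypothetical sketch (not MBSD's actual code): a single-stream
# BERT-style encoder whose image tokens come from a lightweight
# convolutional stem instead of a heavy feature extractor.
# Module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SingleStreamProductEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, layers=12, heads=12):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Two strided convolutions map a 224x224 image to a 7x7 grid,
        # i.e. 49 image tokens, with far less compute than a detector.
        self.img_stem = nn.Sequential(
            nn.Conv2d(3, dim // 4, kernel_size=7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv2d(dim // 4, dim, kernel_size=8, stride=8),
        )
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, token_ids, image):
        txt = self.text_embed(token_ids)                       # (B, L, D)
        img = self.img_stem(image).flatten(2).transpose(1, 2)  # (B, 49, D)
        # Single stream: text and image tokens attend to each other
        # inside one shared Transformer.
        return self.encoder(torch.cat([txt, img], dim=1))
```

The second sketch illustrates one plausible reading of the self-distillation idea: the fused (dominant) prediction distribution teaches the weaker unimodal head through a temperature-softened KL term, in the spirit of Hinton-style knowledge distillation. The exact loss in MBSD may differ:

```python
# Hypothetical sketch of a self-distillation term: the fused head's
# softened distribution supervises the weaker unimodal head.
# The temperature and loss form are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def self_distillation_loss(fused_logits: torch.Tensor,
                           weak_logits: torch.Tensor,
                           temperature: float = 2.0) -> torch.Tensor:
    teacher = F.softmax(fused_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(weak_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
```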

Supplementary Material

MP4 File (WSDM23-fp384.mp4)


Cited By

  • (2024) The development and application of a novel E-commerce recommendation system used in electric power B2B sector. Frontiers in Big Data, vol. 7. https://doi.org/10.3389/fdata.2024.1374980. Online publication date: 31-Jul-2024.
  • (2024) Multi-Label Zero-Shot Product Attribute-Value Extraction. Proceedings of the ACM Web Conference 2024, pp. 2259-2270. https://doi.org/10.1145/3589334.3645649. Online publication date: 13-May-2024.
  • (2024) De-noised Vision-language Fusion Guided by Visual Cues for E-commerce Product Search. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1986-1996. https://doi.org/10.1109/CVPRW63382.2024.00204. Online publication date: 17-Jun-2024.


      Published In

      WSDM '23: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining
      February 2023
      1345 pages
      ISBN:9781450394079
      DOI:10.1145/3539597
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 February 2023


      Author Tags

      1. modality imbalance
      2. product understanding
      3. vision-and-language pre-training

      Qualifiers

      • Research-article

      Conference

      WSDM '23

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%

      Article Metrics

• Downloads (Last 12 months): 140
• Downloads (Last 6 weeks): 14
      Reflects downloads up to 03 Mar 2025
