DOI: 10.1145/3477495.3532076
research-article

V2P: Vision-to-Prompt based Multi-Modal Product Summary Generation

Published: 07 July 2022

Abstract

Multi-modal product summary generation is a new and challenging task that aims to generate a concise, readable summary for a product from its multi-modal content, e.g., its long text description and image. Although existing methods have achieved strong results, they suffer from three key limitations: they 1) overlook the benefit of pre-training, 2) lack representation-level supervision, and 3) ignore the diversity of seller-generated data. To address these limitations, we propose a Vision-to-Prompt based multi-modal product summary generation framework, dubbed V2P, which adopts a Generative Pre-trained Language Model (GPLM) as the backbone. To preserve the original text capability of the GPLM while fully exploiting the high-level concepts contained in the product image, we design V2P with two key components: vision-based prominent attribute prediction and attribute prompt-guided summary generation. The first component uses a Swin Transformer to extract the vital semantic attributes of the product from its image; the second uses a GPLM to generate the summary from the product's long text description and the attribute prompts yielded by the first component. For comprehensive supervision of the second component, we introduce representation-level regularization in addition to conventional output-level supervision. We also design a data augmentation-based robustness regularization to handle diverse inputs and improve the robustness of the second component. Extensive experiments on a large-scale Chinese dataset verify the superiority of our model over cutting-edge methods.
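
The two-component flow described in the abstract can be pictured with a minimal sketch. The abstract does not specify how attribute prompts are formatted or how many attributes are kept, so everything below is an assumption made for illustration: the function names `select_prominent_attributes` and `build_prompted_input`, the top-k selection rule, the `[SEP]` separator, and the example scores are all hypothetical and are not the authors' implementation. The English example text is likewise illustrative (the paper's dataset is Chinese).

```python
from typing import Dict, List


def select_prominent_attributes(scores: Dict[str, float], top_k: int = 3) -> List[str]:
    """Return the top_k attribute names by score (ties broken alphabetically).

    Assumption: the vision component yields one score per candidate attribute.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]


def build_prompted_input(attributes: List[str], description: str,
                         sep: str = " [SEP] ") -> str:
    """Prepend the attribute prompt to the long text description.

    Assumption: the prompt is a comma-joined list followed by a separator;
    the real prompt format is not given in the abstract.
    """
    prompt = ", ".join(attributes)
    return f"{prompt}{sep}{description}" if prompt else description


# Hypothetical scores standing in for the output of the vision component.
scores = {"cotton": 0.91, "long sleeve": 0.84, "striped": 0.40, "hooded": 0.12}
attrs = select_prominent_attributes(scores, top_k=2)
text_in = build_prompted_input(attrs, "A soft everyday shirt for spring and autumn.")
print(text_in)
```

In this sketch the resulting string is what would be handed to the text generator, so the language model only ever sees text; this mirrors the stated goal of keeping the GPLM's original text capability while still passing along image-derived concepts.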

Supplementary Material

MP4 File (SIGIR22-fp171.mp4)
Presentation video.

    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN:9781450387323
    DOI:10.1145/3477495
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. multi-modal summarization
    2. pre-trained language model
    3. product summary generation

    Qualifiers

    • Research-article

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 141
    • Downloads (Last 6 weeks): 14
    Reflects downloads up to 19 Feb 2025

    Citations

    Cited By

    • (2024) Homogeneous-listing-augmented Self-supervised Multimodal Product Title Refinement. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2870-2874. DOI: 10.1145/3626772.3661347. Online publication date: 10-Jul-2024
    • (2024) MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9622-9631. DOI: 10.1109/CVPR52733.2024.00919. Online publication date: 16-Jun-2024
    • (2024) Product promotion copywriting from multimodal data. Neurocomputing 575:C. DOI: 10.1016/j.neucom.2024.127253. Online publication date: 16-May-2024
    • (2024) Contrastive learning based on linguistic knowledge and adaptive augmentation for text classification. Knowledge-Based Systems 300:C. DOI: 10.1016/j.knosys.2024.112189. Online publication date: 18-Nov-2024
    • (2024) Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer. Information Fusion 104 (102204). DOI: 10.1016/j.inffus.2023.102204. Online publication date: Apr-2024
    • (2023) Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model. ACM Transactions on Information Systems 42:2, 1-25. DOI: 10.1145/3606368. Online publication date: 6-Oct-2023
    • (2023) Calibration Learning for Few-shot Novel Product Description. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1864-1868. DOI: 10.1145/3539618.3591959. Online publication date: 19-Jul-2023
    • (2023) OFAR: A Multimodal Evidence Retrieval Framework for Illegal Live-streaming Identification. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3410-3414. DOI: 10.1145/3539618.3591864. Online publication date: 19-Jul-2023
    • (2023) Adapting Generative Pretrained Language Model for Open-domain Multimodal Sentence Summarization. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 195-204. DOI: 10.1145/3539618.3591633. Online publication date: 19-Jul-2023
    • (2023) Beyond Co-Occurrence: Multi-Modal Session-Based Recommendation. IEEE Transactions on Knowledge and Data Engineering 36:4, 1450-1462. DOI: 10.1109/TKDE.2023.3309995. Online publication date: 30-Aug-2023
