research-article

V2P: Vision-to-Prompt based Multi-Modal Product Summary Generation

Authors:

Liqiang NieAuthors Info & Claims

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 992 - 1001

https://doi.org/10.1145/3477495.3532076

Published: 07 July 2022 Publication History

Get Access

Abstract

Multi-modal Product Summary Generation is a new yet challenging task, which aims to generate a concise and readable summary for a product given its multi-modal content, e.g., its long text description and image. Although existing methods have achieved great success, they still suffer from three key limitations: 1) overlook the benefit of pre-training, 2) lack the representation-level supervision, and 3) ignore the diversity of the seller-generated data. To address these limitations, in this work, we propose a Vision-to-Prompt based multi-modal product summary generation framework, dubbed as V2P, where a Generative Pre-trained Language Model (GPLM) is adopted as the backbone. In particular, to maintain the original text capability of the GPLM and fully utilize the high-level concepts contained in the product image, we design V2P with two key components: vision-based prominent attribute prediction, and attribute prompt-guided summary generation. The first component works on obtaining the vital semantic attributes of the product from its image by the Swin Transformer, while the second component aims to generate the summary based on the product's long text description and the attribute prompts yielded by the first component with a GPLM. Towards comprehensive supervision over the second component, apart from the conventional output-level supervision, we introduce the representation-level regularization. Meanwhile, we design the data augmentation-based robustness regularization to handle the diverse inputs and improve the robustness of the second component. Extensive experiments on a large-scale Chinese dataset verify the superiority of our model over cutting-edge methods.

Supplementary Material

MP4 File (SIGIR22-fp171.mp4)

Presentation video.

Download
18.49 MB

References

[1]

Philip Bachman, R. Devon Hjelm, and William Buchwalter. 2019. Learning Representations by Maximizing Mutual Information Across Views. In Proceedings of the Annual Conference on Neural Information Processing Systems, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché -Buc, Emily B. Fox, and Roman Garnett (Eds.). 15509--15519.

Abstract

Supplementary Material

References

Cited By

Index Terms

Recommendations

Dialogue summarization enhanced response generation for multi-domain task-oriented dialogue systems

Multi-granularity contrastive zero-shot learning model based on attribute decomposition

When Few-Shot Learning Meets Large-Scale Knowledge-Enhanced Pre-training: Alibaba at FewCLUE

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Share

Share this Publication link

Share on social media

Affiliations