ABSTRACT
Large-scale pre-trained language models have attracted significant attention in recent years for their effectiveness in extracting sentence representations. However, most current pre-trained models use transformer-based encoders trained on a single modality and are designed primarily for specific tasks such as natural language inference and question answering. This approach neglects the complementary information carried by multimodal data, which can enhance the quality of sentence representations. To address this issue, we propose a Visually-supervised Pre-trained multimodal model (ViP) for sentence representation. Our model leverages diverse label-free multimodal proxy tasks to embed visual information into language, facilitating effective modality alignment and the exploitation of cross-modal complementarity. In addition, it employs a novel approach to distinguish highly similar negative samples from positive ones. Comprehensive downstream experiments on natural language understanding and sentiment classification demonstrate that ViP outperforms both existing unimodal and multimodal pre-trained models. Our contributions include a novel approach to multimodal pre-training and a state-of-the-art sentence representation model that incorporates visual information. Our code is available at https://github.com/gentlefress/ViP
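The abstract does not spell out the form of the label-free proxy objective. As an illustration only, a common choice for aligning paired text and image embeddings in this setting is a symmetric InfoNCE contrastive loss, where non-paired samples within a batch act as (possibly highly similar) negatives. The minimal NumPy sketch below shows that standard formulation; the function name, temperature value, and batch construction are assumptions for illustration, not details taken from ViP.

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric cross-modal InfoNCE loss over a batch of paired embeddings.

    text_emb, image_emb: (batch, dim) arrays; row i of each matrix forms a
    positive pair, and every other row in the batch serves as a negative.
    """
    # L2-normalise so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

    logits = (t @ v.T) / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(t))            # positives lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When the paired embeddings are well aligned, the diagonal entries dominate each row of the similarity matrix and the loss approaches zero; hard negatives (rows nearly as similar as the true pair) keep the loss high, which is exactly the regime a hard-negative-aware variant would target.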
Index Terms
- Enhancing Sentence Representation with Visually-supervised Multimodal Pre-training