DOI: 10.1145/3474085.3475703

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Published: 17 October 2021

Abstract

The BERT-type structure has revolutionized vision-language pre-training and achieved state-of-the-art results on numerous vision-language downstream tasks. Existing solutions predominantly capitalize on multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs inevitably introduce noise into the cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as noisy augmentations of the primary unmasked ones, CoCo strengthens the video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. The CoCo proxy objective can further be integrated into any BERT-type encoder-decoder structure for video-language pre-training, yielding a model we name Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on the TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
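
To make the proxy objective concrete, the following PyTorch sketch shows one way the contrastive matching-and-denoising terms described above could be combined: an InfoNCE-style inter-modal matching loss on the unmasked video and sentence features, plus intra-modal denoising losses that pull the masked (noisy) features toward their unmasked counterparts. The pooled feature interface, the temperature value, and the equal weighting of the terms are illustrative assumptions for this sketch, not the paper's exact formulation or released code.

# Illustrative sketch (not the authors' released implementation): an InfoNCE-style
# combination of inter-modal matching and intra-modal denoising over pooled
# features from masked and unmasked inputs. Feature shapes, the temperature,
# and the equal weighting of terms are assumptions made for this example.
import torch
import torch.nn.functional as F


def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss: the i-th query should match the i-th key within the batch."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature                # (B, B) similarities
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)


def coco_style_loss(v_unmasked, v_masked, s_unmasked, s_masked):
    """Combine inter-modal matching with intra-modal denoising.

    v_*: (B, D) pooled video features from unmasked / masked frame sequences.
    s_*: (B, D) pooled sentence features from unmasked / masked word sequences.
    """
    # Inter-modal matching on the primary (unmasked) inputs, in both directions.
    matching = info_nce(v_unmasked, s_unmasked) + info_nce(s_unmasked, v_unmasked)
    # Intra-modal denoising: masked (noisy) features should agree with the
    # unmasked features of the same video / sentence within each modality.
    denoising = info_nce(v_masked, v_unmasked) + info_nce(s_masked, s_unmasked)
    return matching + denoising


# Toy usage with random tensors standing in for BERT-type encoder outputs.
if __name__ == "__main__":
    B, D = 8, 256
    loss = coco_style_loss(torch.randn(B, D), torch.randn(B, D),
                           torch.randn(B, D), torch.randn(B, D))
    print(loss.item())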



Information & Contributors


Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. contrastive learning
  2. cross-modal retrieval
  3. video captioning
  4. video understanding
  5. vision-language pre-training

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Regularized Contrastive Partial Multi-view Outlier Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 8711-8720. DOI: 10.1145/3664647.3681125. Online publication date: 28-Oct-2024
  • (2024) A Dual-Masked Deep Structural Clustering Network With Adaptive Bidirectional Information Delivery. IEEE Transactions on Neural Networks and Learning Systems 35(10), 14783-14796. DOI: 10.1109/TNNLS.2023.3281570. Online publication date: Oct-2024
  • (2024) Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval. IEEE Transactions on Multimedia 26, 10678-10691. DOI: 10.1109/TMM.2024.3410129. Online publication date: 1-Jan-2024
  • (2024) Large-Scale Visual Language Model Boosted by Contrast Domain Adaptation for Intelligent Industrial Visual Monitoring. IEEE Transactions on Industrial Informatics 20(12), 14114-14123. DOI: 10.1109/TII.2024.3441638. Online publication date: Dec-2024
  • (2024) Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(11), 12019-12031. DOI: 10.1109/TCSVT.2024.3422869. Online publication date: Nov-2024
  • (2024) SNP-S3: Shared Network Pre-Training and Significant Semantic Strengthening for Various Video-Text Tasks. IEEE Transactions on Circuits and Systems for Video Technology 34(4), 2525-2535. DOI: 10.1109/TCSVT.2023.3303945. Online publication date: Apr-2024
  • (2024) Prompt-Based Memory Bank for Continual Test-Time Domain Adaptation in Vision-Language Models. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650069. Online publication date: 30-Jun-2024
  • (2024) SDGIN. Knowledge-Based Systems 286(C). DOI: 10.1016/j.knosys.2023.111251. Online publication date: 17-Apr-2024
  • (2024) Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning. Computer Vision – ECCV 2024, 237-254. DOI: 10.1007/978-3-031-72998-0_14. Online publication date: 30-Sep-2024
  • (2023) TEVL: Trilinear Encoder for Video-language Representation Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5s), 1-20. DOI: 10.1145/3585388. Online publication date: 7-Jun-2023
