DOI: 10.1145/3474085.3475703

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

Published: 17 October 2021

Abstract

The BERT-type structure has revolutionized vision-language pre-training and achieved state-of-the-art results on numerous vision-language downstream tasks. Existing solutions predominantly capitalize on multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs inevitably introduce noise into the cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as noisy augmentations of the primary unmasked ones, CoCo strengthens the video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. The CoCo proxy objective can further be integrated into any BERT-type encoder-decoder structure for video-language pre-training, yielding a model we name Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on the TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.
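
To make the proxy objective concrete, the following PyTorch sketch shows one way the contrastive matching-and-denoising terms described above could be combined: an InfoNCE-style inter-modal matching loss on the unmasked video and sentence features, plus intra-modal denoising losses that pull the masked (noisy) features toward their unmasked counterparts. The pooled feature interface, the temperature value, and the equal weighting of the terms are illustrative assumptions for this sketch, not the paper's exact formulation or released code.

# Illustrative sketch (not the authors' released implementation): an InfoNCE-style
# combination of inter-modal matching and intra-modal denoising over pooled
# features from masked and unmasked inputs. Feature shapes, the temperature,
# and the equal weighting of terms are assumptions made for this example.
import torch
import torch.nn.functional as F


def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss: the i-th query should match the i-th key within the batch."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature                # (B, B) similarities
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)


def coco_style_loss(v_unmasked, v_masked, s_unmasked, s_masked):
    """Combine inter-modal matching with intra-modal denoising.

    v_*: (B, D) pooled video features from unmasked / masked frame sequences.
    s_*: (B, D) pooled sentence features from unmasked / masked word sequences.
    """
    # Inter-modal matching on the primary (unmasked) inputs, in both directions.
    matching = info_nce(v_unmasked, s_unmasked) + info_nce(s_unmasked, v_unmasked)
    # Intra-modal denoising: masked (noisy) features should agree with the
    # unmasked features of the same video / sentence within each modality.
    denoising = info_nce(v_masked, v_unmasked) + info_nce(s_masked, s_unmasked)
    return matching + denoising


# Toy usage with random tensors standing in for BERT-type encoder outputs.
if __name__ == "__main__":
    B, D = 8, 256
    loss = coco_style_loss(torch.randn(B, D), torch.randn(B, D),
                           torch.randn(B, D), torch.randn(B, D))
    print(loss.item())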



Information & Contributors


Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. contrastive learning
  2. cross-modal retrieval
  3. video captioning
  4. video understanding
  5. vision-language pre-training

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Regularized Contrastive Partial Multi-view Outlier Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 8711-8720. DOI: 10.1145/3664647.3681125. Online publication date: 28-Oct-2024
  • (2024) A Dual-Masked Deep Structural Clustering Network With Adaptive Bidirectional Information Delivery. IEEE Transactions on Neural Networks and Learning Systems 35(10), 14783-14796. DOI: 10.1109/TNNLS.2023.3281570. Online publication date: Oct-2024
  • (2024) Learning Semantic Polymorphic Mapping for Text-Based Person Retrieval. IEEE Transactions on Multimedia 26, 10678-10691. DOI: 10.1109/TMM.2024.3410129. Online publication date: 1-Jan-2024
  • (2024) Large-Scale Visual Language Model Boosted by Contrast Domain Adaptation for Intelligent Industrial Visual Monitoring. IEEE Transactions on Industrial Informatics 20(12), 14114-14123. DOI: 10.1109/TII.2024.3441638. Online publication date: Dec-2024
  • (2024) Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(11), 12019-12031. DOI: 10.1109/TCSVT.2024.3422869. Online publication date: Nov-2024
  • (2024) SNP-S3: Shared Network Pre-Training and Significant Semantic Strengthening for Various Video-Text Tasks. IEEE Transactions on Circuits and Systems for Video Technology 34(4), 2525-2535. DOI: 10.1109/TCSVT.2023.3303945. Online publication date: Apr-2024
  • (2024) Prompt-Based Memory Bank for Continual Test-Time Domain Adaptation in Vision-Language Models. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650069. Online publication date: 30-Jun-2024
  • (2024) SDGIN. Knowledge-Based Systems 286(C). DOI: 10.1016/j.knosys.2023.111251. Online publication date: 17-Apr-2024
  • (2024) Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning. Computer Vision – ECCV 2024, 237-254. DOI: 10.1007/978-3-031-72998-0_14. Online publication date: 30-Sep-2024
  • (2023) TEVL: Trilinear Encoder for Video-language Representation Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5s), 1-20. DOI: 10.1145/3585388. Online publication date: 7-Jun-2023
