DOI: 10.1145/3503161.3551581

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training

Published: 10 October 2022

Abstract

In this work, we present Auto-captions on GIF (ACTION), a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages. The Auto-captions on GIF dataset can be used to pre-train generic feature representations or encoder-decoder structures for video captioning, as well as for other downstream tasks (e.g., sentence localization in videos and video question answering). We present a detailed analysis of the Auto-captions on GIF dataset in comparison to existing video-sentence datasets. We also provide an evaluation of a Transformer-based encoder-decoder structure for vision-language pre-training, which is further adapted to the video captioning downstream task and yields compelling generalizability on MSR-VTT. The dataset is available at http://www.auto-video-captions.top/2022/dataset.
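To make the setup described above more concrete, below is a minimal, hypothetical sketch (in PyTorch) of a generic Transformer-based encoder-decoder that consumes per-frame video features and is trained with token-level cross-entropy on video-sentence pairs. All dimensions, names (e.g., VideoCaptioner, frame_proj), and the pooled frame-feature input format are assumptions made for illustration, not the architecture or data format used in the paper.

```python
# Minimal illustrative sketch of a Transformer-based encoder-decoder for
# video-sentence pre-training / video captioning. NOT the paper's actual
# architecture: sizes, vocabulary, and frame-feature format are assumptions.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 nhead=8, num_layers=6, max_len=64):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)       # project frame features into model space
        self.token_emb = nn.Embedding(vocab_size, d_model)   # caption token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)        # learned positions for caption tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)         # next-token prediction head

    def forward(self, frame_feats, caption_tokens):
        # frame_feats: (batch, num_frames, feat_dim); caption_tokens: (batch, seq_len)
        src = self.frame_proj(frame_feats)
        positions = torch.arange(caption_tokens.size(1), device=caption_tokens.device)
        tgt = self.token_emb(caption_tokens) + self.pos_emb(positions)
        # causal mask so each caption position attends only to earlier tokens
        seq_len = caption_tokens.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=caption_tokens.device),
            diagonal=1)
        hidden = self.transformer(src, tgt, tgt_mask=causal_mask)
        return self.lm_head(hidden)  # (batch, seq_len, vocab_size) logits

# Toy usage: teacher-forced cross-entropy on dummy video-sentence pairs.
model = VideoCaptioner()
feats = torch.randn(2, 16, 2048)            # 2 clips, 16 pooled frame features each
tokens = torch.randint(0, 10000, (2, 12))   # 2 tokenized captions (random ids here)
logits = model(feats, tokens[:, :-1])       # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

In the regime sketched in the abstract, such a model would first be pre-trained on the automatically harvested ACTION video-sentence pairs and then fine-tuned on a smaller curated captioning benchmark such as MSR-VTT.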




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. video captioning
  2. video understanding
  3. vision-language pre-training

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)


Cited By

  • (2024) Backdoor Attack Against Dataset Distillation in Natural Language Processing. Applied Sciences, 14(23), 11425. DOI: 10.3390/app142311425
  • (2024) OSNeRF: On-demand Semantic Neural Radiance Fields for Fast and Robust 3D Object Reconstruction. Proceedings of the 32nd ACM International Conference on Multimedia, 4505-4514. DOI: 10.1145/3664647.3681686
  • (2024) End-to-End Video Scene Graph Generation With Temporal Propagation Transformer. IEEE Transactions on Multimedia, 26, 1613-1625. DOI: 10.1109/TMM.2023.3283879
  • (2024) Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 34(11), 12019-12031. DOI: 10.1109/TCSVT.2024.3422869
  • (2024) EvCap: Element-Aware Video Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 34(10), 9718-9731. DOI: 10.1109/TCSVT.2024.3399933
  • (2024) SNP-S3: Shared Network Pre-Training and Significant Semantic Strengthening for Various Video-Text Tasks. IEEE Transactions on Circuits and Systems for Video Technology, 34(4), 2525-2535. DOI: 10.1109/TCSVT.2023.3303945
  • (2024) DARTScore: DuAl-Reconstruction Transformer for Video Captioning Evaluation. IEEE Transactions on Circuits and Systems for Video Technology, 34(4), 2041-2055. DOI: 10.1109/TCSVT.2023.3299932
  • (2023) TEVL: Trilinear Encoder for Video-language Representation Learning. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(5s), 1-20. DOI: 10.1145/3585388
  • (2023) View while Moving: Efficient Video Recognition in Long-untrimmed Videos. Proceedings of the 31st ACM International Conference on Multimedia, 173-183. DOI: 10.1145/3581783.3612035
  • (2023) Bottom-up and Top-down Object Inference Networks for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(5), 1-18. DOI: 10.1145/3580366
