
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques

Published: 17 October 2021

Abstract

The quality of video representations directly determines the performance of video-related tasks, for both understanding and generation. In this paper, we propose a single-modality pretrained feature fusion technique, composed of a multi-view feature extraction method and a carefully designed multi-modality feature fusion strategy. Comprehensive ablation studies on the MSR-VTT dataset demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods on both the MSR-VTT and VATEX datasets. We further propose a multi-modality pretrained model finetuning technique and a dataset augmentation scheme to improve the model's generalization capability. With these two pretraining techniques and the dataset augmentation scheme, we win first place in the video captioning track of the MM21 Pre-training for Video Understanding Challenge.
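The abstract describes the multi-modality feature fusion strategy only at a high level. The sketch below is a minimal, hypothetical illustration of one such fusion step: features from separately pretrained single-modality extractors (e.g., appearance, motion, and audio backbones) are projected to a common width, concatenated along the temporal axis, and passed through a Transformer encoder before a caption decoder. All class names, feature widths, and hyperparameters here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class MultiModalityFusion(nn.Module):
    """Hypothetical sketch of multi-modality feature fusion:
    per-modality features from frozen pretrained extractors are
    projected to a shared width, concatenated over time, and
    encoded jointly with a Transformer encoder."""

    def __init__(self, dims, hidden=512, heads=8, layers=2):
        super().__init__()
        # One linear projection per modality (dims: list of input widths).
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feats):
        # feats: list of (batch, time_i, dim_i) tensors, one per modality.
        projected = [p(f) for p, f in zip(self.proj, feats)]
        fused = torch.cat(projected, dim=1)  # concatenate along the temporal axis
        return self.encoder(fused)           # (batch, sum(time_i), hidden)

# Example with made-up feature widths for appearance, motion, and audio.
fusion = MultiModalityFusion(dims=[2048, 1024, 2048])
appearance = torch.randn(2, 16, 2048)
motion = torch.randn(2, 8, 1024)
audio = torch.randn(2, 4, 2048)
out = fusion([appearance, motion, audio])    # -> torch.Size([2, 28, 512])
```

The fused sequence would then feed a captioning decoder; whether fusion happens by temporal concatenation, cross-attention, or another mechanism is a design choice not specified by the abstract.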



    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. pretraining
    2. video captioning

    Qualifiers

    • Short-paper


    Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
