
MM21 Pre-training for Video Understanding Challenge: Video Captioning with Pretraining Techniques

Published: 17 October 2021

Abstract

The quality of video representations directly determines the performance of video-related tasks, for both understanding and generation. In this paper, we propose a single-modality pretrained feature fusion technique, composed of a multi-view feature extraction method and a carefully designed multi-modality feature fusion strategy. Comprehensive ablation studies on the MSR-VTT dataset demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods on both the MSR-VTT and VATEX datasets. We further propose a multi-modality pretrained model finetuning technique and a dataset augmentation scheme to improve the model's generalization capability. With these two pretraining techniques and the dataset augmentation scheme, we win first place in the video captioning track of the MM21 Pre-training for Video Understanding Challenge.
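The abstract describes the multi-modality feature fusion strategy only at a high level. The sketch below is a minimal, hypothetical illustration of one such fusion step: features from separately pretrained single-modality extractors (e.g., appearance, motion, and audio backbones) are projected to a common width, concatenated along the temporal axis, and passed through a Transformer encoder before a caption decoder. All class names, feature widths, and hyperparameters here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class MultiModalityFusion(nn.Module):
    """Hypothetical sketch of multi-modality feature fusion:
    per-modality features from frozen pretrained extractors are
    projected to a shared width, concatenated over time, and
    encoded jointly with a Transformer encoder."""

    def __init__(self, dims, hidden=512, heads=8, layers=2):
        super().__init__()
        # One linear projection per modality (dims: list of input widths).
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feats):
        # feats: list of (batch, time_i, dim_i) tensors, one per modality.
        projected = [p(f) for p, f in zip(self.proj, feats)]
        fused = torch.cat(projected, dim=1)  # concatenate along the temporal axis
        return self.encoder(fused)           # (batch, sum(time_i), hidden)

# Example with made-up feature widths for appearance, motion, and audio.
fusion = MultiModalityFusion(dims=[2048, 1024, 2048])
appearance = torch.randn(2, 16, 2048)
motion = torch.randn(2, 8, 1024)
audio = torch.randn(2, 4, 2048)
out = fusion([appearance, motion, audio])    # -> torch.Size([2, 28, 512])
```

The fused sequence would then feed a captioning decoder; whether fusion happens by temporal concatenation, cross-attention, or another mechanism is a design choice not specified by the abstract.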



    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. pretraining
    2. video captioning

    Qualifiers

    • Short-paper


    Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
