DOI: 10.1145/3394171.3416290

XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning

Published: 12 October 2020

Abstract

Dynamic features extracted by 3D convolutional networks and static features extracted by 2D CNNs have both been shown to benefit video captioning. We adaptively fuse these two kinds of features in an X-Linear Attention Network and propose the resulting XlanV model for video captioning. However, we observe that dynamic features do not combine well with vision-language pre-training when the frame-length distribution and average inter-frame pixel difference of the pre-training videos differ from those of the test videos. Consequently, we train the XlanV model directly on the MSR-VTT dataset without pre-training on the GIF dataset in this challenge. The proposed XlanV model reaches 1st place in the Pre-training for Video Captioning challenge, which shows that thoroughly exploiting dynamic features can be more effective than vision-language pre-training in this setting.
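
The abstract does not spell out the fusion mechanism. As a rough illustration of adaptive multi-modality fusion, the sketch below mixes projected static (2D-CNN) and dynamic (3D-CNN) features through a learned sigmoid gate; the module names, dimensions, and gating form are our assumptions, not the exact XlanV design.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    # Illustrative gated fusion of static (2D-CNN) and dynamic (3D-CNN) features.
    # A sigmoid gate decides, per feature dimension, how much each modality
    # contributes to the fused token. This is not the exact XlanV formulation.
    def __init__(self, static_dim, dynamic_dim, fused_dim):
        super().__init__()
        self.static_proj = nn.Linear(static_dim, fused_dim)    # e.g. ResNet frame features
        self.dynamic_proj = nn.Linear(dynamic_dim, fused_dim)  # e.g. C3D/I3D clip features
        self.gate = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.Sigmoid())

    def forward(self, static_feat, dynamic_feat):
        s = self.static_proj(static_feat)          # (batch, tokens, fused_dim)
        d = self.dynamic_proj(dynamic_feat)        # (batch, tokens, fused_dim)
        g = self.gate(torch.cat([s, d], dim=-1))   # per-dimension mixing weights in [0, 1]
        return g * s + (1.0 - g) * d               # adaptively fused representation

# Example: fuse 2048-d static and 1024-d dynamic features into 512-d tokens.
fusion = AdaptiveFusion(2048, 1024, 512)
fused = fusion(torch.randn(2, 20, 2048), torch.randn(2, 20, 1024))
print(fused.shape)  # torch.Size([2, 20, 512])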

Supplementary Material

MP4 File (3394171.3416290.mp4)
We introduce the XlanV model, which reaches 1st place in the Pre-training for Video Captioning grand challenge. XlanV effectively exploits both dynamic and static features by adaptively fusing them. Qualitative studies show that the dynamic features extracted from the GIF dataset are of low quality, so we train the XlanV model on the MSR-VTT dataset only. The model outperforms the next two teams by a clear margin, which further verifies the capacity of the proposed XlanV model.
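
The abstract attributes the pre-training mismatch to frame-length and average pixel-difference statistics of the videos. As a hedged sketch (not the authors' procedure), such per-clip statistics could be gathered as follows; the function name and OpenCV-based decoding are our own choices.

import cv2
import numpy as np

def clip_statistics(video_path, max_frames=300):
    # Return (frame_count, mean absolute inter-frame pixel difference).
    # A low difference suggests a near-static clip, whose 3D-CNN (dynamic)
    # features carry little motion information.
    cap = cv2.VideoCapture(video_path)
    prev, diffs, count = None, [], 0
    while count < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diffs.append(float(np.abs(gray - prev).mean()))
        prev, count = gray, count + 1
    cap.release()
    return count, (sum(diffs) / len(diffs) if diffs else 0.0)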


Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. LSTM
  2. attention mechanism
  3. multi-modality fusion
  4. video captioning
  5. vision-language pre-training

Qualifiers

  • Short-paper

Funding Sources

  • National Natural Science Foundation of China

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


