Abstract
Since deep learning methods have achieved great success in both computer vision and natural language processing, video captioning, which builds on these two fields, has also attracted extensive attention. Video captioning is a challenging task that aims to present video content in natural language so that videos become easier to understand. Most current research in video captioning focuses on describing the behavior of the main objects in a video, emphasizing a holistic understanding of the content. As a result, most video captioning methods ignore the characteristics of smaller objects in the video, producing ambiguous, imprecise, or even fundamentally wrong descriptions. In this paper, we propose MSLR, a novel video captioning method that improves description accuracy by extracting object features at different granularities and preserving long-range temporal dependencies. Specifically, the proposed method performs convolution operations at different scales to obtain spatial features of different granularities and then fuses them into a unified spatial representation. On this basis, a temporal extraction network built from non-local blocks preserves the long-range dependencies of the video. Evaluated on two popular benchmark datasets, MSLR outperforms previous state-of-the-art methods, and the effectiveness of its components is verified through ablation experiments and qualitative text evaluation.
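For illustration only, the following is a minimal PyTorch-style sketch of the two ideas summarized above: multi-scale convolutions whose outputs are fused into one spatial representation, and a non-local block applied along the temporal axis to retain long-range dependencies. The module names, kernel sizes, and tensor shapes are assumptions for the sketch and do not reproduce the authors' MSLR implementation.

```python
# Illustrative sketch (not the authors' code): multi-scale spatial feature
# extraction fused into a single representation, followed by a non-local
# block that models long-range temporal dependencies across frames.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleSpatial(nn.Module):
    """Convolutions at several kernel sizes, fused by concatenation + 1x1 conv."""

    def __init__(self, in_ch, out_ch, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in scales]
        )
        self.fuse = nn.Conv2d(out_ch * len(scales), out_ch, 1)

    def forward(self, x):                      # x: (B*T, C, H, W)
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))


class NonLocalTemporal(nn.Module):
    """Self-attention-style non-local block over the temporal axis (cf. Wang et al.)."""

    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim // 2)
        self.phi = nn.Linear(dim, dim // 2)
        self.g = nn.Linear(dim, dim // 2)
        self.out = nn.Linear(dim // 2, dim)

    def forward(self, x):                      # x: (B, T, D) per-frame features
        q, k, v = self.theta(x), self.phi(x), self.g(x)
        attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return x + self.out(attn @ v)           # residual keeps the original signal


if __name__ == "__main__":
    frames = torch.randn(2 * 16, 256, 14, 14)            # 2 clips x 16 frames
    spatial = MultiScaleSpatial(256, 512)(frames)          # (32, 512, 14, 14)
    pooled = spatial.mean(dim=(2, 3)).view(2, 16, 512)     # pooled per-frame vectors
    temporal = NonLocalTemporal(512)(pooled)                # (2, 16, 512)
    print(temporal.shape)
```

In this sketch the multi-scale branches share the same output channel count so they can be concatenated and reduced by a 1x1 convolution; the non-local block adds a residual connection so long-range temporal context supplements rather than replaces the per-frame features.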



Data Availability
All datasets used in this work are publicly available and have been properly referenced in the text.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62262009 and 61902086, the Major R&D Project of Guangxi (AA22068071-3), the Guangxi Key Laboratory of Trusted Software (kx202054), the Open Foundation of the State Key Laboratory of Networking and Switching Technology in China (SKLNST-2022-1-04), and the Innovation Project of GUET Graduate Education (2022YCXS064).
Contributions
Z. Zhai, X. Chen, Y. Huang, and L. Zhao wrote the main manuscript text. B. Cheng and Q. He participated in the design of the MSLR network. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no relevant relationships.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhai, Z., Chen, X., Huang, Y. et al. Joint multi-scale information and long-range dependence for video captioning. Int J Multimed Info Retr 12, 37 (2023). https://doi.org/10.1007/s13735-023-00303-7