
Joint multi-scale information and long-range dependence for video captioning

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

Since deep learning methods have achieved great success in both computer vision and natural language processing, video captioning, which sits at the intersection of these two fields, has attracted extensive attention. Video captioning is a challenging task that aims to present video content in natural language in order to make videos easier to understand. Most current research in video captioning focuses on describing the behavior of the main objects in a video, emphasizing a holistic understanding of its content. As a result, many video captioning methods ignore the characteristics of smaller objects in the video, which leads to ambiguous, imprecise, or even fundamentally wrong descriptions. In this paper, a novel video captioning method, MSLR, is proposed, which improves the accuracy of video descriptions by extracting features of video objects at different granularities and preserving long-range temporal dependencies. Specifically, the proposed method performs convolution operations at different scales to obtain spatial features of different granularities and then fuses them into a unified spatial representation. On this basis, a temporal extraction network is constructed using non-local blocks to preserve the long-range dependencies of the video. Evaluated on two popular benchmark datasets, the experimental results demonstrate the superiority of MSLR over previous state-of-the-art methods, and the effectiveness of the MSLR components is verified through ablation experiments and textual evaluation.
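To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the two ideas it names: multi-scale convolution over frame feature maps fused into a single spatial representation, followed by a non-local (self-attention-style) block across frames to preserve long-range temporal dependencies. This is an illustration under assumed shapes and names (MultiScaleSpatialFusion, NonLocalTemporal, the kernel sizes, and the feature dimensions are all assumptions), not the authors' MSLR implementation.

```python
# Minimal sketch (not the paper's released code): multi-scale spatial fusion
# followed by a non-local block over the frame axis, assuming pre-extracted
# CNN feature maps for each sampled frame.
import torch
import torch.nn as nn


class MultiScaleSpatialFusion(nn.Module):
    """Convolve frame feature maps at several kernel sizes and fuse the results."""

    def __init__(self, in_channels: int, out_channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, k, padding=k // 2) for k in kernel_sizes
        )
        # A 1x1 convolution merges the concatenated branches into one representation.
        self.fuse = nn.Conv2d(out_channels * len(kernel_sizes), out_channels, 1)

    def forward(self, x):  # x: (batch * frames, C, H, W)
        fused = self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))
        return fused.mean(dim=(2, 3))  # pool to a per-frame vector


class NonLocalTemporal(nn.Module):
    """Embedded-Gaussian non-local block applied across the frames of a clip."""

    def __init__(self, dim: int):
        super().__init__()
        self.theta = nn.Linear(dim, dim // 2)
        self.phi = nn.Linear(dim, dim // 2)
        self.g = nn.Linear(dim, dim // 2)
        self.out = nn.Linear(dim // 2, dim)

    def forward(self, x):  # x: (batch, frames, dim)
        # Pairwise frame-to-frame affinities give every frame access to every other frame.
        attn = torch.softmax(self.theta(x) @ self.phi(x).transpose(1, 2), dim=-1)
        return x + self.out(attn @ self.g(x))  # residual connection keeps the original features


if __name__ == "__main__":
    frames = torch.randn(2 * 16, 256, 14, 14)         # 2 clips x 16 frames of feature maps
    spatial = MultiScaleSpatialFusion(256, 512)(frames).view(2, 16, 512)
    video_repr = NonLocalTemporal(512)(spatial)       # (2, 16, 512)
    print(video_repr.shape)
```

In a full captioning system, the resulting video representation would be fed to a sequence decoder (e.g., a recurrent or transformer language model) that generates the caption word by word.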


Data Availability

All datasets used in this work are publicly available and have been properly referenced in the text.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62262009 and 61902086, the Major R&D Project of Guangxi (AA22068071-3), the Guangxi Key Laboratory of Trusted Software (kx202054), the Open Foundation of the State Key Laboratory of Networking and Switching Technology in China (SKLNST-2022-1-04), and the Innovation Project of GUET Graduate Education (2022YCXS064).

Author information


Contributions

Z. Zhai, X. Chen, Y. Huang, and L. Zhao wrote the main manuscript text. B. Cheng and Q. He participated in the design of the MSLR network. All authors reviewed the manuscript.

Corresponding author

Correspondence to Lingzhong Zhao.

Ethics declarations

Conflict of interest

All authors disclosed no relevant relationships.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhai, Z., Chen, X., Huang, Y. et al. Joint multi-scale information and long-range dependence for video captioning. Int J Multimed Info Retr 12, 37 (2023). https://doi.org/10.1007/s13735-023-00303-7
