Video captioning: a review of theory, techniques and practices

  • Part of the collection: 1179: Multimedia Software Engineering: Challenges and Opportunities
  • Published in: Multimedia Tools and Applications

Abstract

Video captioning is now used extensively in applications for specially-abled persons, most notably the visually impaired. Advances in object detection and natural language processing have driven a surge of work fusing these two mainstream tasks. One result of this fusion is image captioning: an input image is fed to the system, which produces a short description of what the image contains. Originally developed for images, the approach was later extended to videos with some adjustments to the existing methods. This paper presents a survey of state-of-the-art video captioning methods. Researchers worldwide have contributed to this domain, so there was a need to compile, study, and analyze their results, which we do here in one comprehensive study. We compare video captioning methods across distinct datasets, evaluated on the parameters most commonly used for image and video analysis. The review covers methods from 2015 to 2019, year by year; for each year, the most commonly used datasets and evaluation metrics are shown in bar graphs and scatter plots for the respective evaluation parameters. Although video captioning has attracted considerable analysis and research, our survey shows that many open problems remain.
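
To make the surveyed pipeline concrete, the sketch below shows the encoder-decoder pattern that most reviewed methods build on: features extracted per frame by a pretrained CNN are pooled into a single video representation, which initializes an LSTM that decodes the caption word by word. This is a minimal illustrative sketch, not code from any surveyed paper; the feature dimension, vocabulary size, and mean pooling are simplifying assumptions chosen for brevity.

    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        """Mean-pooled CNN frame features conditioning an LSTM language decoder."""
        def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
            super().__init__()
            self.init_h = nn.Linear(feat_dim, hidden_dim)  # video feature -> initial hidden state
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, frame_feats, captions):
            # frame_feats: (batch, n_frames, feat_dim), e.g. pooled ResNet activations
            video = frame_feats.mean(dim=1)                 # simple temporal mean pooling
            h0 = torch.tanh(self.init_h(video)).unsqueeze(0)
            c0 = torch.zeros_like(h0)
            words = self.embed(captions)                    # (batch, seq_len, embed_dim)
            hidden, _ = self.lstm(words, (h0, c0))
            return self.out(hidden)                         # per-step vocabulary logits

    # Usage with random stand-in features: 2 clips, 16 frames each
    feats = torch.randn(2, 16, 2048)
    caps = torch.randint(0, 10000, (2, 8))                  # ground-truth caption token ids
    print(CaptionDecoder()(feats, caps).shape)              # torch.Size([2, 8, 10000])

The evaluation parameters the survey compares are the n-gram overlap metrics standard in captioning work, such as BLEU, METEOR, and ROUGE. As an illustration, BLEU between a candidate caption and a human reference can be computed with NLTK; the sentences below are invented for demonstration:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = ["a man is playing a guitar on stage".split()]  # tokenized reference caption
    candidate = "a man plays the guitar".split()                # tokenized system output

    # Smoothing avoids zero scores when higher-order n-grams have no matches.
    smooth = SmoothingFunction().method1
    print(f"BLEU: {sentence_bleu(reference, candidate, smoothing_function=smooth):.3f}")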

Author information

Corresponding author

Correspondence to Gopal Chaudhary.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jain, V., Al-Turjman, F., Chaudhary, G. et al. Video captioning: a review of theory, techniques and practices. Multimed Tools Appl 81, 35619–35653 (2022). https://doi.org/10.1007/s11042-021-11878-w
