Abstract
Video captioning is widely used today in applications that assist specially-abled persons, particularly those with visual impairments. Advances in object detection and natural language processing have spurred efforts to fuse these two mainstream tasks. One early result of this fusion was image captioning: an input image is fed to a system, which produces a short description of what is present in the image. With some tweaking of the existing methods, this image-based fusion was subsequently extended to videos. This paper presents a survey of state-of-the-art video captioning methods. Researchers worldwide have contributed to this domain, creating a need to compile, study, and analyze the results, which we do here in a comprehensive study. We compare video captioning methods on distinct datasets using the evaluation parameters most commonly applied to image and video analysis. The review covers methods from 2015 to 2019, year by year. For each year, the most commonly used datasets and evaluation metrics are also represented pictorially in a bar graph and scatter plot for the respective evaluation parameter. Although much analysis and research has been done on video captioning, our survey shows that many problems remain.
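The evaluation parameters discussed in this survey (BLEU, METEOR, ROUGE, CIDEr) all score a generated caption against human references. As a minimal illustration, the sketch below implements a simplified unigram BLEU (BLEU-1) with a brevity penalty; the example captions are invented for illustration and are not drawn from any dataset used in the survey.

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Simplified unigram BLEU: clipped precision times a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clipped unigram matches: a candidate word counts at most as often
    # as it appears in the reference.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty discourages captions shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("a man plays the guitar", "a man is playing a guitar"), 3))
```

Full BLEU additionally averages clipped n-gram precisions up to 4-grams; METEOR and CIDEr add stemming/synonym matching and TF-IDF weighting, respectively, which is why the surveyed papers typically report several metrics side by side.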
Jain, V., Al-Turjman, F., Chaudhary, G. et al. Video captioning: a review of theory, techniques and practices. Multimed Tools Appl 81, 35619–35653 (2022). https://doi.org/10.1007/s11042-021-11878-w