Abstract
Generating natural language descriptions for videos has been studied by many researchers, yet it remains an active research topic because it lies at the intersection of Computer Vision (CV), Natural Language Processing (NLP), and Deep Learning (DL). The results of video description systems are still not convincing, largely due to the redundancy introduced by the many similar frames in a video. In this paper, we propose a dual-stage text generation approach: the first stage reduces this redundancy by selecting a keyframe and a set of frames from each shot of the video, and the second stage generates relevant text for the video from these selected keyframes and frame sets. In the first stage, a flexible, novel shot boundary detection (SBD, or temporal boundary detection) approach segments the video into shots, and a keyframe and a set of frames are selected from each shot using a frame-selection policy. Spatio-temporal features for each segment and 2D features for each keyframe are then extracted using a 3D convolutional network and VGG19, respectively. These features are passed to the second stage, where they are embedded together with semantic concepts related to the video, and text is generated using a Long Short-Term Memory (LSTM) recurrent network. The proposed approach is an amalgamation of classical and modern computer vision techniques: in the first stage, the Noise-Resistant Local Binary Pattern (NRLBP) feature is used to detect illumination- and motion-invariant temporal boundaries in a video and to obtain the keyframes and frame sets used for text generation. The TRECVid 2001 and 2007 datasets are used to validate the accuracy of the proposed SBD approach, and the MSR-VTT (Microsoft Research Video to Text) and YouTube2Text (MSVD) datasets are used to analyze and validate the performance of the proposed video-to-text generation approach.
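To make stage one concrete, the sketch below detects shot boundaries by comparing texture-descriptor histograms of consecutive frames and then picks a keyframe plus an evenly spaced frame set from each shot. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: it substitutes scikit-image's plain uniform LBP for NRLBP, and the L1 histogram distance, the adaptive threshold (mean plus k standard deviations of consecutive-frame distances), and all function names are our own choices for illustration.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def texture_histogram(gray, P=8, R=1):
    # Stand-in texture descriptor: plain uniform LBP (59 bins for P=8).
    # The paper uses NRLBP here; see the Notes section for the refined code.
    codes = local_binary_pattern(gray, P, R, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=59, range=(0, 59), density=True)
    return hist

def detect_shot_boundaries(video_path, k=3.0):
    """Flag a cut where the consecutive-frame histogram distance is unusually large."""
    cap = cv2.VideoCapture(video_path)
    hists = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hists.append(texture_histogram(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)))
    cap.release()
    dists = np.array([np.abs(hists[i] - hists[i - 1]).sum()
                      for i in range(1, len(hists))])
    thresh = dists.mean() + k * dists.std()   # adaptive threshold (assumption)
    cuts = [i for i, d in enumerate(dists, start=1) if d > thresh]
    return hists, cuts

def select_frames(shot_start, shot_end, hists, n_frames=16):
    """Pick the most representative keyframe plus an evenly spaced frame set."""
    shot = hists[shot_start:shot_end]
    mean_hist = np.mean(shot, axis=0)
    keyframe = shot_start + int(np.argmin(
        [np.abs(h - mean_hist).sum() for h in shot]))
    frame_set = np.linspace(shot_start, shot_end - 1, n_frames, dtype=int).tolist()
    return keyframe, frame_set
```

Stage two can likewise be sketched as an LSTM decoder conditioned on the fused visual and semantic features. The feature dimensionalities (4096-d C3D plus 4096-d VGG19 fc features, a 300-d concept embedding) and the fuse-into-initial-state design are assumptions for illustration; the paper's actual semantic-concept embedding may differ.

```python
from tensorflow.keras import layers, Model

def build_caption_decoder(vocab_size=10000, max_len=20,
                          vid_dim=8192, sem_dim=300, hidden=512):
    # vid_dim assumes concatenated C3D (4096-d) and VGG19 (4096-d) features;
    # these sizes are not fixed by the abstract and are illustrative only.
    feats = layers.Input(shape=(vid_dim,), name="video_features")
    concepts = layers.Input(shape=(sem_dim,), name="semantic_concepts")
    # Fuse visual features with the semantic-concept embedding and use the
    # result as the LSTM's initial state, so generation is video-conditioned.
    state = layers.Dense(hidden, activation="tanh")(
        layers.Concatenate()([feats, concepts]))
    words = layers.Input(shape=(max_len,), name="previous_words")
    emb = layers.Embedding(vocab_size, hidden, mask_zero=True)(words)
    h = layers.LSTM(hidden)(emb, initial_state=[state, state])
    out = layers.Dense(vocab_size, activation="softmax")(h)  # next-word distribution
    return Model([feats, concepts, words], out)
```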
Notes
An implementation of the refined NRLBP is available at: https://github.com/Ashwani21/Local-texture-descriptors/
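For readers without access to the linked repository, the following is a minimal sketch of the core NRLBP idea (Ren, Jiang, and Yuan, IEEE TIP 2013): neighbors whose intensity difference from the center pixel falls below a small threshold are treated as uncertain bits, and uncertain bits are resolved so that the resulting pattern is uniform whenever possible. The threshold value and the brute-force enumeration below are simplifications, not the refined implementation linked above.

```python
from itertools import product

def _is_uniform(bits):
    """A pattern is uniform if it has at most two 0/1 transitions circularly."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n)) <= 2

def nrlbp_code(center, neighbors, threshold=3):
    """Simplified noise-resistant LBP code for one pixel.

    Bits whose neighbor-center difference is below `threshold` are
    'uncertain'; they are assigned so the pattern becomes uniform if any
    such assignment exists, otherwise the pixel falls into the single
    non-uniform bin (returned here as None).
    """
    bits, uncertain = [], []
    for i, g in enumerate(neighbors):
        d = int(g) - int(center)
        if abs(d) < threshold:
            bits.append(None)          # uncertain bit
            uncertain.append(i)
        else:
            bits.append(1 if d >= 0 else 0)
    # Try every assignment of the uncertain bits; keep the first uniform one.
    for assignment in product((0, 1), repeat=len(uncertain)):
        candidate = list(bits)
        for idx, b in zip(uncertain, assignment):
            candidate[idx] = b
        if _is_uniform(candidate):
            return sum(b << k for k, b in enumerate(candidate))
    return None  # no uniform completion: non-uniform bin

# Example: neighbors within `threshold` of the center (98, 99, 101) become
# uncertain bits and are resolved toward a uniform pattern.
print(nrlbp_code(100, [98, 120, 130, 125, 99, 60, 55, 101]))  # -> 14
```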
Acknowledgements
This work is supported by the Scheme for Promotion of Academic and Research Collaboration (SPARC), Project Code P995, No. SPARC/2018-2019/119/SL (IN), under MHRD, Government of India.
Cite this article
Singh, A., Singh, T.D. & Bandyopadhyay, S. V2T: video to text framework using a novel automatic shot boundary detection algorithm. Multimed Tools Appl 81, 17989–18009 (2022). https://doi.org/10.1007/s11042-022-12343-y