Abstract
Generating natural language descriptions for videos has been studied by many researchers, yet it remains an active research topic because it lies at the intersection of Computer Vision (CV), Natural Language Processing (NLP), and Deep Learning (DL). The results of video description systems are still not convincing, largely due to the redundancy introduced by the many similar frames in a video. In this paper, we propose a dual-stage text generation approach: the first stage reduces this redundancy by selecting a keyframe and a set of frames from each shot of the video, and the second stage generates relevant text for the video from these selected keyframes and frame sets. In the first stage, a flexible, novel shot boundary detection (SBD, or temporal boundary detection) approach segments the video into shots, and a keyframe and a set of frames are selected from each shot using a frame-selection policy. Spatio-temporal features for each segment and 2D features for each keyframe are then extracted using a 3D convolutional network and VGG19, respectively. These features are passed to the second stage, where they are embedded together with semantic concepts related to the video, and text is generated using a Long Short-Term Memory (LSTM) recurrent network. The proposed approach is an amalgamation of classical and modern computer vision techniques: in the first stage, the Noise-Resistant Local Binary Pattern (NRLBP) feature is used to detect illumination- and motion-invariant temporal boundaries in a video and to obtain the keyframes and frame sets used for text generation. The TRECVid 2001 and 2007 datasets are used to validate the accuracy of the proposed SBD approach, and the MSR-VTT (Microsoft Research Video to Text) and YouTube2Text (MSVD) datasets are used to analyze and validate the performance of the proposed video-to-text generation approach.
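To make stage one concrete, the sketch below detects shot boundaries by comparing texture-descriptor histograms of consecutive frames and then picks a keyframe plus an evenly spaced frame set from each shot. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: it substitutes scikit-image's plain uniform LBP for NRLBP, and the L1 histogram distance, the adaptive threshold (mean plus k standard deviations of consecutive-frame distances), and all function names are our own choices for illustration.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def texture_histogram(gray, P=8, R=1):
    # Stand-in texture descriptor: plain uniform LBP (59 bins for P=8).
    # The paper uses NRLBP here; see the Notes section for the refined code.
    codes = local_binary_pattern(gray, P, R, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=59, range=(0, 59), density=True)
    return hist

def detect_shot_boundaries(video_path, k=3.0):
    """Flag a cut where the consecutive-frame histogram distance is unusually large."""
    cap = cv2.VideoCapture(video_path)
    hists = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hists.append(texture_histogram(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)))
    cap.release()
    dists = np.array([np.abs(hists[i] - hists[i - 1]).sum()
                      for i in range(1, len(hists))])
    thresh = dists.mean() + k * dists.std()   # adaptive threshold (assumption)
    cuts = [i for i, d in enumerate(dists, start=1) if d > thresh]
    return hists, cuts

def select_frames(shot_start, shot_end, hists, n_frames=16):
    """Pick the most representative keyframe plus an evenly spaced frame set."""
    shot = hists[shot_start:shot_end]
    mean_hist = np.mean(shot, axis=0)
    keyframe = shot_start + int(np.argmin(
        [np.abs(h - mean_hist).sum() for h in shot]))
    frame_set = np.linspace(shot_start, shot_end - 1, n_frames, dtype=int).tolist()
    return keyframe, frame_set
```

Stage two can likewise be sketched as an LSTM decoder conditioned on the fused visual and semantic features. The feature dimensionalities (4096-d C3D plus 4096-d VGG19 fc features, a 300-d concept embedding) and the fuse-into-initial-state design are assumptions for illustration; the paper's actual semantic-concept embedding may differ.

```python
from tensorflow.keras import layers, Model

def build_caption_decoder(vocab_size=10000, max_len=20,
                          vid_dim=8192, sem_dim=300, hidden=512):
    # vid_dim assumes concatenated C3D (4096-d) and VGG19 (4096-d) features;
    # these sizes are not fixed by the abstract and are illustrative only.
    feats = layers.Input(shape=(vid_dim,), name="video_features")
    concepts = layers.Input(shape=(sem_dim,), name="semantic_concepts")
    # Fuse visual features with the semantic-concept embedding and use the
    # result as the LSTM's initial state, so generation is video-conditioned.
    state = layers.Dense(hidden, activation="tanh")(
        layers.Concatenate()([feats, concepts]))
    words = layers.Input(shape=(max_len,), name="previous_words")
    emb = layers.Embedding(vocab_size, hidden, mask_zero=True)(words)
    h = layers.LSTM(hidden)(emb, initial_state=[state, state])
    out = layers.Dense(vocab_size, activation="softmax")(h)  # next-word distribution
    return Model([feats, concepts, words], out)
```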
Notes
An implementation of the refined NRLBP is available at: https://github.com/Ashwani21/Local-texture-descriptors/
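For readers without access to the linked repository, the following is a minimal sketch of the core NRLBP idea (Ren, Jiang, and Yuan, IEEE TIP 2013): neighbors whose intensity difference from the center pixel falls below a small threshold are treated as uncertain bits, and uncertain bits are resolved so that the resulting pattern is uniform whenever possible. The threshold value and the brute-force enumeration below are simplifications, not the refined implementation linked above.

```python
from itertools import product

def _is_uniform(bits):
    """A pattern is uniform if it has at most two 0/1 transitions circularly."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n)) <= 2

def nrlbp_code(center, neighbors, threshold=3):
    """Simplified noise-resistant LBP code for one pixel.

    Bits whose neighbor-center difference is below `threshold` are
    'uncertain'; they are assigned so the pattern becomes uniform if any
    such assignment exists, otherwise the pixel falls into the single
    non-uniform bin (returned here as None).
    """
    bits, uncertain = [], []
    for i, g in enumerate(neighbors):
        d = int(g) - int(center)
        if abs(d) < threshold:
            bits.append(None)          # uncertain bit
            uncertain.append(i)
        else:
            bits.append(1 if d >= 0 else 0)
    # Try every assignment of the uncertain bits; keep the first uniform one.
    for assignment in product((0, 1), repeat=len(uncertain)):
        candidate = list(bits)
        for idx, b in zip(uncertain, assignment):
            candidate[idx] = b
        if _is_uniform(candidate):
            return sum(b << k for k, b in enumerate(candidate))
    return None  # no uniform completion: non-uniform bin

# Example: neighbors within `threshold` of the center (98, 99, 101) become
# uncertain bits and are resolved toward a uniform pattern.
print(nrlbp_code(100, [98, 120, 130, 125, 99, 60, 55, 101]))  # -> 14
```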
Acknowledgements
This work is supported by the Scheme for Promotion of Academic and Research Collaboration (SPARC), Project Code P995, No. SPARC/2018-2019/119/SL (IN), under MHRD, Government of India.
Cite this article
Singh, A., Singh, T.D. & Bandyopadhyay, S. V2T: video to text framework using a novel automatic shot boundary detection algorithm. Multimed Tools Appl 81, 17989–18009 (2022). https://doi.org/10.1007/s11042-022-12343-y