
V2T: video to text framework using a novel automatic shot boundary detection algorithm

  • Published in: Multimedia Tools and Applications

Abstract

Generating natural language descriptions for videos has been studied by many researchers, yet it remains a challenging research topic at the intersection of Computer Vision (CV), Natural Language Processing (NLP) and Deep Learning (DL). The quality of generated descriptions still suffers from the redundancy introduced by the large number of near-identical frames in a video. In this paper, we propose a dual-stage text generation approach: the first stage reduces this redundancy by selecting a keyframe and a set of representative frames from each shot of the video, and the second stage generates relevant text for the video from these selected frames and keyframes. In the first stage, a flexible, novel shot boundary detection (SBD, i.e. temporal boundary detection) approach segments the video into shots, after which a keyframe and a set of frames are chosen from each shot using a frame selection policy. Spatio-temporal features for each segment and 2D features for each keyframe are then extracted using a 3D convolutional network and VGG19, respectively. These features are passed to the second stage, where they are embedded together with semantic concepts related to the video and text is generated by a Long Short-Term Memory (LSTM) recurrent network. The proposed approach combines classical and modern computer vision techniques: in the first stage, the Noise-Resistant Local Binary Pattern (NRLBP) descriptor is used to detect temporal boundaries that are invariant to illumination and motion, and to produce the keyframes and frame sets used for subsequent text generation. The TRECVid 2001 and 2007 datasets are used to validate the accuracy of the proposed SBD approach, while the MSR-VTT (Microsoft Research Video to Text) and YouTube2text (MSVD) datasets are used to analyze and validate the performance of the proposed video-to-text generation approach.
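For illustration, the following minimal sketch (PyTorch) shows how a second-stage text generator of the kind described above can be realised: per-shot spatio-temporal (3D-CNN) and keyframe (VGG19) features are fused and fed to an LSTM decoder. This is not the authors' implementation; the feature dimensions, vocabulary size and fusion by simple concatenation are assumptions made only to keep the sketch self-contained, and the semantic-concept embedding mentioned in the abstract is omitted.

import torch
import torch.nn as nn


class ShotCaptionDecoder(nn.Module):
    """Fuses per-shot 3D-CNN and VGG19 keyframe features and decodes a caption."""

    def __init__(self, vocab_size=10000, c3d_dim=4096, vgg_dim=4096,
                 embed_dim=512, hidden_dim=512):
        super().__init__()
        # Project the concatenated visual features into the word-embedding space.
        self.visual_proj = nn.Linear(c3d_dim + vgg_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, c3d_feat, vgg_feat, captions):
        # c3d_feat: (B, c3d_dim), vgg_feat: (B, vgg_dim), captions: (B, T) token ids.
        visual = self.visual_proj(torch.cat([c3d_feat, vgg_feat], dim=1))
        words = self.word_embed(captions)                        # (B, T, embed_dim)
        # The fused visual embedding is fed as the first decoder input.
        inputs = torch.cat([visual.unsqueeze(1), words], dim=1)  # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                  # (B, T+1, vocab_size)


# Toy forward pass with random tensors standing in for extracted features.
decoder = ShotCaptionDecoder()
logits = decoder(torch.randn(2, 4096), torch.randn(2, 4096),
                 torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])

In practice the decoder would be trained with teacher forcing against reference captions and decoded greedily or with beam search at inference time.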



Notes

  1. An implementation of the refined NRLBP is available at: https://github.com/Ashwani21/Local-texture-descriptors/ (a simplified illustrative sketch of texture-histogram-based boundary detection is given after these notes)

  2. https://github.com/tylin/coco-caption
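As referenced in note 1, the sketch below illustrates how a texture descriptor of this family can flag candidate shot boundaries: each frame is encoded with a thresholded (noise-tolerant) 3x3 LBP-style code, and a large chi-square distance between consecutive frame histograms is treated as a candidate cut. This is not the refined NRLBP from the repository above nor the paper's full SBD algorithm; the noise threshold, the chi-square measure and the cut threshold are assumptions made for illustration only.

import numpy as np


def noise_tolerant_lbp(gray, tau=3):
    """8-bit LBP-style code image; neighbours within +/-tau of the centre
    are treated as 'no change' and contribute a 0 bit (noise tolerance)."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        n = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= ((n - c) > tau).astype(np.int32) << bit
    return code


def lbp_histogram(gray):
    # Normalised 256-bin histogram of the code image.
    hist, _ = np.histogram(noise_tolerant_lbp(gray), bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)


def candidate_boundaries(frames, cut_threshold=0.5):
    """Flag frame indices where the chi-square distance between consecutive
    frame histograms exceeds cut_threshold."""
    hists = [lbp_histogram(f) for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        d = 0.5 * np.sum((hists[i] - hists[i - 1]) ** 2 /
                         (hists[i] + hists[i - 1] + 1e-10))
        if d > cut_threshold:
            cuts.append(i)
    return cuts


# Toy example: a noisy flat "shot" followed by a gradient "shot".
rng = np.random.default_rng(0)
flat = np.full((64, 64), 120, dtype=np.int32)
grad = np.tile(np.arange(0, 256, 4, dtype=np.int32), (64, 1))
shot_a = [flat + rng.integers(0, 3, (64, 64)) for _ in range(5)]  # noise stays below tau
shot_b = [grad for _ in range(5)]
print(candidate_boundaries(shot_a + shot_b))  # expected output: [5]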


Acknowledgements

This work is supported by the Scheme for Promotion of Academic and Research Collaboration (SPARC), Project Code P995, No. SPARC/2018-2019/119/SL (IN), under MHRD, Govt. of India.

Author information

Corresponding author

Correspondence to Alok Singh.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Singh, A., Singh, T.D. & Bandyopadhyay, S. V2T: video to text framework using a novel automatic shot boundary detection algorithm. Multimed Tools Appl 81, 17989–18009 (2022). https://doi.org/10.1007/s11042-022-12343-y



  • DOI: https://doi.org/10.1007/s11042-022-12343-y
