
Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM

Published in: Multimedia Tools and Applications

Abstract

The principal objective of video/image captioning is to describe the dynamics of a video clip in plain natural language. Captioning is motivated by its ability to make video content accessible to deaf and hard-of-hearing individuals, to help viewers focus on and recall information more readily, and to enable viewing in sound-sensitive locations. The most frequently adopted design paradigm is the structurally enhanced encoder-decoder configuration, and recent work emphasizes creative structural modifications that maximize efficiency while demonstrating viability in real-world applications. Well-researched advances such as deep Convolutional Neural Networks (CNNs) and sentence Transformers are increasingly incorporated into encoder-decoders. This paper proposes an approach for efficiently captioning videos using a CNN and a short-connected LSTM-based encoder-decoder model blended with a sentence context vector, which emphasizes the relationship between the video and text spaces. Inspired by the human visual system, an attention mechanism is used to selectively concentrate on the context of the important frames. A contextual hybrid embedding block is also presented for connecting the two vector spaces generated during the encoding and decoding stages. The proposed architecture is investigated with well-known CNN architectures and various word embeddings, and is assessed on two benchmark video captioning datasets, MSVD and MSR-VTT, using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr. In the experimental evaluation, when the proposed model with NASNet-Large is compared across all three embeddings, BERT performs better on the MSVD dataset than the other two embeddings. For feature extraction, Inception-v4 outperforms VGG-16, ResNet-152, and NASNet-Large. Among the word embeddings, BERT is far superior to ELMo and GloVe on the MSR-VTT dataset.
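
To make the pipeline described in the abstract concrete, the following minimal PyTorch sketch shows one way the pieces could fit together: precomputed CNN frame features, soft attention over frames, and an LSTM decoder whose input is blended with a sentence context vector, with a residual skip standing in for the "short connection". The AttentiveCaptionDecoder class, all layer sizes, and the exact form of the short connection and sentence-vector fusion are assumptions made for illustration; this is not the authors' implementation.

# Illustrative sketch only (not the paper's implementation): CNN-feature
# encoder input, soft attention over frames, and an LSTM decoder blended
# with a precomputed sentence context vector. Layer sizes and the "short
# connection" (a residual skip from the attended video context to the
# output projection) are assumptions made for this sketch.
import torch
import torch.nn as nn

class AttentiveCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=1536, sent_dim=768, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embedding (GloVe/ELMo/BERT in the paper)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)    # additive-style attention score per frame
        self.lstm = nn.LSTMCell(embed_dim + feat_dim + sent_dim, hidden_dim)
        self.skip = nn.Linear(feat_dim, hidden_dim)         # assumed "short connection" from video context
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, sent_vec, captions):
        # frame_feats: (B, T, feat_dim) CNN features; sent_vec: (B, sent_dim); captions: (B, L) token ids
        B, T, _ = frame_feats.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            # attention weights over frames, conditioned on the previous hidden state
            scores = self.attn(torch.cat([frame_feats, h.unsqueeze(1).expand(-1, T, -1)], dim=-1))
            alpha = torch.softmax(scores, dim=1)             # (B, T, 1)
            ctx = (alpha * frame_feats).sum(dim=1)           # attended video context, (B, feat_dim)
            x = torch.cat([self.embed(captions[:, t]), ctx, sent_vec], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h + self.skip(ctx)))      # residual skip before the vocabulary projection
        return torch.stack(logits, dim=1)                    # (B, L, vocab_size)

# Toy usage: random tensors stand in for CNN frame features and a sentence vector.
model = AttentiveCaptionDecoder(vocab_size=1000)
feats = torch.randn(2, 20, 1536)       # e.g. pooled per-frame CNN features
svec = torch.randn(2, 768)             # e.g. a BERT-style sentence vector
caps = torch.randint(0, 1000, (2, 12))
print(model(feats, svec, caps).shape)  # torch.Size([2, 12, 1000])

In practice the frame features would be extracted with a backbone such as Inception-v4 or NASNet-Large and the sentence vector from the reference captions; here they are random placeholders so the sketch runs stand-alone.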


Data Availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.


Funding

No funds, grants, or other support was received.

Author information

Corresponding author

Correspondence to Dinesh Naik.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Naik, D., C D, J. Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM. Multimed Tools Appl 83, 11187–11213 (2024). https://doi.org/10.1007/s11042-023-15978-7

