Abstract
The principal objective of video captioning is to describe the dynamics of a video clip in plain natural language. Captioning is motivated by its ability to make videos more accessible to deaf and hard-of-hearing individuals, to help viewers focus on and recall information more readily, and to allow videos to be watched in sound-sensitive locations. The most frequently adopted design paradigm is the encoder-decoder configuration, and recent work emphasizes structural modifications to this configuration that improve efficiency while demonstrating viability in real-world applications. Well-researched advances such as deep Convolutional Neural Networks (CNNs) and sentence transformers are increasingly used within encoder-decoders. This paper proposes an approach for efficiently captioning videos using a CNN and a short-connected LSTM-based encoder-decoder model blended with a sentence context vector, which emphasizes the relationship between the video and text spaces. Inspired by the human visual system, an attention mechanism is used to concentrate selectively on the context of the important frames. A contextual hybrid embedding block is also presented for connecting the two vector spaces generated during the encoding and decoding stages. The proposed architecture is investigated with well-known CNN architectures and various word embeddings, and is assessed on two benchmark video captioning datasets, MSVD and MSR-VTT, using the standard evaluation metrics BLEU, METEOR, ROUGE, and CIDEr. Experimentally, when the proposed model with NASNet-Large is compared across all three embeddings, BERT performs better than the other two embeddings on the MSVD dataset. For feature extraction, Inception-v4 outperforms VGG-16, ResNet-152, and NASNet-Large. Among the word embeddings, BERT is clearly superior to ELMo and GloVe on the MSR-VTT dataset.
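To make the described pipeline concrete, the following is a minimal PyTorch sketch, not the authors' released code: pre-extracted CNN frame features are attended over at each decoding step, and the LSTM decoder input fuses the word embedding, the attended visual context, and a sentence context vector. All module names, dimensions, and the fusion-by-concatenation choice are illustrative assumptions; the short connections and the contextual hybrid embedding block are omitted for brevity.

import torch
import torch.nn as nn

class AttnCaptioner(nn.Module):
    # Hypothetical attention-based captioner over pre-extracted CNN frame features.
    def __init__(self, feat_dim=1536, hid_dim=512, emb_dim=300, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # stand-in for GloVe/ELMo/BERT vectors
        self.attn = nn.Linear(feat_dim + hid_dim, 1)    # additive attention score per frame
        self.lstm = nn.LSTMCell(emb_dim + feat_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, frame_feats, captions, sent_ctx):
        # frame_feats: (B, T, feat_dim) CNN features, e.g. Inception-v4 pooled outputs
        # captions: (B, L) token ids; sent_ctx: (B, hid_dim) sentence context vector
        B, T, _ = frame_feats.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = frame_feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            # score every frame against the current decoder state, then pool
            h_rep = h.unsqueeze(1).expand(-1, T, -1)
            weights = self.attn(torch.cat([frame_feats, h_rep], dim=-1)).softmax(dim=1)
            ctx = (weights * frame_feats).sum(dim=1)  # (B, feat_dim) attended visual context
            step_in = torch.cat([self.embed(captions[:, t]), ctx, sent_ctx], dim=-1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, L, vocab_size)

Under this sketch, training would use teacher forcing, minimizing cross-entropy between the per-step logits and the reference caption tokens.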
Data Availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Funding
No funds, grants, or other support was received.
Ethics declarations
Conflicts of interest
The authors have no competing interests to declare.
About this article
Cite this article
Naik, D., C D, J. Video Captioning using Sentence Vector-enabled Convolutional Framework with Short-Connected LSTM. Multimed Tools Appl 83, 11187–11213 (2024). https://doi.org/10.1007/s11042-023-15978-7