Advancing image captioning with V16HP1365 encoder and dual self-attention network

Abstract

Image captioning generates a textual description of an input image by combining computer vision and natural language processing. In recent years, deep learning approaches have shown promise in image captioning. This research introduces a novel image captioning architecture comprising a dual self-attention fused encoder-decoder framework. The VGG16 Hybrid Places 1365 (V16HP1365) encoder captures diverse visual features from images, enhancing the quality of image representations, while a Gated Recurrent Unit (GRU) decoder performs word-level language modeling. The dual self-attention network embedded in the architecture captures contextual image information, supporting accurate content descriptions and relationship understanding. Experimental evaluations on the COCO dataset show superior performance, surpassing existing methods on captioning quality metrics. The approach holds potential for applications such as aiding the visually impaired and advancing content retrieval. Future work aims to extend the model to multilingual captioning.
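The full implementation is not reproduced on this page; the block below is only a minimal sketch of the pipeline the abstract describes, assuming a PyTorch implementation. torchvision's stock VGG16 stands in for the V16HP1365 (VGG16 Hybrid Places 1365) backbone, two stacked nn.MultiheadAttention blocks approximate the dual self-attention network, and nn.GRU serves as the word-level decoder; all module names and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' released code): a dual self-attention
# encoder-decoder captioner. torchvision's VGG16 stands in for the V16HP1365
# (VGG16 Hybrid Places 1365) backbone; all hyperparameters are illustrative.
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights


class DualSelfAttentionCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, heads=8):
        super().__init__()
        # Convolutional backbone kept as a spatial feature extractor
        # (224x224 input -> 7x7x512 feature map).
        self.backbone = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        self.proj = nn.Linear(512, embed_dim)
        # Two stacked self-attention blocks over image regions ("dual" self-attention).
        self.attn1 = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        # GRU decoder for word-level language modeling.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode: (B, 512, 7, 7) -> (B, 49, embed_dim) region features.
        feats = self.backbone(images).flatten(2).transpose(1, 2)
        feats = self.proj(feats)
        feats, _ = self.attn1(feats, feats, feats)  # first self-attention pass
        feats, _ = self.attn2(feats, feats, feats)  # second self-attention pass
        context = feats.mean(dim=1, keepdim=True)   # pooled image context (B, 1, D)
        # Decode with teacher forcing: each word embedding is concatenated
        # with the attended image context before the GRU step.
        words = self.embed(captions)                           # (B, T, D)
        ctx = context.expand(-1, words.size(1), -1)            # (B, T, D)
        hidden, _ = self.gru(torch.cat([words, ctx], dim=-1))  # (B, T, H)
        return self.out(hidden)                                # per-step vocab logits
```

Such a model would typically be trained with cross-entropy loss on COCO image-caption pairs and scored with standard captioning metrics such as BLEU, METEOR, and ROUGE.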

Data availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Code availability

Not applicable.


Author information

Contributions

TJ agreed on the content of the study. TJ, MP, and PT collected all the data for analysis. TJ agreed on the methodology. TJ, MP, and PT completed the analysis based on the agreed steps. The results and conclusions were discussed and written jointly. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tarun Jaiswal.

Ethics declarations

Human and animal rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (RAR 3 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jaiswal, T., Pandey, M. & Tripathi, P. Advancing image captioning with V16HP1365 encoder and dual self-attention network. Multimed Tools Appl 83, 80701–80725 (2024). https://doi.org/10.1007/s11042-024-18467-7


  • DOI: https://doi.org/10.1007/s11042-024-18467-7

Keywords