Advancing image captioning with V16HP1365 encoder and dual self-attention network

Abstract

Image captioning generates a textual description of an input image by combining computer vision and natural language processing. In recent years, deep learning approaches have shown promise in image captioning. This research introduces a novel image captioning architecture comprising a dual self-attention fused encoder-decoder framework. The VGG16 Hybrid Places 1365 (V16HP1365) encoder captures diverse visual features from images, enhancing the quality of image representations, while a Gated Recurrent Unit (GRU) decoder performs word-level language modeling. The dual self-attention network embedded in the architecture captures contextual image information, supporting accurate content descriptions and relationship understanding. Experimental evaluations on the COCO dataset show superior performance, surpassing existing methods on captioning quality metrics. The approach holds potential for applications such as aiding the visually impaired and advancing content retrieval. Future work aims to extend the model to multilingual captioning.
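The full implementation is not reproduced on this page; the block below is only a minimal sketch of the pipeline the abstract describes, assuming a PyTorch implementation. torchvision's stock VGG16 stands in for the V16HP1365 (VGG16 Hybrid Places 1365) backbone, two stacked nn.MultiheadAttention blocks approximate the dual self-attention network, and nn.GRU serves as the word-level decoder; all module names and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' released code): a dual self-attention
# encoder-decoder captioner. torchvision's VGG16 stands in for the V16HP1365
# (VGG16 Hybrid Places 1365) backbone; all hyperparameters are illustrative.
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights


class DualSelfAttentionCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, heads=8):
        super().__init__()
        # Convolutional backbone kept as a spatial feature extractor
        # (224x224 input -> 7x7x512 feature map).
        self.backbone = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        self.proj = nn.Linear(512, embed_dim)
        # Two stacked self-attention blocks over image regions ("dual" self-attention).
        self.attn1 = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        # GRU decoder for word-level language modeling.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode: (B, 512, 7, 7) -> (B, 49, embed_dim) region features.
        feats = self.backbone(images).flatten(2).transpose(1, 2)
        feats = self.proj(feats)
        feats, _ = self.attn1(feats, feats, feats)  # first self-attention pass
        feats, _ = self.attn2(feats, feats, feats)  # second self-attention pass
        context = feats.mean(dim=1, keepdim=True)   # pooled image context (B, 1, D)
        # Decode with teacher forcing: each word embedding is concatenated
        # with the attended image context before the GRU step.
        words = self.embed(captions)                           # (B, T, D)
        ctx = context.expand(-1, words.size(1), -1)            # (B, T, D)
        hidden, _ = self.gru(torch.cat([words, ctx], dim=-1))  # (B, T, H)
        return self.out(hidden)                                # per-step vocab logits
```

Such a model would typically be trained with cross-entropy loss on COCO image-caption pairs and scored with standard captioning metrics such as BLEU, METEOR, and ROUGE.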

Data availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Code availability

Not applicable.


Author information

Contributions

TJ agreed on the content of the study. TJ, MP, and PT collected all the data for analysis. TJ agreed on the methodology. TJ, MP, and PT completed the analysis based on the agreed steps. The results and conclusions were discussed and written jointly. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Tarun Jaiswal.

Ethics declarations

Human and animal rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (RAR 3 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jaiswal, T., Pandey, M. & Tripathi, P. Advancing image captioning with V16HP1365 encoder and dual self-attention network. Multimed Tools Appl 83, 80701–80725 (2024). https://doi.org/10.1007/s11042-024-18467-7


  • DOI: https://doi.org/10.1007/s11042-024-18467-7

Keywords