
Hybrid explainable image caption generation using image processing and natural language processing

  • ORIGINAL ARTICLE
International Journal of System Assurance Engineering and Management

Abstract

Image caption generation is among the most rapidly growing research areas, combining image processing methodologies with natural language processing (NLP) techniques. The effective combination of image processing and NLP can revolutionize content creation, media analysis, and accessibility. This study proposes a novel model that generates image captions automatically by combining visual and linguistic features: a Convolutional Neural Network (CNN) extracts visual features from the image, and a Long Short-Term Memory (LSTM) network models the linguistic features to generate the caption text. The proposed model is trained on the Microsoft Common Objects in Context (MS COCO) dataset, which contains over 330,000 images with corresponding captions. A comprehensive evaluation of several model combinations, including VGGNet + LSTM, ResNet + LSTM, GoogleNet + LSTM, VGGNet + RNN, AlexNet + RNN, and AlexNet + LSTM, was conducted across different batch sizes and learning rates, using the BLEU-2, METEOR, ROUGE-L, and CIDEr metrics. The proposed method demonstrated competitive performance, suggesting its potential for further exploration and refinement. These findings underscore the importance of careful parameter tuning and model selection in image captioning tasks.
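The encoder–decoder pattern the abstract describes can be sketched in miniature: pooled CNN activations initialize an LSTM, which then emits one word id per step by greedy decoding. This is a minimal illustrative sketch, not the authors' implementation; the dimensions, the `TinyCaptioner` class, and the random weights are all hypothetical stand-ins (real systems use, e.g., 2048-d ResNet features, 512-d hidden states, and trained parameters).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny dimensions for illustration only.
FEAT, HID, VOCAB = 8, 6, 5  # image-feature, hidden-state, vocabulary sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyCaptioner:
    """Encoder-decoder sketch: CNN features seed the LSTM, which emits words."""

    def __init__(self):
        # Encoder side: project CNN features into the LSTM's initial hidden state.
        self.W_init = rng.normal(size=(HID, FEAT)) * 0.1
        # Decoder side: word embeddings and LSTM gate weights over [x; h].
        self.embed = rng.normal(size=(VOCAB, HID)) * 0.1
        self.W = rng.normal(size=(4 * HID, 2 * HID)) * 0.1
        self.b = np.zeros(4 * HID)
        # Output layer: hidden state -> vocabulary logits.
        self.W_out = rng.normal(size=(VOCAB, HID)) * 0.1

    def lstm_step(self, x, h, c):
        # One LSTM cell update: input, forget, output gates and candidate cell.
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def caption(self, image_features, max_len=4):
        """Greedy decoding: at each step emit the highest-scoring word id."""
        h = np.tanh(self.W_init @ image_features)  # CNN features -> initial state
        c = np.zeros(HID)
        word, out = 0, []  # word id 0 acts as the <start> token here
        for _ in range(max_len):
            h, c = self.lstm_step(self.embed[word], h, c)
            word = int(np.argmax(self.W_out @ h))
            out.append(word)
        return out

features = rng.normal(size=FEAT)  # stand-in for pooled CNN activations
print(TinyCaptioner().caption(features))  # a short list of word ids
```

With trained weights, the greedy loop would be replaced by beam search and the word ids mapped back to vocabulary tokens; the data flow, however, is the same as in the paper's CNN + LSTM pipeline.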





Funding

This research received no external funding.

Author information


Corresponding author

Correspondence to Atul Mishra.

Ethics declarations

Conflicts of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mishra, A., Agrawal, A. & Bhasker, S. Hybrid explainable image caption generation using image processing and natural language processing. Int J Syst Assur Eng Manag 15, 4874–4884 (2024). https://doi.org/10.1007/s13198-024-02495-5

