
A novel automatic image caption generation using bidirectional long-short term memory framework

Published in: Multimedia Tools and Applications

Abstract

Image captioning, the task of generating a textual description of an image, has emerged as an active research topic owing to its practical importance in many domains. It is challenging because it draws on both Natural Language Processing and Computer Vision to generate captions. Although the literature reports notable image captioning methodologies, they still fall short of substantial performance across diverse datasets. This paper proposes an image caption generation mechanism based on an optimized Bidirectional Long Short-Term Memory (B-LSTM) model. We propose a variant of Moth Flame Optimization, termed Proposed Moth Flame Optimization (PMFO), whose logarithmic spiral update is based on correlation. The performance of the proposed model is demonstrated on benchmark datasets such as Flickr8k, Flickr30k, VizWiz and COCO, using well-known metrics such as CIDEr, BLEU, SPICE and ROUGE. The performance analysis shows that the B-LSTM model achieves better caption generation than state-of-the-art methods.
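The logarithmic spiral update at the heart of Moth Flame Optimization can be sketched as follows. This is the standard MFO update from Mirjalili's formulation (M' = D·e^{bt}·cos(2πt) + F, with D = |F − M| and t drawn from U(−1, 1)); the correlation-based modification that defines PMFO is specific to this paper and is not reproduced here, so the function name and parameters below are illustrative only.

```python
import numpy as np

def spiral_update(moth, flame, b=1.0, rng=None):
    """Standard MFO logarithmic-spiral position update:
    M' = D * exp(b*t) * cos(2*pi*t) + F, with D = |F - M| and t ~ U(-1, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    d = np.abs(flame - moth)                      # distance of the moth from its flame
    t = rng.uniform(-1.0, 1.0, size=moth.shape)   # spiral parameter, one per dimension
    return d * np.exp(b * t) * np.cos(2.0 * np.pi * t) + flame
```

Because the distance term scales the spiral, a moth already sitting on its flame stays there, while distant moths take larger exploratory steps.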


Abbreviations

LSTM: Long Short-Term Memory

B-LSTM: Bidirectional Long Short-Term Memory

PMFO: Proposed Moth Flame Optimization

AI: Artificial Intelligence

NLP: Natural Language Processing

RNN: Recurrent Neural Network

CNN: Convolutional Neural Network

NN: Neural Network

SGC: Scene Graph Captioner

TA-LSTM: Triple Attention LSTM

VD-SAN: Visual-Densely Semantic Attention Network

DenseNet: Dense Convolutional Network

gLSTM: guidance LSTM

PIL: Python Imaging Library

c-RNN: character-level RNN
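The bidirectional LSTM that gives the B-LSTM its name processes the input sequence in both directions and concatenates the two hidden states at each step. The sketch below is a generic, minimal B-LSTM in plain NumPy, not the paper's trained architecture; the weight shapes and gate ordering are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(xs, W, U, b):
    """Single-layer LSTM over xs of shape (T, d_in); returns hidden states (T, d_h).
    Stacked gate order in W (4*d_h, d_in), U (4*d_h, d_h), b (4*d_h,):
    input, forget, output, candidate."""
    d_h = U.shape[1]
    h, c = np.zeros(d_h), np.zeros(d_h)
    hs = []
    for x in xs:
        i, f, o, g = np.split(W @ x + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell state update
        h = sigmoid(o) * np.tanh(c)                    # hidden state output
        hs.append(h)
    return np.array(hs)

def bilstm(xs, fwd_params, bwd_params):
    """Bidirectional LSTM: a forward pass plus a pass over the reversed sequence,
    re-reversed so step t of both directions aligns with input t, then
    concatenated into a (T, 2*d_h) representation."""
    hf = lstm_forward(xs, *fwd_params)
    hb = lstm_forward(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([hf, hb], axis=1)
```

Concatenating the two directions lets each output position see both past and future context, which is what motivates using a B-LSTM rather than a unidirectional decoder for caption generation.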


Acknowledgments

This research is supported by the Fundamental Research Funds for the Central Universities (Grant no. WK2350000002).

Author information

Corresponding author

Correspondence to Zhongfu Ye.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Ye, Z., Khan, R., Naqvi, N. et al. A novel automatic image caption generation using bidirectional long-short term memory framework. Multimed Tools Appl 80, 25557–25582 (2021). https://doi.org/10.1007/s11042-021-10632-6
