
Generation of Image Caption Using CNN-LSTM Based Approach

  • Conference paper

Intelligent Systems Design and Applications (ISDA 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 940)


Abstract

Image captioning has gained attention due to recent developments in deep neural architectures, but the gap between visual features and semantic concepts remains a major challenge in caption generation. In this paper, we develop a method that uses both visual and semantic features for caption generation. We briefly discuss the various architectures used for visual feature extraction and the Long Short-Term Memory (LSTM) network used for caption generation. An object recognition model is developed to identify semantic tags in the images; these tags are encoded along with the visual features for the captioning task. We build an encoder-decoder architecture that combines the semantic details with the language model for caption generation, and we evaluate our model on the standard datasets Flickr8k, Flickr30k, and MSCOCO using the standard metrics BLEU and METEOR.
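To make the described pipeline concrete, the sketch below shows one plausible realization of such an encoder-decoder in tf.keras. The 4096-dimensional visual feature vector, the multi-hot tag encoding, and all layer and vocabulary sizes are illustrative assumptions, not the authors' reported configuration: the model merges CNN visual features and semantic tag features with an LSTM language model and predicts the caption one word at a time.

import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000   # assumed caption vocabulary size
TAG_VOCAB = 1000    # assumed number of semantic tags
MAX_LEN = 34        # assumed maximum caption length
EMBED_DIM = 256     # assumed embedding/hidden size

# Visual stream: a precomputed CNN feature vector (e.g. a 4096-d
# fully-connected-layer activation from a pretrained network).
visual_in = layers.Input(shape=(4096,), name="visual_features")
visual = layers.Dense(EMBED_DIM, activation="relu")(visual_in)

# Semantic stream: a multi-hot vector marking the tags produced by
# the object recognition model, projected to the same size.
tags_in = layers.Input(shape=(TAG_VOCAB,), name="semantic_tags")
tags = layers.Dense(EMBED_DIM, activation="relu")(tags_in)

# Language stream: embed the caption generated so far and run it
# through an LSTM.
caption_in = layers.Input(shape=(MAX_LEN,), name="caption_prefix")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
lstm_out = layers.LSTM(EMBED_DIM)(embedded)

# Fuse the three streams and predict the next word of the caption.
merged = layers.add([visual, tags, lstm_out])
hidden = layers.Dense(EMBED_DIM, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[visual_in, tags_in, caption_in], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

Scoring generated captions against the references can likewise be scripted; for example, corpus-level BLEU is available in NLTK (with METEOR in nltk.translate.meteor_score). The toy inputs below are hypothetical:

from nltk.translate.bleu_score import corpus_bleu

# One image, one reference caption, one generated caption.
references = [[["a", "dog", "runs", "on", "the", "grass"]]]
hypotheses = [["a", "dog", "is", "running"]]
print(corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))  # BLEU-1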



Author information


Corresponding author

Correspondence to S. Aravindkumar.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Aravindkumar, S., Varalakshmi, P., Hemalatha, M. (2020). Generation of Image Caption Using CNN-LSTM Based Approach. In: Abraham, A., Cherukuri, A.K., Melin, P., Gandhi, N. (eds) Intelligent Systems Design and Applications. ISDA 2018. Advances in Intelligent Systems and Computing, vol 940. Springer, Cham. https://doi.org/10.1007/978-3-030-16657-1_43

