
Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study

Published in: Circuits, Systems, and Signal Processing

Abstract

An image caption generator produces syntactically and semantically correct sentences that describe the scene of a natural image. The neural image caption (NIC) generator is a popular deep learning model for automatically generating image captions in plain English; it combines a convolutional neural network (CNN) encoder with a long short-term memory (LSTM) decoder. This paper investigates the performance of different CNN encoders and recurrent neural network decoders to find the best encoder-decoder combination for image captioning. In addition, we test the caption generators with four image injection models and with decoding strategies such as greedy search and beam search. We conducted experiments on the Flickr8k dataset and analyzed the results qualitatively and quantitatively. Our results show that an automated image caption generator with a ResNet-101 encoder and an LSTM or gated recurrent unit (GRU) decoder outperforms the popular NIC generator when par-inject concatenate conditioning and beam search are used. For quantitative assessment, we used \(ROUGE_L\), \(CIDEr_D\), and \(BLEU_n\) scores to compare the different models.
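To make the architecture concrete, the sketch below shows a minimal par-inject (concatenate) CNN-RNN caption model together with a greedy decoding loop, written in Keras. This is an illustrative reconstruction, not the authors' implementation: the vocabulary size, caption length, layer widths, and the `<start>`/`<end>` token handling are placeholder assumptions, and the CNN features are assumed to be pre-extracted ResNet-101 global-pooling vectors.

```python
import numpy as np
from tensorflow.keras import layers, Model

# Placeholder hyperparameters (assumptions, not values from the paper).
vocab_size = 5000   # tokenised Flickr8k vocabulary size (assumed)
max_len = 34        # maximum caption length in tokens (assumed)
feat_dim = 2048     # ResNet-101 global-pooling feature size
embed_dim = 256
units = 256

# Encoder side: a pre-extracted CNN feature vector for the image.
image_input = layers.Input(shape=(feat_dim,), name="cnn_feature")
image_embed = layers.Dense(embed_dim, activation="relu")(image_input)
# Repeat the image embedding so it can be injected at every decoding step.
image_seq = layers.RepeatVector(max_len)(image_embed)

# Decoder side: the partial caption generated so far.
word_input = layers.Input(shape=(max_len,), name="caption_tokens")
word_embed = layers.Embedding(vocab_size, embed_dim)(word_input)

# Par-inject concatenate conditioning: the image embedding is concatenated with
# the word embedding at each time step and fed to the recurrent decoder
# (an LSTM here; a GRU layer is a drop-in replacement).
decoder_in = layers.Concatenate()([image_seq, word_embed])
decoder_state = layers.LSTM(units)(decoder_in)
next_word = layers.Dense(vocab_size, activation="softmax")(decoder_state)

model = Model(inputs=[image_input, word_input], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")


def greedy_decode(model, feature, word_to_id, id_to_word):
    """Greedy search: pick the most probable next word until <end> is produced."""
    tokens = [word_to_id["<start>"]]
    for _ in range(max_len - 1):
        padded = np.pad(tokens, (0, max_len - len(tokens)))[None, :]
        probs = model.predict([feature[None, :], padded], verbose=0)[0]
        tokens.append(int(np.argmax(probs)))
        if id_to_word[tokens[-1]] == "<end>":
            break
    return " ".join(id_to_word[t] for t in tokens[1:] if id_to_word[t] != "<end>")
```

Beam search replaces the argmax above by keeping the top-k partial captions at every step; the paper compares both decoding strategies.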


Data Availability

The data that support the findings of this study are available on request at https://forms.illinois.edu/sec/1713398.

Code Availability

The implementation code is available on GitHub at https://github.com/KRevati/Image-captioning.

Notes

  1. The code for our experiments is available at https://github.com/KRevati/Image-captioning.


Author information

Corresponding author

Correspondence to K. Revati Suresh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Suresh, K.R., Jarapala, A. & Sudeep, P.V. Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study. Circuits Syst Signal Process 41, 5719–5742 (2022). https://doi.org/10.1007/s00034-022-02050-2
