
Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study

Published in: Circuits, Systems, and Signal Processing

Abstract

An image caption generator produces syntactically and semantically correct sentences that describe the scene of a natural image. The neural image caption (NIC) generator is a popular deep learning model for automatically generating image captions in plain English; it combines a convolutional neural network (CNN) encoder with a long short-term memory (LSTM) decoder. This paper investigates the performance of different CNN encoders and recurrent neural network decoders to find the best encoder-decoder combination for image captioning. In addition, we test the caption generators with four image injection models and with decoding strategies such as greedy search and beam search. We conducted experiments on the Flickr8k dataset and analyzed the results qualitatively and quantitatively. Our results show that an automated image caption generator with a ResNet-101 encoder and an LSTM or gated recurrent unit (GRU) decoder outperforms the popular NIC generator when par-inject concatenate conditioning and beam search are used. For quantitative assessment, we used \(ROUGE_L\), \(CIDEr_D\), and \(BLEU_n\) scores to compare the different models.
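To make the architecture concrete, the sketch below shows a minimal par-inject (concatenate) CNN-RNN caption model together with a greedy decoding loop, written in Keras. This is an illustrative reconstruction, not the authors' implementation: the vocabulary size, caption length, layer widths, and the `<start>`/`<end>` token handling are placeholder assumptions, and the CNN features are assumed to be pre-extracted ResNet-101 global-pooling vectors.

```python
import numpy as np
from tensorflow.keras import layers, Model

# Placeholder hyperparameters (assumptions, not values from the paper).
vocab_size = 5000   # tokenised Flickr8k vocabulary size (assumed)
max_len = 34        # maximum caption length in tokens (assumed)
feat_dim = 2048     # ResNet-101 global-pooling feature size
embed_dim = 256
units = 256

# Encoder side: a pre-extracted CNN feature vector for the image.
image_input = layers.Input(shape=(feat_dim,), name="cnn_feature")
image_embed = layers.Dense(embed_dim, activation="relu")(image_input)
# Repeat the image embedding so it can be injected at every decoding step.
image_seq = layers.RepeatVector(max_len)(image_embed)

# Decoder side: the partial caption generated so far.
word_input = layers.Input(shape=(max_len,), name="caption_tokens")
word_embed = layers.Embedding(vocab_size, embed_dim)(word_input)

# Par-inject concatenate conditioning: the image embedding is concatenated with
# the word embedding at each time step and fed to the recurrent decoder
# (an LSTM here; a GRU layer is a drop-in replacement).
decoder_in = layers.Concatenate()([image_seq, word_embed])
decoder_state = layers.LSTM(units)(decoder_in)
next_word = layers.Dense(vocab_size, activation="softmax")(decoder_state)

model = Model(inputs=[image_input, word_input], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")


def greedy_decode(model, feature, word_to_id, id_to_word):
    """Greedy search: pick the most probable next word until <end> is produced."""
    tokens = [word_to_id["<start>"]]
    for _ in range(max_len - 1):
        padded = np.pad(tokens, (0, max_len - len(tokens)))[None, :]
        probs = model.predict([feature[None, :], padded], verbose=0)[0]
        tokens.append(int(np.argmax(probs)))
        if id_to_word[tokens[-1]] == "<end>":
            break
    return " ".join(id_to_word[t] for t in tokens[1:] if id_to_word[t] != "<end>")
```

Beam search replaces the argmax above by keeping the top-k partial captions at every step; the paper compares both decoding strategies.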


Data Availability

The data that support the findings of this study are available on request at https://forms.illinois.edu/sec/1713398.

Code Availability

The implementation code is available on GitHub at https://github.com/KRevati/Image-captioning.

Notes

  1. The code for our experiments is available at https://github.com/KRevati/Image-captioning.


Author information

Corresponding author

Correspondence to K. Revati Suresh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Suresh, K.R., Jarapala, A. & Sudeep, P.V. Image Captioning Encoder–Decoder Models Using CNN-RNN Architectures: A Comparative Study. Circuits Syst Signal Process 41, 5719–5742 (2022). https://doi.org/10.1007/s00034-022-02050-2
