A detailed review of prevailing image captioning methods using deep learning techniques

Multimedia Tools and Applications

Abstract

Image captioning is a challenging task at the intersection of computer vision and natural language processing. The central difficulty lies in extracting semantic information from images and translating it into natural language automatically, and the interplay between the two fields adds further complexity. Considerable research has therefore applied deep learning techniques, which are well suited to these complexities, to narrow the semantic gap. This paper presents a detailed study of the state-of-the-art deep learning techniques for image captioning, discussing the working algorithm, strengths, and weaknesses of each technique. We also discuss the quantitative evaluation measures used to assess these techniques and the datasets available for the task.
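
To make the dominant encoder-decoder design concrete, the following is a minimal PyTorch sketch in the spirit of the show-and-tell model of Vinyals et al. [45]: a pretrained CNN encodes the image into a fixed-length feature vector, and an LSTM decodes that vector into a word sequence. The layer sizes, the frozen backbone, and the toy vocabulary below are illustrative assumptions, not the configuration of any surveyed method.

# A minimal, illustrative encoder-decoder captioning sketch (PyTorch).
# Sizes and vocabulary are hypothetical, not the settings of any surveyed paper.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained CNN."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        with torch.no_grad():  # keep the pretrained backbone frozen
            features = self.backbone(images).flatten(1)
        return self.fc(features)

class DecoderRNN(nn.Module):
    """Generate a caption word by word with an LSTM conditioned on image features."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)  # per-step scores over the vocabulary

# Shape check only; a real pipeline tokenizes captions and trains with cross-entropy.
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=10000)
images = torch.randn(4, 3, 224, 224)          # batch of 4 RGB images
captions = torch.randint(0, 10000, (4, 15))   # batch of 4 tokenized captions
logits = decoder(encoder(images), captions)   # -> (4, 16, 10000)

In a full pipeline, the decoder would be trained against ground-truth captions from a dataset such as Microsoft COCO [11], and generated captions would be scored with measures such as BLEU [35], METEOR [14], CIDEr [44], or SPICE [2].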

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5 (source: Vinyals et al. [45])
Fig. 6
Fig. 7 (source: Xu et al. [48])
Fig. 8 (source: You et al. [50])
Fig. 9 (source: Chen et al. [9])
Fig. 10 (source: Cao et al. [7])
Fig. 11 (source: Chen et al. [10])
Fig. 12 (source: Anderson et al. [2])

References

  1. Anderson P et al (2017) Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint https://arxiv.org/abs/1707.07998

  2. Anderson P, Fernando B, Johnson M, Gould S (2016) SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision 382–398. Springer, Cham

  3. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6077–6086

  4. Aneja J, Deshpande A, Schwing AG (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 5561–5570

  5. Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. Proc Int Conf Learn Represent (ICLR)

  6. Bin Y, Yang Y, Zhou J, Huang Z, Shen HT (2017) Adaptively attending to visual attributes and linguistic knowledge for captioning. In Proceedings of the ACM International Conference on Multimedia 1345–1353

  7. Cao P, Yang Z, Sun L, Liang Y, Yang MQ, Guan R (2019) Image captioning with bidirectional semantic attention-based guiding of long short-term memory. Neural Process Lett 50(1):103–119

  8. Chen F et al (2019) Semantic-aware Image Deblurring. arXiv preprint https://arxiv.org/abs/1910.03853

  9. Chen L, Zhang H, Xiao J, Nie L, Shao J, Liu W, Chua TS (2017) Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 5659–5667

  10. Chen T, Zhang Z, You Q, Fang C, Wang Z, Jin H, Luo J (2018) "Factual" or "Emotional": Stylized image captioning with adaptive learning and attention. In Proceedings of the European Conference on Computer Vision (ECCV) 519–535

  11. Chen X, Fang H, Lin TY, Vedantam R, Gupta S, Dollár P, Zitnick CL (2015) Microsoft coco captions: Data collection and evaluation server. arXiv preprint https://arxiv.org/abs/1504.00325

  12. Cho K et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing 1724–1734

  13. Dai B, Fidler S, Urtasun R, Lin D (2017) Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision 2970–2979

  14. Denkowski M, Lavie A (2014) Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland 376–380

  15. Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2625–2634

  16. Dognin P et al (2019) Adversarial semantic alignment for improved image captions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  17. Fang H et al (2014) From captions to visual concepts and back. arXiv preprint https://arxiv.org/abs/1411.4952

  18. Giménez J, Màrquez L (2007) Linguistic features for automatic evaluation of heterogenous MT systems. In Proceedings of the Second Workshop on Statistical Machine Translation of the Association for Computational Linguistics 256–264

  19. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint https://arxiv.org/abs/1311.2524

  20. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778

  21. Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: Data, models and evaluation metrics. J Artif Intell Res 47:853–899

  22. Huang Z, Shi Z (2020) Image Caption Combined with GAN Training Method. International Conference on Intelligent Information Processing. Springer, Cham

  23. Jin J, Fu K, Cui R, Sha F, Zhang C (2015) Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv preprint https://arxiv.org/abs/1506.06272

  24. Johnson J, Karpathy A, Fei-Fei L (2016) DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4565–4574

  25. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 3128–3137

  26. Klein D, Manning CD (2003) Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics 1:423–430

  27. Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S et al (2017) Visual genome: Connecting language and vision using crowdsourced dense image annotations. Inter J Comput Vis 123(1):32–73

  28. Li S, Tao Z, Li K, Fu Y (2019) Visual to Text: Survey of Image and Video Captioning. IEEE Transactions on Emerging Topics in Computational Intelligence 3(4):297–312

  29. Li X, Jiang S (2019) Know more say less: Image captioning based on scene graphs. IEEE Transactions on Multimedia 21(8):2117–2130

  30. Lin CY, Cao G, Gao J, Nie JY (2006) An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics pp. 463–470

  31. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollar P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision

  32. Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 375–383

  33. Mavridis N (2007) Grounded situation models for situated conversational assistants. Ph.D. dissertation, Dept of Architecture, Massachusetts Institute of Technology, Cambridge, MA, USA

  34. Nivre J, De Marneffe MC, Ginter F, Goldberg Y, Hajic J, Manning CD, McDonald R et al (2016) Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) 1659–1666

  35. Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics 311–318

  36. Peters ME, Ammar W, Bhagavatula C, Power R (2017) Semi-supervised sequence tagging with bidirectional language models. arXiv preprint https://arxiv.org/abs/1705.00108

  37. Rashtchian C, Young P, Hodosh M, Hockenmaier J (2010) Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL Conference on Human Language Technologies

  38. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 91–99

  39. Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7008–7024

  40. Shi H, Li P, Wang B, Wang Z (2018) Image captioning based on deep reinforcement learning. In Proceedings of the 10th International Conference on Internet Multimedia Computing and Service 1–5

  41. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. Proc Int Conf Learn Represent (ICLR)

  42. Szegedy C et al (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1–9

  43. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. Adv Neural Inf Process Syst 27:3104–3112

  44. Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4566–4575

  45. Vinyals O et al (2014) Show and tell: A neural image caption generator. arXiv preprint https://arxiv.org/abs/1411.4555

  46. Wallraven C, Schultze M, Mohler B, Vatakis A, Pastra K (2011) The Poeticon enacted scenario corpus - A tool for human and computational experiments on action understanding. In Proceedings of the 9th IEEE Conference on Automatic Face & Gesture Recognition 484–491

  47. Wang L, Schwing AG, Lazebnik S (2017) Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space. In Advances in Neural Information Processing Systems 5756–5766

  48. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning 2048–2057

  49. Yan S et al (2020) Image captioning via hierarchical attention mechanism and policy gradient optimization. Sig Proc 167:107329

  50. You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 4651–4659

  51. Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist 2:67–78

  52. Ren Z, Wang X, Zhang N, Lv X, Li LJ (2017) Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 290–298

Author information

Correspondence to Kalpana Deorukhkar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Deorukhkar, K., Ket, S. A detailed review of prevailing image captioning methods using deep learning techniques. Multimed Tools Appl 81, 1313–1336 (2022). https://doi.org/10.1007/s11042-021-11293-1
