Abstract
Image captioning is a challenging task at the intersection of computer vision and natural language processing: a machine must extract semantic information from an image and translate it into fluent human language. The interplay of these two fields compounds the complexity of the task. Considerable research has been devoted to narrowing this semantic gap using deep learning techniques, which have proven well suited to the complexities of image captioning. This paper presents a detailed study of the state-of-the-art image captioning techniques, discussing each technique's working algorithm, strengths, and weaknesses. We also discuss the quantitative evaluation metrics used with deep learning techniques and the available datasets.
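To make the quantitative evaluation mentioned above concrete, the sketch below implements a minimal single-reference BLEU score in plain Python. This is an illustrative simplification, not the paper's own code: the function name `bleu`, its signature, and the choice of a single reference sentence (standard BLEU averages over multiple references) are assumptions made here for brevity.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU (after Papineni et al., 2002).

    candidate, reference: lists of tokens (e.g. from str.split()).
    Returns a score in [0, 1].
    """
    # Modified (clipped) n-gram precision for n = 1..max_n
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # A candidate n-gram only counts up to the number of times
        # it appears in the reference (clipping).
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)

    # If any n-gram order has zero overlap, the geometric mean is zero.
    if min(precisions) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)

    # Brevity penalty: penalise candidates shorter than the reference
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean
```

A caption identical to its reference scores 1.0, while a caption sharing no unigrams with the reference scores 0.0; metrics such as METEOR, CIDEr, and SPICE refine this idea with synonym matching, consensus weighting, and scene-graph comparison respectively.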
Deorukhkar, K., Ket, S. A detailed review of prevailing image captioning methods using deep learning techniques. Multimed Tools Appl 81, 1313–1336 (2022). https://doi.org/10.1007/s11042-021-11293-1