A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues


Abstract

Image captioning is a relatively young area at the convergence of computer vision and natural language processing, with applications ranging from multi-modal search, robotics, and security to remote sensing, medicine, and visual aid. Image captioning techniques have undergone a paradigm shift from classical machine-learning-based approaches to contemporary deep learning-based techniques. In this survey, we present an in-depth investigation of image captioning methodologies organized by our proposed taxonomy. The study traces several eras of advancement, including template-based, retrieval-based, and encoder-decoder-based models, and also covers captioning in languages other than English. Benchmark image captioning datasets and assessment measures are examined in detail. The limited effectiveness of real-time image captioning remains a severe barrier to its use in sensitive applications such as visual aid, security, and medicine. Another observation from our research is the scarcity of domain-specific and personalized datasets, which limits the adoption of image captioning for more specialized problems. Despite influential contributions from many researchers, further efforts are required to construct substantially more robust and reliable image captioning models.
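As a concrete illustration of the encoder-decoder paradigm surveyed here, the sketch below pairs a pretrained CNN encoder with an LSTM decoder, in the spirit of classic "Show and Tell"-style models. This is a minimal sketch, assuming PyTorch and torchvision; the class names, dimensions, and training setup are illustrative assumptions, not the method of any particular surveyed paper.

```python
# Minimal encoder-decoder image-captioning sketch (illustrative only).
# Assumes PyTorch and torchvision >= 0.13; vocabulary and sizes are hypothetical.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained CNN."""
    def __init__(self, embed_size: int):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        modules = list(backbone.children())[:-1]     # drop the classification head
        self.cnn = nn.Sequential(*modules)
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                        # keep the backbone frozen
            features = self.cnn(images).flatten(1)   # (B, 2048)
        return self.fc(features)                     # (B, embed_size)


class DecoderRNN(nn.Module):
    """Generate a caption token by token, conditioned on the image feature."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: the image feature acts as the first "token".
        embeddings = self.embed(captions[:, :-1])    # (B, T-1, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hiddens, _ = self.lstm(inputs)               # (B, T, H)
        return self.fc(hiddens)                      # (B, T, vocab_size) logits
```

Such a model is typically trained with token-level cross-entropy under teacher forcing; at inference time, captions are decoded greedily or with beam search and scored with measures such as BLEU, METEOR, ROUGE-L, CIDEr, and SPICE.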



Acknowledgements

The authors sincerely thank the Editor and the Reviewers for their insightful remarks and constructive suggestions, which helped improve this work.

Author information

Corresponding author

Correspondence to Himanshu Sharma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 10, 11 and 12.

Table 10 Comprehensive information about the publishers of the research articles that are cited in this survey
Table 11 A frequency distribution table of the article publishers cited in this survey
Table 12 A frequency distribution table of article types cited in this survey

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Sharma, H., Padha, D. A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues. Artif Intell Rev 56, 13619–13661 (2023). https://doi.org/10.1007/s10462-023-10488-2
