A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues


Abstract

Image captioning is a relatively young area at the convergence of computer vision and natural language processing, with applications ranging from multi-modal search, robotics, and security to remote sensing, medicine, and visual aid. Image captioning techniques have undergone a paradigm shift from classical machine-learning-based approaches to contemporary deep learning-based techniques. In this survey, we present an in-depth investigation of image captioning methodologies organized by our proposed taxonomy. The study traces several eras of advancement, including template-based, retrieval-based, and encoder-decoder-based models, and also covers captioning in languages other than English. Benchmark image captioning datasets and assessment measures are examined in detail. The limited effectiveness of real-time image captioning remains a severe barrier to its use in sensitive applications such as visual aid, security, and medicine. Another observation from our research is the scarcity of domain-specific and personalized datasets, which limits the adoption of image captioning for more specialized problems. Despite influential contributions from many researchers, further efforts are required to construct substantially more robust and reliable image captioning models.
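As a concrete illustration of the encoder-decoder paradigm surveyed here, the sketch below pairs a pretrained CNN encoder with an LSTM decoder, in the spirit of classic "Show and Tell"-style models. This is a minimal sketch, assuming PyTorch and torchvision; the class names, dimensions, and training setup are illustrative assumptions, not the method of any particular surveyed paper.

```python
# Minimal encoder-decoder image-captioning sketch (illustrative only).
# Assumes PyTorch and torchvision >= 0.13; vocabulary and sizes are hypothetical.
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a pretrained CNN."""
    def __init__(self, embed_size: int):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        modules = list(backbone.children())[:-1]     # drop the classification head
        self.cnn = nn.Sequential(*modules)
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                        # keep the backbone frozen
            features = self.cnn(images).flatten(1)   # (B, 2048)
        return self.fc(features)                     # (B, embed_size)


class DecoderRNN(nn.Module):
    """Generate a caption token by token, conditioned on the image feature."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: the image feature acts as the first "token".
        embeddings = self.embed(captions[:, :-1])    # (B, T-1, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        hiddens, _ = self.lstm(inputs)               # (B, T, H)
        return self.fc(hiddens)                      # (B, T, vocab_size) logits
```

Such a model is typically trained with token-level cross-entropy under teacher forcing; at inference time, captions are decoded greedily or with beam search and scored with measures such as BLEU, METEOR, ROUGE-L, CIDEr, and SPICE.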



Acknowledgements

The authors sincerely thank the Editor and the Reviewers for their insightful remarks and constructive suggestions, which helped improve this work.

Author information

Corresponding author

Correspondence to Himanshu Sharma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 10, 11 and 12.

Table 10 Comprehensive information about the publishers of the research articles that are cited in this survey
Table 11 A frequency distribution table of the article publishers cited in this survey
Table 12 A frequency distribution table of article types cited in this survey

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Sharma, H., Padha, D. A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues. Artif Intell Rev 56, 13619–13661 (2023). https://doi.org/10.1007/s10462-023-10488-2
