A survey on deep neural network-based image captioning

  • Survey
  • Published in: The Visual Computer

Abstract

Image captioning is a hot topic in image understanding, and it is composed of two natural parts (“look” and “language expression”) that correspond to two of the most important fields of artificial intelligence (“machine vision” and “natural language processing”). With the development of deep neural networks and better-labeled databases, image captioning techniques have advanced quickly. In this survey, image captioning approaches and improvements based on deep neural networks are introduced, including the characteristics of the specific techniques. The earliest deep neural network-based approach is the retrieval-based method, which uses a search technique to find an appropriate description for a given image. The template-based method separates the image captioning process into object detection and sentence generation. More recently, end-to-end learning-based methods have proven effective at image captioning and can generate more flexible and fluent sentences. These image captioning methods are reviewed in detail, and some remaining challenges are discussed.
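The end-to-end paradigm mentioned above typically pairs a CNN image encoder with an RNN language decoder. The following is a minimal, illustrative sketch of that architecture, not code from any of the surveyed papers; the ResNet-50 backbone, LSTM decoder, embedding sizes, and vocabulary size are assumptions chosen purely for brevity.

# Minimal illustrative sketch (assumptions noted above) of the end-to-end
# encoder-decoder captioning paradigm: a CNN "looks" at the image, and an
# LSTM generates the sentence word by word.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: ResNet-50 with its classification head removed.
        backbone = models.resnet50()
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.img_proj = nn.Linear(2048, embed_dim)
        # RNN decoder: embeds previous words and predicts the next one.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)        # (B, 2048)
        img_token = self.img_proj(feats).unsqueeze(1)  # image fed as the first token
        word_tokens = self.embed(captions)             # (B, T, embed_dim)
        inputs = torch.cat([img_token, word_tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                        # logits over the vocabulary

# Dummy forward pass; training would minimize cross-entropy between these
# logits and the ground-truth caption shifted by one position.
model = CaptionModel(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
print(model(images, captions).shape)  # torch.Size([2, 13, 10000])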





Acknowledgements

The authors would like to thank the two anonymous reviewers and the editor-in-chief for their comments, which helped improve the paper. This work is supported by the National Natural Science Foundation of China (under Grants 61603214, 61573213, 51009017 and 51379002), the Shandong Provincial Key Research and Development Plan (2018GGX101039), the Shandong Provincial Natural Science Foundation (ZR2015PF009, 2016ZRE2703), the Fund for Dalian Distinguished Young Scholars (under Grant 2016RJ10), the Innovation Support Plan for Dalian High-level Talents (under Grant 2015R065), and the Fundamental Research Funds for the Central Universities (under Grants 3132016314 and 3132018126).

Author information

Corresponding authors

Correspondence to Qingyang Xu or Ning Wang.

About this article

Cite this article

Liu, X., Xu, Q. & Wang, N. A survey on deep neural network-based image captioning. Vis Comput 35, 445–470 (2019). https://doi.org/10.1007/s00371-018-1566-y

Keywords