Abstract
How to represent image information effectively is key to the image captioning task. Existing research has proposed a large number of image captioning methods, most of which use the global features of the image, so regions irrelevant to caption generation also participate in the computation and waste resources. To solve this problem, this paper proposes an image captioning method based on object detection. First, an object detection algorithm extracts image features so that only the features of meaningful regions in the image are used; captions are then generated by combining a spatial attention mechanism with the caption generation network. Experiments show that the features of object regions and salient regions are sufficient to represent the information of the entire image for the captioning task. For better convergence of the model, this paper also adopts a new training strategy. The experimental results show that the proposed model performs well on the image captioning test dataset and demonstrates the promise of detection-based captioning.
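The abstract does not spell out the attention formulation, but the described pipeline (region features from a detector, weighted by a spatial attention mechanism conditioned on the decoder state) can be sketched as additive attention over K detected-region features. All names, dimensions, and the additive scoring form below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def spatial_attention(regions, hidden, W_r, W_h, w):
    """Score each detected region against the decoder state and
    return the attention-weighted context vector.

    regions: (K, D) features of K detected object regions
    hidden:  (H,)   current caption-decoder (e.g. LSTM) hidden state
    W_r: (A, D), W_h: (A, H), w: (A,)  learned projections (illustrative)
    """
    # e_k = w^T tanh(W_r r_k + W_h h): one additive attention score per region
    scores = np.tanh(regions @ W_r.T + hidden @ W_h.T) @ w   # shape (K,)
    # softmax over regions so the weights sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context vector: weighted sum of region features, fed to the decoder
    context = weights @ regions                              # shape (D,)
    return context, weights

# toy example: 5 regions with 8-dim features, 6-dim hidden state
rng = np.random.default_rng(0)
K, D, H, A = 5, 8, 6, 4
ctx, alpha = spatial_attention(rng.standard_normal((K, D)),
                               rng.standard_normal(H),
                               rng.standard_normal((A, D)),
                               rng.standard_normal((A, H)),
                               rng.standard_normal(A))
```

At each decoding step the context vector would be concatenated with the word embedding as decoder input, so only the detected regions, rather than a global feature map, contribute to each generated word.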
Acknowledgements
The work was supported by Yuyou Talent Support Plan of North China University of Technology (107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (110052971803/037), Special Research Foundation of North China University of Technology (PXM2017_014212_000014).
Cite this article
Cao, D., Zhu, M. & Gao, L. An image caption method based on object detection. Multimed Tools Appl 78, 35329–35350 (2019). https://doi.org/10.1007/s11042-019-08116-9