Abstract
How to represent image information effectively is key to the image captioning task. Existing research has proposed a large number of image captioning methods, most of which use the global features of the image, so regions irrelevant to caption generation also participate in the computation and waste resources. To solve this problem, this paper proposes an image captioning method based on object detection. First, an object detection algorithm extracts image features so that only the features of meaningful regions in the image are used; captions are then generated by combining a spatial attention mechanism with the caption generation network. Experiments show that the features of object regions and salient regions are sufficient to represent the information of the entire image for the captioning task. For better convergence of the model, this paper also adopts a new training strategy. The experimental results show that the proposed model performs well on the image captioning test dataset and demonstrates the promise of detection-based captioning.
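The abstract does not spell out the attention formulation, but the described pipeline (region features from a detector, weighted by a spatial attention mechanism conditioned on the decoder state) can be sketched as additive attention over K detected-region features. All names, dimensions, and the additive scoring form below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def spatial_attention(regions, hidden, W_r, W_h, w):
    """Score each detected region against the decoder state and
    return the attention-weighted context vector.

    regions: (K, D) features of K detected object regions
    hidden:  (H,)   current caption-decoder (e.g. LSTM) hidden state
    W_r: (A, D), W_h: (A, H), w: (A,)  learned projections (illustrative)
    """
    # e_k = w^T tanh(W_r r_k + W_h h): one additive attention score per region
    scores = np.tanh(regions @ W_r.T + hidden @ W_h.T) @ w   # shape (K,)
    # softmax over regions so the weights sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context vector: weighted sum of region features, fed to the decoder
    context = weights @ regions                              # shape (D,)
    return context, weights

# toy example: 5 regions with 8-dim features, 6-dim hidden state
rng = np.random.default_rng(0)
K, D, H, A = 5, 8, 6, 4
ctx, alpha = spatial_attention(rng.standard_normal((K, D)),
                               rng.standard_normal(H),
                               rng.standard_normal((A, D)),
                               rng.standard_normal((A, H)),
                               rng.standard_normal(A))
```

At each decoding step the context vector would be concatenated with the word embedding as decoder input, so only the detected regions, rather than a global feature map, contribute to each generated word.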
Acknowledgements
The work was supported by Yuyou Talent Support Plan of North China University of Technology (107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (110052971803/037), Special Research Foundation of North China University of Technology (PXM2017_014212_000014).
Cite this article
Cao, D., Zhu, M. & Gao, L. An image caption method based on object detection. Multimed Tools Appl 78, 35329–35350 (2019). https://doi.org/10.1007/s11042-019-08116-9