
An image caption method based on object detection

Published in Multimedia Tools and Applications

Abstract

How to represent image information more effectively is the key problem in the image captioning task. Many image captioning methods have been proposed in existing research. Most of them use the global features of the image, so regions that are irrelevant to caption generation also participate in the computation, which wastes resources. To solve this problem, this paper proposes an image captioning method based on object detection. First, an object detection algorithm is used to extract image features, so that only the features of meaningful regions in the image are used; then the caption is generated by combining a spatial attention mechanism with the caption generation network. Experiments show that the features of the object regions and salient regions are sufficient to represent the information of the entire image for the image captioning task. For better convergence of the model, this paper also adopts a new training strategy. The experimental results show that the proposed model performs well on the image captioning test dataset and, to a large extent, demonstrates the potential of this approach.
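To make the pipeline described above concrete, the following is a minimal sketch of how per-region features produced by an object detector (e.g., Faster R-CNN) could be combined with a spatial attention mechanism and an LSTM caption generator. It is written in PyTorch purely for illustration; the class name, variable names, and dimensions are assumptions and do not reflect the authors' actual implementation.

```python
# Illustrative sketch only: a spatial-attention LSTM decoder over detected-region features.
import torch
import torch.nn as nn

class RegionAttentionDecoder(nn.Module):
    """LSTM caption decoder with additive spatial attention over region features."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention scores each region feature against the current hidden state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        # The LSTM consumes the word embedding concatenated with the attended feature.
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, region_feats, captions):
        # region_feats: (batch, num_regions, feat_dim), pooled features of detected regions
        # captions:     (batch, seq_len), ground-truth token ids (teacher forcing)
        batch = region_feats.size(0)
        h = region_feats.new_zeros(batch, self.lstm.hidden_size)
        c = region_feats.new_zeros(batch, self.lstm.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            # Spatial attention: weight each region by its relevance to the hidden state.
            scores = self.att_out(torch.tanh(
                self.att_feat(region_feats) + self.att_hid(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)            # (batch, num_regions, 1)
            context = (alpha * region_feats).sum(dim=1)     # (batch, feat_dim)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                   # (batch, seq_len, vocab_size)
```

The key point of the design is that the decoder attends only over the handful of region features returned by the detector, rather than over a dense grid of global CNN features, which is what allows irrelevant parts of the image to be excluded from the computation.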




Acknowledgements

The work was supported by Yuyou Talent Support Plan of North China University of Technology (107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (110052971803/037), Special Research Foundation of North China University of Technology (PXM2017_014212_000014).

Author information

Corresponding author

Correspondence to Danyang Cao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Cao, D., Zhu, M. & Gao, L. An image caption method based on object detection. Multimed Tools Appl 78, 35329–35350 (2019). https://doi.org/10.1007/s11042-019-08116-9

