
Object semantic analysis for image captioning

Multimedia Tools and Applications

Abstract

Although existing image captioning models can produce sentences through attention mechanisms and recurrent neural networks, they find it difficult to generate multiple sentences that describe different important objects. Most image captioning models lack description diversity, whereas diversity-oriented models often describe unimportant objects, resulting in low accuracy. In this paper, we propose a novel approach to balancing accuracy and diversity. To achieve this, we design a model that combines saliency information with the objects’ relative position information to assess the semantic importance of all detected objects. By preserving the features of important objects, and by operating on the features of unimportant objects while keeping the network able to describe the important ones, our model can generate sentences with greater diversity or greater accuracy. Experiments on the MSCOCO and Flickr 30K datasets demonstrate the characteristics of our model: on both datasets, it can provide a set of accurate or diverse descriptions. Compared with state-of-the-art models under standard captioning metrics and human evaluation metrics, our model generates more diverse or more accurate sentences.
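The abstract describes the importance-scoring idea only at a high level. As a rough illustration, and not the authors' actual formulation, the following Python sketch combines a saliency cue with a relative-position cue into a single importance score for detected objects; the inputs (boxes, saliency_map), the weight alpha, and the centre-distance heuristic are all hypothetical assumptions made for this sketch.

    import numpy as np

    def rank_objects(boxes, saliency_map, alpha=0.5):
        # boxes: list of (x1, y1, x2, y2) pixel coordinates from any object detector.
        # saliency_map: H x W array in [0, 1] from any saliency model.
        # alpha: hypothetical weight trading off the two cues.
        h, w = saliency_map.shape
        image_center = np.array([w / 2.0, h / 2.0])
        scores = []
        for (x1, y1, x2, y2) in boxes:
            # Cue 1: mean saliency inside the bounding box.
            sal = saliency_map[int(y1):int(y2), int(x1):int(x2)].mean()
            # Cue 2: relative position; boxes near the image centre score higher.
            box_center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
            dist = np.linalg.norm(box_center - image_center) / np.linalg.norm(image_center)
            scores.append(alpha * sal + (1.0 - alpha) * (1.0 - dist))
        # Indices of the detected objects, most important first.
        return list(np.argsort(scores)[::-1])

Under this reading of the abstract, accurate captions would keep the features of the top-ranked objects intact, while diversity would come from operating on the features of the lower-ranked ones.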



Acknowledgements

This research was supported by the NSFC (No. 61771386), the Key Research and Development Program of Shaanxi (Program No. 2020SF-359), the Research and Development of Manufacturing Information System Platform Supporting Product Lifecycle Management (No. 2018GY-030), the Natural Science Foundation of Shaanxi Province (No. 2021JQ-487), and the Scientific Research Program Funded by Shaanxi Education Department (No. 20JK0788).

Author information

Corresponding author

Correspondence to Hong Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Du, S., Zhu, H., Lin, G. et al. Object semantic analysis for image captioning. Multimed Tools Appl 82, 43179–43206 (2023). https://doi.org/10.1007/s11042-023-14596-7

