Abstract
Although existing image captioning models can generate sentences through attention mechanisms and recurrent neural networks, they struggle to produce multiple sentences that describe the different important objects in an image. Most image captioning models lack descriptive diversity, whereas diversity-oriented models often describe unimportant objects, resulting in low accuracy. In this paper, we propose a novel approach to balancing accuracy and diversity. To achieve this, we design a model that combines saliency information with the relative positions of detected objects to assess the semantic importance of every detected object. By preserving the features of important objects, and operating on the features of unimportant objects so that the network can still describe the important ones, our model can generate sentences with greater diversity or greater accuracy. Experiments on the MSCOCO and Flickr30K datasets demonstrate these properties: on both datasets, our model can provide a set of accurate or diverse descriptions. Compared with state-of-the-art models under standard captioning metrics and human evaluation, our model outperforms prior work in generating more diverse or more accurate sentences.
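The core idea of scoring each detected object by combining a saliency cue with a relative-position cue can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the field names, the centre-distance position score, and the blending weight `alpha` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectedObject:
    label: str
    saliency: float  # e.g. mean saliency-map value inside the box, in [0, 1]
    cx: float        # box centre x, normalised to [0, 1]
    cy: float        # box centre y, normalised to [0, 1]

def importance(obj: DetectedObject, alpha: float = 0.7) -> float:
    """Blend a saliency cue with a relative-position cue (assumed weighting)."""
    # Position score: objects near the image centre score higher.
    # The maximum possible centre distance is 0.5 * sqrt(2) (a corner).
    dist_to_centre = ((obj.cx - 0.5) ** 2 + (obj.cy - 0.5) ** 2) ** 0.5
    position_score = 1.0 - min(dist_to_centre / (0.5 * 2 ** 0.5), 1.0)
    return alpha * obj.saliency + (1.0 - alpha) * position_score

def rank_objects(objs: List[DetectedObject]) -> List[DetectedObject]:
    """Order detections from most to least semantically important."""
    return sorted(objs, key=importance, reverse=True)

# A centred, salient object outranks a peripheral, low-saliency one.
detections = [
    DetectedObject("tree", saliency=0.2, cx=0.1, cy=0.1),
    DetectedObject("dog", saliency=0.9, cx=0.5, cy=0.5),
]
ranked = rank_objects(detections)
```

A captioning decoder could then keep the features of the top-ranked objects fixed while perturbing or suppressing the rest, trading off accuracy against diversity as the abstract describes.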
Acknowledgements
This research was supported by the National Natural Science Foundation of China (No. 61771386), the Key Research and Development Program of Shaanxi (No. 2020SF-359), the Research and Development of Manufacturing Information System Platform Supporting Product Lifecycle Management (No. 2018GY-030), the Natural Science Foundation of Shaanxi Province (No. 2021JQ-487), and the Scientific Research Program Funded by the Shaanxi Education Department (No. 20JK0788).
Ethics declarations
Conflict of interest
The authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Du, S., Zhu, H., Lin, G. et al. Object semantic analysis for image captioning. Multimed Tools Appl 82, 43179–43206 (2023). https://doi.org/10.1007/s11042-023-14596-7