Abstract
Although existing image captioning models can generate sentences through attention mechanisms and recurrent neural networks, they struggle to produce multiple sentences that describe the different important objects in an image. Most image captioning models lack descriptive diversity, whereas diversity-oriented models often describe unimportant objects, resulting in low accuracy. In this paper, we propose a novel approach to balancing accuracy and diversity. To achieve this, we design a model that combines saliency information with the relative positions of detected objects to assess the semantic importance of every detected object. By preserving the features of important objects, and operating on the features of unimportant objects so that the network can still describe the important ones, our model can generate sentences with greater diversity or greater accuracy. Experiments on the MSCOCO and Flickr30K datasets demonstrate these properties: on both datasets, our model can provide a set of accurate or diverse descriptions. Compared with state-of-the-art models under standard captioning metrics and human evaluation, our model outperforms prior work in generating more diverse or more accurate sentences.
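The core idea of scoring each detected object by combining a saliency cue with a relative-position cue can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the field names, the centre-distance position score, and the blending weight `alpha` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectedObject:
    label: str
    saliency: float  # e.g. mean saliency-map value inside the box, in [0, 1]
    cx: float        # box centre x, normalised to [0, 1]
    cy: float        # box centre y, normalised to [0, 1]

def importance(obj: DetectedObject, alpha: float = 0.7) -> float:
    """Blend a saliency cue with a relative-position cue (assumed weighting)."""
    # Position score: objects near the image centre score higher.
    # The maximum possible centre distance is 0.5 * sqrt(2) (a corner).
    dist_to_centre = ((obj.cx - 0.5) ** 2 + (obj.cy - 0.5) ** 2) ** 0.5
    position_score = 1.0 - min(dist_to_centre / (0.5 * 2 ** 0.5), 1.0)
    return alpha * obj.saliency + (1.0 - alpha) * position_score

def rank_objects(objs: List[DetectedObject]) -> List[DetectedObject]:
    """Order detections from most to least semantically important."""
    return sorted(objs, key=importance, reverse=True)

# A centred, salient object outranks a peripheral, low-saliency one.
detections = [
    DetectedObject("tree", saliency=0.2, cx=0.1, cy=0.1),
    DetectedObject("dog", saliency=0.9, cx=0.5, cy=0.5),
]
ranked = rank_objects(detections)
```

A captioning decoder could then keep the features of the top-ranked objects fixed while perturbing or suppressing the rest, trading off accuracy against diversity as the abstract describes.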
Acknowledgements
This research was supported by the National Natural Science Foundation of China (No. 61771386), the Key Research and Development Program of Shaanxi (No. 2020SF-359), the Research and Development of Manufacturing Information System Platform Supporting Product Lifecycle Management (No. 2018GY-030), the Natural Science Foundation of Shaanxi Province (No. 2021JQ-487), and the Scientific Research Program Funded by the Shaanxi Education Department (No. 20JK0788).
Ethics declarations
Conflict of interest
The authors declare that they have no competing financial interests or personal relationships that could have influenced the work reported in this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Du, S., Zhu, H., Lin, G. et al. Object semantic analysis for image captioning. Multimed Tools Appl 82, 43179–43206 (2023). https://doi.org/10.1007/s11042-023-14596-7