Abstract
Image captioning aims to generate a grammatically correct and semantically accurate natural language description of a given image. To better capture the complex information contained in an image and to draw on relevant knowledge beyond the image itself, this paper proposes an end-to-end, Transformer-based image captioning framework named Enhance Understanding and Reasoning Ability for Image Captioning (EURAIC). EURAIC strengthens both visual understanding and caption reasoning to improve captioning performance. Specifically, the semantic features of the core objects detected in the image are used to guide the visual features, which also encode the spatial position relationships between the objects. An external knowledge network then supplies information beyond the inherent content of the image. In this way, a high-quality caption can be generated for the given image. Experiments on the MSCOCO dataset show that our method outperforms the baseline model and is comparable to other state-of-the-art methods.
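The abstract describes the two mechanisms only at a high level. As a rough illustration, the following minimal PyTorch sketch (our own hypothetical construction, not the authors' code) shows how object-label (semantic) embeddings can serve as attention queries over region visual features, with a learned bias computed from pairwise box geometry added to the attention scores. All names here (`SemanticGuidedAttention`, the `geo` projection, the box encoding) are illustrative assumptions.

```python
# Hypothetical sketch of the two ideas named in the abstract:
# (1) attention over object visual features biased by pairwise spatial geometry,
# (2) queries built from object-label embeddings so semantics guide vision.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidedAttention(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)   # queries from semantic embeddings
        self.k = nn.Linear(d_model, d_model)   # keys from visual features
        self.v = nn.Linear(d_model, d_model)   # values from visual features
        self.geo = nn.Linear(4, 1)             # scalar bias from box geometry

    def forward(self, sem, vis, boxes):
        # sem:   (B, N, d) label embeddings of the detected objects
        # vis:   (B, N, d) region visual features (e.g., from a Faster R-CNN detector)
        # boxes: (B, N, 4) normalized (cx, cy, w, h) for each region
        q, k, v = self.q(sem), self.k(vis), self.v(vis)
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5   # (B, N, N)
        # pairwise relative geometry on a log scale, as in geometry-aware attention
        rel = torch.log((boxes.unsqueeze(2) - boxes.unsqueeze(1)).abs().clamp(min=1e-3))
        scores = scores + self.geo(rel).squeeze(-1)            # inject spatial relations
        return F.softmax(scores, dim=-1) @ v                   # semantics-guided visual features

# toy usage
attn = SemanticGuidedAttention()
sem, vis, boxes = torch.randn(2, 10, 512), torch.randn(2, 10, 512), torch.rand(2, 10, 4)
out = attn(sem, vis, boxes)  # (2, 10, 512)
```

The sketch only shows one attention layer; in a full Transformer captioner such a layer would sit in the encoder, with the external knowledge network contributing additional key/value entries on the decoder side.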
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Nos. 61966004, 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Innovation Project of Guangxi Graduate Education (No. XYCBZ2021002), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
Ethics declarations
Conflict of Interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wei, J., Li, Z., Zhu, J. et al. Enhance understanding and reasoning ability for image captioning. Appl Intell 53, 2706–2722 (2023). https://doi.org/10.1007/s10489-022-03624-y