
Enhance understanding and reasoning ability for image captioning


Abstract

Image captioning aims to generate a grammatically correct and semantically accurate natural language description of a given image. To better capture the complex information contained in an image, and to draw on relevant external knowledge beyond the image itself, this paper proposes an end-to-end image captioning framework named Enhance Understanding and Reasoning Ability for Image Captioning (EURAIC), built on the Transformer model. EURAIC improves captioning performance by enhancing both visual understanding and caption reasoning. To this end, we use the semantic features of the core objects detected in the image to guide the visual features, which incorporate information about the spatial relationships between the objects. We then introduce an external knowledge network to obtain information beyond the inherent content of the image. In this way, a high-quality caption can be generated for the given image. Experiments on the MSCOCO dataset show that our method outperforms the baseline model and is comparable to other state-of-the-art methods.
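The abstract compresses two technical ideas: semantic features of detected objects guide the visual region features, and those visual features carry pairwise spatial-relation information. The PyTorch sketch below is a minimal illustration of one plausible reading of the first idea; the class name SemanticGuidedAttention, all tensor shapes, and the use of nn.MultiheadAttention are our assumptions, not the paper's actual EURAIC architecture.

```python
import torch
import torch.nn as nn


class SemanticGuidedAttention(nn.Module):
    """Cross-attention in which semantic embeddings of detected object
    labels guide the visual region features (a hypothetical reading of
    the abstract, not the authors' EURAIC module)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, semantic, visual):
        # semantic: (B, K, d) embeddings of K detected object labels
        # visual:   (B, N, d) features of N regions (e.g. Faster R-CNN boxes)
        guided, _ = self.attn(query=visual, key=semantic, value=semantic)
        return self.norm(visual + guided)  # residual + layer norm


def box_relation_features(boxes):
    """Pairwise spatial-relation features from (x1, y1, x2, y2) boxes, in
    the spirit of relation-network geometry encodings: log-scale relative
    position and size for every region pair."""
    cx = (boxes[..., 0] + boxes[..., 2]) / 2              # centre x, (B, N)
    cy = (boxes[..., 1] + boxes[..., 3]) / 2              # centre y, (B, N)
    w = (boxes[..., 2] - boxes[..., 0]).clamp(min=1e-3)   # box widths
    h = (boxes[..., 3] - boxes[..., 1]).clamp(min=1e-3)   # box heights
    dx = torch.log((cx[:, :, None] - cx[:, None, :]).abs().clamp(min=1e-3) / w[:, :, None])
    dy = torch.log((cy[:, :, None] - cy[:, None, :]).abs().clamp(min=1e-3) / h[:, :, None])
    dw = torch.log(w[:, :, None] / w[:, None, :])
    dh = torch.log(h[:, :, None] / h[:, None, :])
    return torch.stack([dx, dy, dw, dh], dim=-1)          # (B, N, N, 4)


# Minimal usage: fuse 36 region features with 5 object-label embeddings.
B, N, K, d = 2, 36, 5, 512
fused = SemanticGuidedAttention(d)(torch.randn(B, K, d), torch.randn(B, N, d))
print(fused.shape)  # torch.Size([2, 36, 512])
```

In a full model, the (B, N, N, 4) relation features would typically be projected and added as a bias to the visual self-attention logits, so that attention between two regions reflects their relative position and size.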
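The abstract's second idea, an external knowledge network supplying information beyond the image content, is often grounded in a commonsense graph such as ConceptNet; whether EURAIC uses ConceptNet specifically is our assumption. The helper below, conceptnet_neighbors, is hypothetical and simply retrieves related concepts for a detected object label via ConceptNet's public web API.

```python
import requests


def conceptnet_neighbors(term, limit=5):
    """Return up to `limit` (relation, concept, weight) triples related to
    `term`, using the public ConceptNet web API. A hypothetical helper for
    illustration, not part of the EURAIC codebase. Multiword terms should
    use underscores, per ConceptNet URI conventions."""
    url = f"http://api.conceptnet.io/c/en/{term}"
    edges = requests.get(url, timeout=10).json().get("edges", [])
    related = []
    for e in edges:
        end = e.get("end", {})
        if end.get("language") == "en" and end.get("label", "").lower() != term:
            related.append((e["rel"]["label"], end["label"], e.get("weight", 0.0)))
    # Highest-weight edges first: these are the strongest commonsense links.
    related.sort(key=lambda t: -t[2])
    return related[:limit]


# e.g. conceptnet_neighbors("frisbee") may surface links such as
# ("AtLocation", "park", ...), suggesting caption words that the image
# features alone would not provide.
```

Candidate concepts retrieved this way could then be embedded and attended over alongside the visual features during decoding, which is one common way such external knowledge is injected into a captioning model.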




Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61966004, 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Innovation Project of Guangxi Graduate Education (No. XYCBZ2021002), the Guangxi “Bagui Scholar” Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.

Author information


Corresponding author

Correspondence to Zhixin Li.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wei, J., Li, Z., Zhu, J. et al. Enhance understanding and reasoning ability for image captioning. Appl Intell 53, 2706–2722 (2023). https://doi.org/10.1007/s10489-022-03624-y

