
Integrating grid features and geometric coordinates for enhanced image captioning


Abstract

The objective of image captioning is to describe the depicted objects and their relationships precisely. To perform this task, previous studies have mainly relied on region features, or on a combination of region features and geometric coordinates. A significant limitation of these methods is that they do not incorporate grid features and their geometric coordinates, so the resulting captions inadequately capture object-related information in the global context. To overcome this limitation, we employ Swin Transformer and Deformable DETR to extract new grid and region features along with their respective coordinates. We then integrate the geometric coordinates of the grids and regions into their corresponding features and inject the grid features into the region features. The encoder operates on the resulting features, and the decoder generates the caption text from them. Quantitative and qualitative analyses of the experimental results show that our features and captioning model outperform previous methods. Specifically, our approach achieves superior inference accuracy on the COCO and Nocaps image captioning benchmarks; compared with the baseline method, our model achieves a 4.3% improvement on the CIDEr metric, reaching a score of 136.9.
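As a rough sketch of the pipeline described above: the example below assumes PyTorch, hypothetical module names and feature dimensions, normalised (x1, y1, x2, y2) boxes for both grid cells and detected regions, and cross-attention as one plausible way to inject grid features into region features. It is an illustration only, not the authors' implementation; the actual code is in the repository listed under "Data availability and access".

```python
# Minimal sketch (not the authors' code): coordinates are embedded and added
# to grid and region features, then grid context is injected into the regions.
import torch
import torch.nn as nn

class GeometryAwareFusion(nn.Module):
    def __init__(self, grid_dim=1024, region_dim=256, d_model=512, n_heads=8):
        super().__init__()
        self.grid_proj = nn.Linear(grid_dim, d_model)      # Swin Transformer grid features
        self.region_proj = nn.Linear(region_dim, d_model)  # Deformable DETR region features
        self.grid_geo = nn.Linear(4, d_model)               # grid-cell boxes (x1, y1, x2, y2)
        self.region_geo = nn.Linear(4, d_model)              # region boxes (x1, y1, x2, y2)
        # Assumed design choice: regions attend to grid cells via cross-attention
        self.grid_to_region = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, grid_feats, grid_boxes, region_feats, region_boxes):
        # Fuse each feature with an embedding of its geometric coordinates
        g = self.grid_proj(grid_feats) + self.grid_geo(grid_boxes)          # (B, N_g, d)
        r = self.region_proj(region_feats) + self.region_geo(region_boxes)  # (B, N_r, d)
        # Incorporate grid features into the region features
        r = r + self.grid_to_region(query=r, key=g, value=g)[0]
        return g, r  # both streams are then fed to the captioning encoder

# Dummy example: 49 grid cells and 10 detected regions per image
fusion = GeometryAwareFusion()
g, r = fusion(torch.randn(2, 49, 1024), torch.rand(2, 49, 4),
              torch.randn(2, 10, 256), torch.rand(2, 10, 4))
print(g.shape, r.shape)  # torch.Size([2, 49, 512]) torch.Size([2, 10, 512])
```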


Data availability and access

The processed dataset and source code are available at https://github.com/PhoenixZhi/Incorporating-Grid-and-Region-Features-Enhance-Image-Captioning.git


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. U21A20390) and the Jilin Provincial Natural Science Foundation (Grant No. YDZJ202101ZYTS128).

Author information


Contributions

Fengzhi Zhao obtained and processed the datasets. Fengzhi Zhao and He Zhao designed the new method; Zhezhou Yu provided suggestions and analyzed the results. Fengzhi Zhao and Tao Wang wrote the manuscript. Tian Bai reviewed and edited the manuscript. All authors contributed to this work and approved the submitted version.

Corresponding author

Correspondence to Tian Bai.

Ethics declarations

Ethical standard

We confirm that this manuscript has not been submitted to multiple journals or conferences simultaneously. We abide by ethical standards, respect the autonomy of participants in the use of data, and ensure that data are used legally, transparently, and securely.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, F., Yu, Z., Zhao, H. et al. Integrating grid features and geometric coordinates for enhanced image captioning. Appl Intell 54, 231–245 (2024). https://doi.org/10.1007/s10489-023-05198-9
