Abstract
The objective of image captioning is to describe the depicted objects and their relationships precisely. Previous studies have mainly relied on region features, or on a combination of region features and geometric coordinates, to perform this task. A significant limitation of these methods is that they do not incorporate grid features and their geometric coordinates, so the resulting captions inadequately capture object-related information in the global context. To overcome this limitation, we employ the Swin Transformer and Deformable DETR to extract new grid and region features together with their respective coordinates. We then integrate the geometric coordinates of the grids and regions into their corresponding features and incorporate the grid features into the region features. The encoder processes these features, and the decoder uses them to generate captions. Quantitative and qualitative analysis of the experimental results shows that our features and captioning model outperform previous methods. Specifically, our approach achieves superior inference accuracy on the COCO and Nocaps image captioning benchmarks; compared with the baseline method, our model improves by 4.3%, reaching a CIDEr score of 136.9.
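To make the described pipeline concrete, the following is a minimal sketch of the feature fusion the abstract outlines: geometric coordinates are embedded and added to their corresponding grid and region features, and the region features then attend to the grid features. All names, shapes, and layer choices here are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch (illustrative only): coordinates injected into grid/region
# features, then grid context fused into region features via cross-attention.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Project 4-d geometric coordinates (e.g. normalized x, y, w, h)
        # into the feature dimension so they can be added to the features.
        self.grid_coord_embed = nn.Linear(4, dim)
        self.region_coord_embed = nn.Linear(4, dim)
        # Cross-attention: region features (queries) attend to grid
        # features (keys/values) so regions absorb global grid context.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, grid_feats, grid_coords, region_feats, region_coords):
        # Inject geometry into each feature stream.
        grid = grid_feats + self.grid_coord_embed(grid_coords)
        region = region_feats + self.region_coord_embed(region_coords)
        # Incorporate grid information into the region features.
        fused, _ = self.cross_attn(query=region, key=grid, value=grid)
        return fused + region  # residual connection

# Toy usage: 49 grid cells and 10 detected regions, feature dim 512.
grid_feats = torch.randn(2, 49, 512)
grid_coords = torch.rand(2, 49, 4)
region_feats = torch.randn(2, 10, 512)
region_coords = torch.rand(2, 10, 4)
out = FeatureFusion()(grid_feats, grid_coords, region_feats, region_coords)
print(out.shape)  # torch.Size([2, 10, 512])
```

In such a design, the fused region features would be passed to the encoder, whose output conditions the caption decoder.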



Data availability and access
The processed dataset and source code are available at: https://github.com/PhoenixZhi/Incorporating-Grid-and-Region-Features-Enhance-Image-Captioning.git
Acknowledgements
This work was supported by the National Natural Science Foundation of China [Grant No. U21A20390] and the Jilin Provincial Natural Science Foundation [Grant No. YDZJ202101ZYTS128].
Author information
Contributions
Fengzhi Zhao obtained and processed the datasets. Fengzhi Zhao and He Zhao designed the new method; Zhezhou Yu provided suggestions and analyzed the results. Fengzhi Zhao and Tao Wang wrote the manuscript. Tian Bai reviewed and edited the manuscript. All authors contributed to this work and approved the submitted version.
Ethics declarations
Ethical standard
We confirm that this manuscript has not been submitted to multiple journals or conferences simultaneously. We abide by ethical standards, respect the autonomy of participants in the use of data, and ensure the legal, transparent, and secure use of data.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, F., Yu, Z., Zhao, H. et al. Integrating grid features and geometric coordinates for enhanced image captioning. Appl Intell 54, 231–245 (2024). https://doi.org/10.1007/s10489-023-05198-9