Abstract
The objective of image captioning is to describe the depicted objects and their relationships precisely. Previous studies have mainly relied on region features, or on a combination of region features and geometric coordinates, to perform this task. A significant limitation of these methods is that they do not incorporate grid features and their geometric coordinates, so the resulting captions inadequately capture object-related information in the global context. To overcome this limitation, we employ the Swin Transformer and Deformable DETR to extract new grid and region features together with their respective coordinates. We then integrate the geometric coordinates of the grids and regions into their corresponding features and incorporate the grid features into the region features. The encoder processes these features, and the decoder uses them to generate captions. Quantitative and qualitative analysis of the experimental results shows that our features and captioning model outperform previous methods. Specifically, our approach achieves superior inference accuracy on the COCO and Nocaps image captioning benchmarks; compared with the baseline method, our model improves by 4.3%, reaching a CIDEr score of 136.9.
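To make the described pipeline concrete, the following is a minimal sketch of the feature fusion the abstract outlines: geometric coordinates are embedded and added to their corresponding grid and region features, and the region features then attend to the grid features. All names, shapes, and layer choices here are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch (illustrative only): coordinates injected into grid/region
# features, then grid context fused into region features via cross-attention.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Project 4-d geometric coordinates (e.g. normalized x, y, w, h)
        # into the feature dimension so they can be added to the features.
        self.grid_coord_embed = nn.Linear(4, dim)
        self.region_coord_embed = nn.Linear(4, dim)
        # Cross-attention: region features (queries) attend to grid
        # features (keys/values) so regions absorb global grid context.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, grid_feats, grid_coords, region_feats, region_coords):
        # Inject geometry into each feature stream.
        grid = grid_feats + self.grid_coord_embed(grid_coords)
        region = region_feats + self.region_coord_embed(region_coords)
        # Incorporate grid information into the region features.
        fused, _ = self.cross_attn(query=region, key=grid, value=grid)
        return fused + region  # residual connection

# Toy usage: 49 grid cells and 10 detected regions, feature dim 512.
grid_feats = torch.randn(2, 49, 512)
grid_coords = torch.rand(2, 49, 4)
region_feats = torch.randn(2, 10, 512)
region_coords = torch.rand(2, 10, 4)
out = FeatureFusion()(grid_feats, grid_coords, region_feats, region_coords)
print(out.shape)  # torch.Size([2, 10, 512])
```

In such a design, the fused region features would be passed to the encoder, whose output conditions the caption decoder.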



Data availability and access
The processed dataset and source code are available at: https://github.com/PhoenixZhi/Incorporating-Grid-and-Region-Features-Enhance-Image-Captioning.git
Acknowledgements
This work was supported by the National Natural Science Foundation of China [Grant No. U21A20390] and the Jilin Provincial Natural Science Foundation [Grant No. YDZJ202101ZYTS128].
Author information
Contributions
Fengzhi Zhao obtained and processed the datasets. Fengzhi Zhao and He Zhao designed the new method; Zhezhou Yu provided suggestions and analyzed the results. Fengzhi Zhao and Tao Wang wrote the manuscript. Tian Bai reviewed and edited the manuscript. All authors contributed to this work and approved the submitted version.
Ethics declarations
Ethical standard
We confirm that this manuscript has not been submitted to multiple journals or conferences simultaneously. We abide by ethical standards, respect the autonomy of participants in the use of data, and ensure the legal, transparent, and secure use of data.
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, F., Yu, Z., Zhao, H. et al. Integrating grid features and geometric coordinates for enhanced image captioning. Appl Intell 54, 231–245 (2024). https://doi.org/10.1007/s10489-023-05198-9