Research Article · DOI: 10.1145/3672919.3672964

MGTANet: Multi-Scale Guided Token Attention Network for Image Captioning

Published: 24 July 2024

Abstract

Recent studies have shown that grid features can play a role similar to that of region features in vision-language tasks. At the same time, the Transformer and its variants have demonstrated outstanding performance in image captioning. However, because grid features are fine-grained, self-attention tends to concentrate excessively on a few adjacent visual features and fails to establish effective connections with global visual features. Object context information is therefore lost, which impairs model performance. To address this issue, this paper proposes a Multi-Scale Guided Token Attention Network (MGTANet). Specifically, we introduce a guided token self-attention (GTSA) mechanism in the encoder. Using a mixed-scale extraction method, we generate "guided tokens" and feed them into self-attention, reducing the number of grid features in subsequent processing. These guided tokens capture features at different scales and shrink the set of grid features required by self-attention, lowering the computational complexity of the model while preserving crucial global information. Additionally, we develop an Adaptive Gated Attention (AGA) mechanism in the decoder, which flexibly combines global visual information with local detail features and integrates them into the decoder to guide caption generation. To substantiate the model's effectiveness, we conduct extensive experiments and visualizations on the MS-COCO dataset. The results show that, under similar experimental conditions, MGTANet achieves quality competitive with existing models.
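The abstract only outlines the two mechanisms, so the snippet below is a minimal, hypothetical PyTorch sketch of the general ideas: multi-scale pooling of grid features to produce a compact set of "guided tokens" that attention operates over, and a sigmoid gate that blends global and local visual features. The class names, pooling scales, dimensions, and gate design are assumptions made for illustration and are not taken from the paper's actual MGTANet implementation.

```python
# Illustrative sketch only; layer names, scales, and wiring are assumptions,
# not the paper's published architecture.
import torch
import torch.nn as nn


class GuidedTokenSelfAttention(nn.Module):
    """Sketch of GTSA: pool grid features at several scales into a small set
    of 'guided tokens', then let the grid attend to those tokens instead of
    to every other grid cell."""

    def __init__(self, dim=512, heads=8, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales  # assumed pooling scales (1x1, 2x2, 4x4)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, grid):                      # grid: (B, H*W, D)
        b, n, d = grid.shape
        h = w = int(n ** 0.5)
        fmap = grid.transpose(1, 2).reshape(b, d, h, w)
        # Mixed-scale extraction: adaptive pooling yields 1 + 4 + 16 = 21
        # guided tokens, far fewer than the H*W grid cells.
        tokens = [
            nn.functional.adaptive_avg_pool2d(fmap, s).flatten(2).transpose(1, 2)
            for s in self.scales
        ]
        guided = torch.cat(tokens, dim=1)         # (B, M, D), M << H*W
        # Grid features query the compact guided-token set, so attention cost
        # scales with M rather than with H*W.
        out, _ = self.attn(query=grid, key=guided, value=guided)
        return self.norm(grid + out)


class AdaptiveGatedAttention(nn.Module):
    """Sketch of AGA: a sigmoid gate that mixes a global visual summary with
    local detail features before they reach the decoder."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, local_feat, global_feat):   # both (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([local_feat, global_feat], dim=-1)))
        return g * local_feat + (1 - g) * global_feat


if __name__ == "__main__":
    grid = torch.randn(2, 49, 512)                # e.g. a 7x7 grid of features
    enc = GuidedTokenSelfAttention()
    encoded = enc(grid)                           # (2, 49, 512)
    aga = AdaptiveGatedAttention()
    global_summary = grid.mean(dim=1, keepdim=True).expand_as(grid)
    print(aga(encoded, global_summary).shape)     # torch.Size([2, 49, 512])
```

In this sketch the attention cost drops from O((HW)^2) to O(HW x M) with M fixed by the chosen pooling scales; how the actual model injects guided tokens into self-attention and constructs its gate may differ.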


Published In

CSAIDE '24: Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy, March 2024, 676 pages. ISBN: 9798400718212. DOI: 10.1145/3672919.

Publisher: Association for Computing Machinery, New York, NY, United States.
