Research Article · DOI: 10.1145/3672919.3672964

MGTANet: Multi-Scale Guided Token Attention Network for Image Captioning

Published: 24 July 2024

Abstract

Recent studies have shown that grid features can play a role similar to that of region features in vision-language tasks. At the same time, the Transformer and its variants have demonstrated outstanding performance in image captioning. However, because grid features are fine-grained, self-attention tends to concentrate excessively on a few adjacent visual features and fails to establish effective connections with global visual features. Object context information is therefore lost, which impairs model performance. To address this issue, this paper proposes a Multi-Scale Guided Token Attention Network (MGTANet). Specifically, we introduce a guided token self-attention (GTSA) mechanism in the encoder. Using a mixed-scale extraction method, we generate "guided tokens" and feed them into self-attention, reducing the number of grid features in subsequent processing. These guided tokens capture features at different scales and shrink the set of grid features required by self-attention, lowering the computational complexity of the model while preserving crucial global information. Additionally, we develop an Adaptive Gated Attention (AGA) mechanism in the decoder, which flexibly combines global visual information with local detail features and integrates them into the decoder to guide caption generation. To substantiate the model's effectiveness, we conduct extensive experiments and visualizations on the MS-COCO dataset. The results show that, under similar experimental conditions, MGTANet achieves quality competitive with existing models.
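The abstract only outlines the two mechanisms, so the snippet below is a minimal, hypothetical PyTorch sketch of the general ideas: multi-scale pooling of grid features to produce a compact set of "guided tokens" that attention operates over, and a sigmoid gate that blends global and local visual features. The class names, pooling scales, dimensions, and gate design are assumptions made for illustration and are not taken from the paper's actual MGTANet implementation.

```python
# Illustrative sketch only; layer names, scales, and wiring are assumptions,
# not the paper's published architecture.
import torch
import torch.nn as nn


class GuidedTokenSelfAttention(nn.Module):
    """Sketch of GTSA: pool grid features at several scales into a small set
    of 'guided tokens', then let the grid attend to those tokens instead of
    to every other grid cell."""

    def __init__(self, dim=512, heads=8, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales  # assumed pooling scales (1x1, 2x2, 4x4)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, grid):                      # grid: (B, H*W, D)
        b, n, d = grid.shape
        h = w = int(n ** 0.5)
        fmap = grid.transpose(1, 2).reshape(b, d, h, w)
        # Mixed-scale extraction: adaptive pooling yields 1 + 4 + 16 = 21
        # guided tokens, far fewer than the H*W grid cells.
        tokens = [
            nn.functional.adaptive_avg_pool2d(fmap, s).flatten(2).transpose(1, 2)
            for s in self.scales
        ]
        guided = torch.cat(tokens, dim=1)         # (B, M, D), M << H*W
        # Grid features query the compact guided-token set, so attention cost
        # scales with M rather than with H*W.
        out, _ = self.attn(query=grid, key=guided, value=guided)
        return self.norm(grid + out)


class AdaptiveGatedAttention(nn.Module):
    """Sketch of AGA: a sigmoid gate that mixes a global visual summary with
    local detail features before they reach the decoder."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, local_feat, global_feat):   # both (B, T, D)
        g = torch.sigmoid(self.gate(torch.cat([local_feat, global_feat], dim=-1)))
        return g * local_feat + (1 - g) * global_feat


if __name__ == "__main__":
    grid = torch.randn(2, 49, 512)                # e.g. a 7x7 grid of features
    enc = GuidedTokenSelfAttention()
    encoded = enc(grid)                           # (2, 49, 512)
    aga = AdaptiveGatedAttention()
    global_summary = grid.mean(dim=1, keepdim=True).expand_as(grid)
    print(aga(encoded, global_summary).shape)     # torch.Size([2, 49, 512])
```

In this sketch the attention cost drops from O((HW)^2) to O(HW x M) with M fixed by the chosen pooling scales; how the actual model injects guided tokens into self-attention and constructs its gate may differ.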


Published In

CSAIDE '24: Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy, March 2024, 676 pages. ISBN: 9798400718212. DOI: 10.1145/3672919.

Publisher: Association for Computing Machinery, New York, NY, United States.
