DOI: 10.1145/3573942.3574061
Research article

GA-SATIC: Semi-autoregressive Transformer Image Captioning Based on Geometric Attention

Published: 16 May 2023

Abstract

Image captioning is a cross-domain task spanning image understanding and natural language processing. Most current models follow an encoder-decoder architecture: the encoder takes image feature vectors as input, and the decoder generates the caption either autoregressively or non-autoregressively. However, most models extract image features from the region proposals of an object detector without considering the relative spatial relationships between objects. In addition, autoregressive decoding generates the caption step by step, with each word conditioned on the words already produced, which leads to high latency at inference time. Non-autoregressive approaches have been proposed to speed up inference by generating all words in parallel, but they reduce the quality of the generated captions. To address these problems, this paper proposes a semi-autoregressive Transformer model based on geometric attention. The encoder fuses the relative spatial relationships between detected objects through geometric attention and appearance attention, enhancing its spatial awareness; the decoder adopts a semi-autoregressive decoding scheme that is serial globally and parallel locally, allowing the model to achieve a better trade-off between decoding speed and caption accuracy. Extensive experiments and ablation studies on the MSCOCO dataset show that the model achieves better performance than state-of-the-art models.
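The geometric attention described above follows the relation-network idea of folding pairwise bounding-box geometry into self-attention over the detected regions. Below is a minimal, single-head PyTorch sketch of how geometric and appearance attention could be fused; the 4-dimensional relative-geometry encoding and the small `geo` MLP are assumptions modeled on the relation-network formulation, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def relative_geometry(boxes):
    """Pairwise relative geometry between region boxes given as (cx, cy, w, h)."""
    cx, cy, w, h = boxes.unbind(-1)                                      # each (N,)
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)                         # (N, N, 4)


class GeometricSelfAttention(nn.Module):
    """Single-head sketch: appearance attention logits biased by geometric weights."""

    def __init__(self, d_model, d_geo=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # assumed small MLP that maps relative box geometry to a scalar weight
        self.geo = nn.Sequential(nn.Linear(4, d_geo), nn.ReLU(), nn.Linear(d_geo, 1))

    def forward(self, feats, boxes):
        # feats: (N, d_model) appearance features of detected regions; boxes: (N, 4)
        d = feats.size(-1)
        app_logits = self.q(feats) @ self.k(feats).t() / d ** 0.5        # (N, N)
        geo_w = F.relu(self.geo(relative_geometry(boxes))).squeeze(-1)   # (N, N)
        # bias the appearance logits by the log of the (clamped) geometric weights
        attn = torch.softmax(app_logits + geo_w.clamp(min=1e-6).log(), dim=-1)
        return attn @ self.v(feats)
```

In practice such a layer would be multi-headed and masked for padded regions; the sketch only illustrates how geometric weights bias the appearance logits before the softmax.

The "serial globally, parallel locally" decoding can likewise be sketched as greedy group-wise generation: each step emits a fixed-size group of words in parallel, and successive groups are produced autoregressively. The `decoder(tokens, memory, num_future=...)` call below is a hypothetical interface for a decoder head that predicts several future positions at once, not the paper's exact API.

```python
import torch


@torch.no_grad()
def semi_autoregressive_decode(decoder, memory, bos_id, eos_id,
                               max_len=20, group_size=2):
    """Greedy group-wise decoding: parallel within a group, serial across groups."""
    tokens = torch.full((1, 1), bos_id, dtype=torch.long)                # start with <bos>
    while tokens.size(1) < max_len:
        # Hypothetical decoder returning logits for the next `group_size` positions
        # at once, conditioned on the image memory and all previously emitted tokens.
        logits = decoder(tokens, memory, num_future=group_size)          # (1, group_size, vocab)
        next_group = logits.argmax(dim=-1)                               # (1, group_size)
        tokens = torch.cat([tokens, next_group], dim=1)
        if (next_group == eos_id).any():                                 # stop once <eos> appears
            break
    return tokens
```

With `group_size=1` this reduces to ordinary autoregressive decoding; larger groups reduce the number of sequential decoding steps at some cost in accuracy, which is the speed/quality trade-off the abstract refers to.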



    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 May 2023


    Author Tags

    1. Autoregressive approach
    2. Geometric attention
    3. Image captioning
    4. Non-autoregressive approach

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022
