DOI: 10.1145/3573942.3574061
Research article

GA-SATIC: Semi-autoregressive Transformer Image Captioning Based on Geometric Attention

Published: 16 May 2023

Abstract

Image captioning is a cross-domain task spanning image understanding and natural language processing. Most current models follow an encoder-decoder architecture: the encoder takes image feature vectors as input, and the decoder generates the caption either autoregressively or non-autoregressively. However, most models extract image features from the region proposals of an object detector without considering the relative spatial relationships between objects. In addition, autoregressive decoding generates the caption step by step, with each word conditioned on the words already produced, which leads to high latency at inference time. Non-autoregressive approaches have been proposed to speed up inference by generating all words in parallel, but they reduce the quality of the generated captions. To address these problems, this paper proposes a semi-autoregressive Transformer model based on geometric attention. The encoder fuses the relative spatial relationships between detected objects through geometric attention and appearance attention, enhancing its spatial awareness; the decoder adopts a semi-autoregressive decoding scheme that is serial globally and parallel locally, allowing the model to achieve a better trade-off between decoding speed and caption accuracy. Extensive experiments and ablation studies on the MSCOCO dataset show that the model achieves better performance than state-of-the-art models.
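The geometric attention described above follows the relation-network idea of folding pairwise bounding-box geometry into self-attention over the detected regions. Below is a minimal, single-head PyTorch sketch of how geometric and appearance attention could be fused; the 4-dimensional relative-geometry encoding and the small `geo` MLP are assumptions modeled on the relation-network formulation, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def relative_geometry(boxes):
    """Pairwise relative geometry between region boxes given as (cx, cy, w, h)."""
    cx, cy, w, h = boxes.unbind(-1)                                      # each (N,)
    dx = torch.log((cx[:, None] - cx[None, :]).abs().clamp(min=1e-3) / w[:, None])
    dy = torch.log((cy[:, None] - cy[None, :]).abs().clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)                         # (N, N, 4)


class GeometricSelfAttention(nn.Module):
    """Single-head sketch: appearance attention logits biased by geometric weights."""

    def __init__(self, d_model, d_geo=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # assumed small MLP that maps relative box geometry to a scalar weight
        self.geo = nn.Sequential(nn.Linear(4, d_geo), nn.ReLU(), nn.Linear(d_geo, 1))

    def forward(self, feats, boxes):
        # feats: (N, d_model) appearance features of detected regions; boxes: (N, 4)
        d = feats.size(-1)
        app_logits = self.q(feats) @ self.k(feats).t() / d ** 0.5        # (N, N)
        geo_w = F.relu(self.geo(relative_geometry(boxes))).squeeze(-1)   # (N, N)
        # bias the appearance logits by the log of the (clamped) geometric weights
        attn = torch.softmax(app_logits + geo_w.clamp(min=1e-6).log(), dim=-1)
        return attn @ self.v(feats)
```

In practice such a layer would be multi-headed and masked for padded regions; the sketch only illustrates how geometric weights bias the appearance logits before the softmax.

The "serial globally, parallel locally" decoding can likewise be sketched as greedy group-wise generation: each step emits a fixed-size group of words in parallel, and successive groups are produced autoregressively. The `decoder(tokens, memory, num_future=...)` call below is a hypothetical interface for a decoder head that predicts several future positions at once, not the paper's exact API.

```python
import torch


@torch.no_grad()
def semi_autoregressive_decode(decoder, memory, bos_id, eos_id,
                               max_len=20, group_size=2):
    """Greedy group-wise decoding: parallel within a group, serial across groups."""
    tokens = torch.full((1, 1), bos_id, dtype=torch.long)                # start with <bos>
    while tokens.size(1) < max_len:
        # Hypothetical decoder returning logits for the next `group_size` positions
        # at once, conditioned on the image memory and all previously emitted tokens.
        logits = decoder(tokens, memory, num_future=group_size)          # (1, group_size, vocab)
        next_group = logits.argmax(dim=-1)                               # (1, group_size)
        tokens = torch.cat([tokens, next_group], dim=1)
        if (next_group == eos_id).any():                                 # stop once <eos> appears
            break
    return tokens
```

With `group_size=1` this reduces to ordinary autoregressive decoding; larger groups reduce the number of sequential decoding steps at some cost in accuracy, which is the speed/quality trade-off the abstract refers to.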



    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 May 2023


    Author Tags

    1. Autoregressive approach
    2. Geometric attention
    3. Image captioning
    4. Non-autoregressive approach

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022
