DOI: 10.1145/3573942.3574052
Research article

Image Captioning Method Based on Layer Feature Attention

Published: 16 May 2023

Abstract

High-level image features are often used to represent scene information in image captioning because they carry rich semantic content. However, high-level features express only global information: the local information of small objects is easily lost, which makes it difficult to generate descriptions of small objects and to meet finer-grained description requirements. To capture the rich semantic information in an image while retaining descriptions of small objects, an image captioning method based on layer feature attention is proposed. A layer feature attention module is designed on top of the existing Transformer decoder structure. Using the multi-layer features of the image, each decoder stack layer determines how much attention to pay to each layer's features during decoding and dynamically learns the similarity between each layer's features and the sequence semantic features, improving the quality of the generated sentences.
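The core mechanism described above, where each decoding step weighs the pooled features of every encoder layer by their similarity to the current sequence state, can be sketched as a scaled dot-product attention over layers. This is a minimal illustration, not the authors' implementation; the projection matrices `Wq`, `Wk`, `Wv` and the pooling of per-layer features are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_feature_attention(h, layer_feats, Wq, Wk, Wv):
    """Attend over per-layer image features with the decoder state.

    h:           (d,)   sequence semantic state at the current decoding step
    layer_feats: (L, d) one pooled feature vector per encoder layer
    Wq, Wk, Wv:  (d, d) hypothetical projection matrices

    Returns the attended feature vector and the attention weights
    over the L layers (one weight per layer, summing to 1).
    """
    q = Wq @ h                          # query from the sequence state
    k = layer_feats @ Wk.T              # one key per layer
    v = layer_feats @ Wv.T              # one value per layer
    scores = k @ q / np.sqrt(len(q))    # scaled dot-product similarity
    alpha = softmax(scores)             # dynamic per-layer attention
    return alpha @ v, alpha
```

In this sketch, a layer whose pooled features align with the current decoder state receives the largest weight, so low-level layers carrying small-object detail can dominate when the sentence is describing a small object.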


    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. EfficientNet
    2. Image captioning
    3. Layer Feature Attention
    4. Multi-layer features
    5. Transformer

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022
