DOI: 10.1145/3573942.3574052
Research article

Image Captioning Method Based on Layer Feature Attention

Published: 16 May 2023

Abstract

High-level image features are often used to represent scene information in image captioning because they carry rich semantic content. However, high-level features express only global information: the local information of small objects is easily lost, which makes it difficult to generate descriptions of small objects and to meet finer-grained description requirements. To capture the rich semantic information in an image while retaining descriptions of small objects, an image captioning method based on layer feature attention is proposed. A layer feature attention module is designed on top of the existing Transformer decoder structure. Using the multi-layer features of the image, each decoder stack layer determines how much attention to pay to each layer's features during decoding and dynamically learns the similarity between each layer's features and the sequence semantic features, improving the quality of the generated sentences.
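The core mechanism described above, where each decoding step weighs the pooled features of every encoder layer by their similarity to the current sequence state, can be sketched as a scaled dot-product attention over layers. This is a minimal illustration, not the authors' implementation; the projection matrices `Wq`, `Wk`, `Wv` and the pooling of per-layer features are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_feature_attention(h, layer_feats, Wq, Wk, Wv):
    """Attend over per-layer image features with the decoder state.

    h:           (d,)   sequence semantic state at the current decoding step
    layer_feats: (L, d) one pooled feature vector per encoder layer
    Wq, Wk, Wv:  (d, d) hypothetical projection matrices

    Returns the attended feature vector and the attention weights
    over the L layers (one weight per layer, summing to 1).
    """
    q = Wq @ h                          # query from the sequence state
    k = layer_feats @ Wk.T              # one key per layer
    v = layer_feats @ Wv.T              # one value per layer
    scores = k @ q / np.sqrt(len(q))    # scaled dot-product similarity
    alpha = softmax(scores)             # dynamic per-layer attention
    return alpha @ v, alpha
```

In this sketch, a layer whose pooled features align with the current decoder state receives the largest weight, so low-level layers carrying small-object detail can dominate when the sentence is describing a small object.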


    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. EfficientNet
    2. Image captioning
    3. Layer Feature Attention
    4. Multi-layer features
    5. Transformer

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022
