DOI: 10.1145/3573942.3574065
research-article

Image Description Generation Method Based on X-Linear Attention Mechanism

Published: 16 May 2023

Abstract

Existing image description models cannot capture high-order interactions between multimodal features. To address this, the paper introduces the X-Linear attention mechanism, which uses bilinear pooling and the ELU activation function to model those interactions. The mechanism also applies spatial and channel attention to strengthen the model's representational capacity and its ability to generate image description sentences. Experimental results on the MSCOCO dataset show that the method is effective and brings a marked improvement on every evaluation metric.
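The components named in the abstract (bilinear pooling, ELU, spatial attention, channel attention) can be illustrated with a small NumPy sketch. It is not the authors' implementation: this page gives no equations, so the structure loosely follows the X-Linear attention block as it is commonly described, and the function names, parameter names, dimensions, and random weights below are all assumptions chosen for illustration.

```python
import numpy as np


def elu(x, alpha=1.0):
    # ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, d_q, d_k, d_v, d_b, d_c, scale=0.1):
    # Hypothetical random weights; a trained model would learn these.
    shapes = {
        "W_qk": (d_q, d_b), "W_k": (d_k, d_b),   # query-key bilinear pooling
        "W_qv": (d_q, d_b), "W_v": (d_v, d_b),   # query-value bilinear pooling
        "W_emb": (d_b, d_c),                     # embedding of the joint feature
        "w_s": (d_c,),                           # spatial attention scorer
        "W_c": (d_c, d_b),                       # channel attention (excitation)
    }
    return {name: scale * rng.normal(size=shape) for name, shape in shapes.items()}


def x_linear_attention(query, keys, values, p):
    """query: (d_q,), keys: (n, d_k), values: (n, d_v) -> attended feature (d_b,)."""
    # Low-rank bilinear pooling: element-wise product of projected query and keys.
    b_k = elu(keys @ p["W_k"]) * elu(query @ p["W_qk"])      # (n, d_b)
    b_emb = np.maximum(b_k @ p["W_emb"], 0.0)                # (n, d_c)

    # Spatial attention: one normalized weight per image region.
    beta_s = softmax(b_emb @ p["w_s"])                       # (n,)

    # Channel attention: average over regions (squeeze), then gate (excitation).
    beta_c = sigmoid(b_emb.mean(axis=0) @ p["W_c"])          # (d_b,)

    # Bilinear pooling of query and values, aggregated with both attentions.
    b_v = elu(values @ p["W_v"]) * elu(query @ p["W_qv"])    # (n, d_b)
    attended = beta_c * (beta_s @ b_v)                       # (d_b,)
    return attended, beta_s, beta_c


# Example: 3 region features of size 6, a query of size 8.
rng = np.random.default_rng(0)
params = init_params(rng, d_q=8, d_k=6, d_v=6, d_b=5, d_c=4)
attended, beta_s, beta_c = x_linear_attention(
    rng.normal(size=8), rng.normal(size=(3, 6)), rng.normal(size=(3, 6)), params
)
print(attended.shape, beta_s.shape, beta_c.shape)
```

The element-wise product of the two projected vectors is what makes the interaction second-order. In this sketch the spatial weights sum to one across regions, while the channel weights act as independent gates between 0 and 1.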



    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022

