DOI: 10.1145/3546607.3546614
research-article

Text-guided Attention Mechanism Fine-grained Image Classification

Published: 25 August 2022

Abstract

Scene text in natural images carries explicit semantic information and can provide important clues for solving computer vision problems. This paper focuses on using multimodal content, in the form of visual and textual cues, for fine-grained image classification and retrieval. A graph convolutional network performs multimodal reasoning, and relation-enhanced features are obtained by learning a common semantic space between the salient objects and the text found in images. With the resulting set of enhanced visual and textual features, the proposed model substantially outperforms existing methods on two different tasks: fine-grained classification and image retrieval with contextual text.
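The kind of multimodal reasoning described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the feature sizes (2048 for visual regions, 300 for word embeddings), the shared dimension of 64, the similarity-based adjacency, the random weights, and the helper names `project`, `row_softmax`, and `gcn_layer` are all assumptions chosen for illustration. The sketch projects both modalities into one space, treats every region and word as a graph node, and applies a single graph-convolution step.

```python
import numpy as np


def project(x, w):
    """Map features into the shared space and L2-normalize each row."""
    z = x @ w
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)


def row_softmax(s):
    """Numerically stable softmax applied to each row."""
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def gcn_layer(nodes, weight):
    """One graph-convolution step, H' = ReLU(A H W).

    A is a dense adjacency built from pairwise node similarity
    (an assumption; the paper may define its graph differently).
    """
    adj = row_softmax(nodes @ nodes.T)
    return np.maximum(adj @ nodes @ weight, 0.0)


rng = np.random.default_rng(0)

# Hypothetical inputs: 5 salient-region features and 3 scene-text embeddings.
visual = rng.normal(size=(5, 2048))
text = rng.normal(size=(3, 300))

# Hypothetical projection and graph weights (random here, learned in practice).
d = 64
w_v = rng.normal(scale=0.02, size=(2048, d))
w_t = rng.normal(scale=0.02, size=(300, d))
w_g = rng.normal(scale=0.1, size=(d, d))

# Stack both modalities as nodes of one graph in the shared space.
nodes = np.vstack([project(visual, w_v), project(text, w_t)])

# Relation-enhanced node features, then a pooled image-level descriptor.
enhanced = gcn_layer(nodes, w_g)
pooled = enhanced.mean(axis=0)
```

In a trained model the pooled descriptor would feed a classifier or a retrieval index. Here the weights are random, so the output only demonstrates the shapes and the data flow.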



    Published In

    ICVARS '22: Proceedings of the 2022 6th International Conference on Virtual and Augmented Reality Simulations
    March 2022
    119 pages
    ISBN:9781450387330
    DOI:10.1145/3546607
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. fine-grained image analysis
    2. graph neural network
    3. multimodal reasoning

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICVARS 2022
