DOI: 10.1145/3573942.3574072

Research on Image Description Generation Method Based on G-AoANet

Published: 16 May 2023

Abstract

Most image description generation methods in the attention-based encoder-decoder framework extract only local features from images. Although local features carry a relatively high semantic level, two problems remain: object loss, where important objects may be omitted from the generated description, and prediction error, where an object may be assigned to the wrong class. This paper proposes a G-AoANet model to address these problems. The model uses an attention mechanism to combine global features with local features, allowing it to selectively focus on both object and contextual information and thereby improve the quality of the generated descriptions. Experimental results show that the model improves the initially reported best CIDEr-D and SPICE scores on the MS COCO dataset by 9.3% and 5.1%, respectively.
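The core idea, attending over local region features while keeping the global image feature in play, can be illustrated with a minimal sketch. This is not the authors' G-AoANet implementation; the function names, the use of the global feature as the attention query, and the projection matrices `w_q`/`w_k` are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_global_local(local_feats, global_feat, w_q, w_k):
    """Attend over local region features using the global feature as the
    query, then concatenate the attended context with the global feature.

    local_feats: (n_regions, d) region-level features
    global_feat: (d,) image-level feature
    w_q, w_k:    (d, d_attn) hypothetical projection matrices
    """
    q = global_feat @ w_q                    # (d_attn,) query from global feature
    k = local_feats @ w_k                    # (n_regions, d_attn) keys from regions
    scores = k @ q / np.sqrt(k.shape[-1])    # scaled dot-product scores
    weights = softmax(scores)                # (n_regions,) attention weights
    context = weights @ local_feats          # (d,) attended local context
    return np.concatenate([context, global_feat])  # (2d,) fused feature

rng = np.random.default_rng(0)
d, n = 8, 5
fused = fuse_global_local(rng.normal(size=(n, d)), rng.normal(size=d),
                          rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(fused.shape)  # prints (16,)
```

Because the global feature both drives the attention weights and survives the concatenation, the fused representation retains contextual information even when the attention concentrates on a few regions, which is the intuition behind mitigating object loss and class-prediction errors.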



    Published In
    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Encoder-decoder
    2. Global attention
    3. Image description

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022

