
Visual Question Answering Model Based on CAM and GCN

Published: 16 May 2023

Abstract

Visual Question Answering (VQA) is a challenging problem that combines concepts from computer vision and natural language processing. In recent years, researchers have proposed many methods for this typical multimodal problem. Most existing methods use a two-stream strategy: image and question features are computed separately and fused with various techniques, rarely relying on higher-level image representations that capture semantic and spatial relationships. To address these problems, a visual question answering model (CAM-GCN) based on a Cooperative Attention Mechanism (CAM) and a Graph Convolutional Network (GCN) is proposed. First, a graph learning module is combined with graph convolution to learn a question-specific graph representation of the input image, capturing both an image representation conditioned on the specific question and the dependencies among image regions. Finally, the fused features are further optimized through feature enhancement. Test results on the VQA v2 dataset show that CAM-GCN achieves better classification results than current representative models.
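The core idea of the abstract — learning a question-conditioned graph over image regions and then applying graph convolution — can be sketched minimally as follows. This is an illustrative NumPy sketch, not the authors' exact architecture: the conditioning scheme (element-wise modulation of projected region features by the question vector), the weight matrices `W_g` and `W_c`, and the dimensions are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_conditioned_graph(regions, question, W_g):
    """Learn a dense adjacency over image regions, conditioned on the question.

    regions:  (K, d) region features; question: (d,) question embedding.
    Each row of the result is softmax-normalized, so it acts as an
    attention-style weighting over neighboring regions.
    """
    h = (regions @ W_g) * question        # modulate projected regions by the question
    return softmax(h @ h.T, axis=-1)      # pairwise affinities -> row-stochastic adjacency

def gcn_layer(regions, adj, W_c):
    """One graph-convolution step: aggregate neighbors, transform, ReLU."""
    return np.maximum(adj @ regions @ W_c, 0.0)

rng = np.random.default_rng(0)
K, d = 4, 8                               # 4 image regions, feature dim 8
regions = rng.normal(size=(K, d))
question = rng.normal(size=(d,))
adj = question_conditioned_graph(regions, question, rng.normal(size=(d, d)))
out = gcn_layer(regions, adj, rng.normal(size=(d, d)))
print(adj.shape, out.shape)               # (4, 4) (4, 8)
```

Because the adjacency depends on the question, the same image yields different region graphs for different questions, which is what lets the graph convolution propagate question-relevant relational information between regions.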


Published In

AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
September 2022
1221 pages
ISBN:9781450396899
DOI:10.1145/3573942

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Feature Fusion
  2. Graph Convolutional Network
  3. Graph Learning
  4. Multimodal Learning
  5. Visual Question Answering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

AIPR 2022
