DOI: 10.1145/3269206.3271765
Research Article

Adversarial Learning of Answer-Related Representation for Visual Question Answering

Published: 17 October 2018

Abstract

Visual Question Answering (VQA) aims to learn a joint embedding of a question sentence and the corresponding image in order to infer the answer. Existing approaches learn this joint embedding without considering answer-related information, so the learned representation does not effectively reflect the answer to the question. To address this problem, this paper proposes a novel method, Adversarial Learning of Answer-Related Representation (ALARR), for visual question answering, which seeks an effective answer-related representation for the question-image pair through adversarial learning between two processes. The embedding learning process aims to generate modality-invariant joint representations for the question-image and question-answer pairs, respectively, while trying to confuse the other process, an embedding discriminator, which attempts to tell apart the representations derived from the two kinds of pairs. Specifically, the joint embedding of the question-image pair is learned by a three-level attention model, and the joint representation of the question-answer pair is learned by a semantic integration model. Through adversarial learning, answer-related information is better preserved in the representation. An answer predictor then infers the answer from this answer-related representation. Experiments conducted on two widely used VQA benchmark datasets demonstrate that the proposed model outperforms state-of-the-art approaches.
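The abstract describes a GAN-style game between the embedding-learning process and an embedding discriminator. The sketch below illustrates that adversarial objective in a minimal, hypothetical form; all module names, dimensions, and shapes are assumptions for illustration, not the authors' implementation, and the paper's three-level attention and semantic integration models are abstracted away as the producers of the two joint embeddings.

```python
import torch
import torch.nn as nn

DIM = 512  # hypothetical embedding size; not specified in the abstract

class EmbeddingDiscriminator(nn.Module):
    """Predicts whether a joint embedding came from a question-image pair
    (label 1) or a question-answer pair (label 0)."""
    def __init__(self, dim: int = DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # raw logits

def adversarial_losses(z_qi, z_qa, disc):
    """z_qi: joint embeddings of (question, image) pairs, shape (B, DIM);
    z_qa: joint embeddings of (question, answer) pairs, shape (B, DIM).
    The discriminator is trained to separate the two sources, while the
    embedding networks are trained to confuse it, pushing the question-image
    embedding toward the answer-related question-answer embedding."""
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(z_qi.size(0), 1, device=z_qi.device)
    zeros = torch.zeros(z_qa.size(0), 1, device=z_qa.device)
    # Discriminator step: embeddings are detached so only disc updates.
    d_loss = bce(disc(z_qi.detach()), ones) + bce(disc(z_qa.detach()), zeros)
    # Embedding step: flipped labels make the two sources indistinguishable.
    g_loss = bce(disc(z_qi), zeros) + bce(disc(z_qa), ones)
    return d_loss, g_loss
```

In training, d_loss and g_loss would presumably be minimized in alternating steps, with the answer predictor's classification loss added on the embedding side so that answer-related information is preserved rather than erased.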


Information

Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
October 2018
2362 pages
ISBN: 9781450360142
DOI: 10.1145/3269206

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adversarial learning
  2. representation
  3. visual question answering

Qualifiers

  • Research-article

Funding Sources

  • Beijing Natural Science Foundation of China
  • National Natural Science Foundation of China
  • State Key Laboratory of Software Development Environment

Conference

CIKM '18

Acceptance Rates

CIKM '18 Paper Acceptance Rate: 147 of 826 submissions, 18%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%

Cited By

  • (2025) Vision-language representation learning with breadth and depth attention pre-training. Knowledge-Based Systems, 112941. DOI: 10.1016/j.knosys.2024.112941. Online publication date: Jan-2025.
  • (2024) Learning Prompt-Level Quality Variance for Cost-Effective Text-to-Image Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3847-3851. DOI: 10.1145/3627673.3679954. Online publication date: 21-Oct-2024.
  • (2023) The multi-modal fusion in visual question answering: a review of attention mechanisms. PeerJ Computer Science, 9, e1400. DOI: 10.7717/peerj-cs.1400. Online publication date: 30-May-2023.
  • (2023) Multimodal Fusion with Dual-Attention Based on Textual Double-Embedding Networks for Rumor Detection. Applied Sciences, 13(8), 4886. DOI: 10.3390/app13084886. Online publication date: 13-Apr-2023.
  • (2022) Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering. IEEE Transactions on Image Processing, 31, 1684-1696. DOI: 10.1109/TIP.2022.3142526. Online publication date: 2022.
  • (2022) ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering. IEEE Transactions on Cybernetics, 52(6), 4520-4533. DOI: 10.1109/TCYB.2020.3029423. Online publication date: Jun-2022.
  • (2022) Improving visual question answering by combining scene-text information. Multimedia Tools and Applications, 81(9), 12177-12208. DOI: 10.1007/s11042-022-12317-0. Online publication date: 1-Apr-2022.
  • (2021) Deep Attentive Multimodal Network Representation Learning for Social Media Images. ACM Transactions on Internet Technology, 21(3), 1-17. DOI: 10.1145/3417295. Online publication date: 16-Jun-2021.
  • (2021) Adversarial Learning With Multi-Modal Attention for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 32(9), 3894-3908. DOI: 10.1109/TNNLS.2020.3016083. Online publication date: Sep-2021.
  • (2021) Adversarial Multimodal Network for Movie Story Question Answering. IEEE Transactions on Multimedia, 23, 1744-1756. DOI: 10.1109/TMM.2020.3002667. Online publication date: 2021.
