DOI: 10.1145/3269206.3271765
Research Article

Adversarial Learning of Answer-Related Representation for Visual Question Answering

Published: 17 October 2018

Abstract

Visual Question Answering (VQA) aims to learn a joint embedding of a question sentence and the corresponding image in order to infer the answer. Existing approaches learn this joint embedding without considering answer-related information, so the learned representation does not effectively reflect the answer to the question. To address this problem, this paper proposes a novel method, Adversarial Learning of Answer-Related Representation (ALARR), for visual question answering, which seeks an effective answer-related representation for the question-image pair through adversarial learning between two processes. The embedding learning process aims to generate modality-invariant joint representations for the question-image and question-answer pairs, respectively, while trying to confuse the other process, an embedding discriminator, which attempts to tell apart the representations derived from the two kinds of pairs. Specifically, the joint embedding of the question-image pair is learned by a three-level attention model, and the joint representation of the question-answer pair is learned by a semantic integration model. Through adversarial learning, answer-related information is better preserved in the representation. An answer predictor then infers the answer from this answer-related representation. Experiments conducted on two widely used VQA benchmark datasets demonstrate that the proposed model outperforms state-of-the-art approaches.
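The abstract describes a GAN-style game between the embedding-learning process and an embedding discriminator. The sketch below illustrates that adversarial objective in a minimal, hypothetical form; all module names, dimensions, and shapes are assumptions for illustration, not the authors' implementation, and the paper's three-level attention and semantic integration models are abstracted away as the producers of the two joint embeddings.

```python
import torch
import torch.nn as nn

DIM = 512  # hypothetical embedding size; not specified in the abstract

class EmbeddingDiscriminator(nn.Module):
    """Predicts whether a joint embedding came from a question-image pair
    (label 1) or a question-answer pair (label 0)."""
    def __init__(self, dim: int = DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # raw logits

def adversarial_losses(z_qi, z_qa, disc):
    """z_qi: joint embeddings of (question, image) pairs, shape (B, DIM);
    z_qa: joint embeddings of (question, answer) pairs, shape (B, DIM).
    The discriminator is trained to separate the two sources, while the
    embedding networks are trained to confuse it, pushing the question-image
    embedding toward the answer-related question-answer embedding."""
    bce = nn.BCEWithLogitsLoss()
    ones = torch.ones(z_qi.size(0), 1, device=z_qi.device)
    zeros = torch.zeros(z_qa.size(0), 1, device=z_qa.device)
    # Discriminator step: embeddings are detached so only disc updates.
    d_loss = bce(disc(z_qi.detach()), ones) + bce(disc(z_qa.detach()), zeros)
    # Embedding step: flipped labels make the two sources indistinguishable.
    g_loss = bce(disc(z_qi), zeros) + bce(disc(z_qa), ones)
    return d_loss, g_loss
```

In training, d_loss and g_loss would presumably be minimized in alternating steps, with the answer predictor's classification loss added on the embedding side so that answer-related information is preserved rather than erased.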


Information

Published In

CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management
October 2018
2362 pages
ISBN: 9781450360142
DOI: 10.1145/3269206

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adversarial learning
  2. representation
  3. visual question answering

Qualifiers

  • Research-article

Funding Sources

  • Beijing Natural Science Foundation of China
  • National Natural Science Foundation of China
  • State Key Laboratory of Software Development Environment

Conference

CIKM '18

Acceptance Rates

CIKM '18 Paper Acceptance Rate: 147 of 826 submissions, 18%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%

Cited By

  • (2025) Vision-language representation learning with breadth and depth attention pre-training. Knowledge-Based Systems, 112941. DOI: 10.1016/j.knosys.2024.112941. Online publication date: Jan-2025.
  • (2024) Learning Prompt-Level Quality Variance for Cost-Effective Text-to-Image Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3847-3851. DOI: 10.1145/3627673.3679954. Online publication date: 21-Oct-2024.
  • (2023) The multi-modal fusion in visual question answering: a review of attention mechanisms. PeerJ Computer Science, 9, e1400. DOI: 10.7717/peerj-cs.1400. Online publication date: 30-May-2023.
  • (2023) Multimodal Fusion with Dual-Attention Based on Textual Double-Embedding Networks for Rumor Detection. Applied Sciences, 13(8), 4886. DOI: 10.3390/app13084886. Online publication date: 13-Apr-2023.
  • (2022) Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering. IEEE Transactions on Image Processing, 31, 1684-1696. DOI: 10.1109/TIP.2022.3142526. Online publication date: 2022.
  • (2022) ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering. IEEE Transactions on Cybernetics, 52(6), 4520-4533. DOI: 10.1109/TCYB.2020.3029423. Online publication date: Jun-2022.
  • (2022) Improving visual question answering by combining scene-text information. Multimedia Tools and Applications, 81(9), 12177-12208. DOI: 10.1007/s11042-022-12317-0. Online publication date: 1-Apr-2022.
  • (2021) Deep Attentive Multimodal Network Representation Learning for Social Media Images. ACM Transactions on Internet Technology, 21(3), 1-17. DOI: 10.1145/3417295. Online publication date: 16-Jun-2021.
  • (2021) Adversarial Learning With Multi-Modal Attention for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 32(9), 3894-3908. DOI: 10.1109/TNNLS.2020.3016083. Online publication date: Sep-2021.
  • (2021) Adversarial Multimodal Network for Movie Story Question Answering. IEEE Transactions on Multimedia, 23, 1744-1756. DOI: 10.1109/TMM.2020.3002667. Online publication date: 2021.
