Loading [a11y]/accessibility-menu.js
Adversarial Learning With Multi-Modal Attention for Visual Question Answering | IEEE Journals & Magazine | IEEE Xplore
Scheduled Maintenance: On Tuesday, 25 February, IEEE Xplore will undergo scheduled maintenance from 1:00-5:00 PM ET (1800-2200 UTC). During this time, there may be intermittent impact on performance. We apologize for any inconvenience.

Adversarial Learning With Multi-Modal Attention for Visual Question Answering


Abstract:

Visual question answering (VQA) has been proposed as a challenging task and attracted extensive research attention. It aims to learn a joint representation of the questio...Show More

Abstract:

Visual question answering (VQA) has been proposed as a challenging task and attracted extensive research attention. It aims to learn a joint representation of the question–image pair for answer inference. Most of the existing methods focus on exploring the multi-modal correlation between the question and image to learn the joint representation. However, the answer-related information is not fully captured by these methods, which results that the learned representation is ineffective to reflect the answer of the question. To tackle this problem, we propose a novel model, i.e., adversarial learning with multi-modal attention (ALMA), for VQA. An adversarial learning-based framework is proposed to learn the joint representation to effectively reflect the answer-related information. Specifically, multi-modal attention with the Siamese similarity learning method is designed to build two embedding generators, i.e., question–image embedding and question–answer embedding. Then, adversarial learning is conducted as an interplay between the two embedding generators and an embedding discriminator. The generators have the purpose of generating two modality-invariant representations for the question–image and question–answer pairs, whereas the embedding discriminator aims to discriminate the two representations. Both the multi-modal attention module and the adversarial networks are integrated into an end-to-end unified framework to infer the answer. Experiments performed on three benchmark data sets confirm the favorable performance of ALMA compared with state-of-the-art approaches.
Published in: IEEE Transactions on Neural Networks and Learning Systems ( Volume: 32, Issue: 9, September 2021)
Page(s): 3894 - 3908
Date of Publication: 24 August 2020

ISSN Information:

PubMed ID: 32833656

Funding Agency:


References

References is not available for this document.