Abstract
Visual Question Answering (VQA) is a representative multimodal task with significant development prospects for web applications. To answer a question about a given image, a VQA model must efficiently exploit information from different modalities. Although multimodal fusion methods such as attention mechanisms have contributed significantly to VQA, they attempt to co-learn multimodal features directly, ignoring the large gap between modalities and therefore aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning to learn modality-invariant features, yielding better semantic alignment and higher answer prediction accuracy. Our model achieves 70.81% accuracy on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
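To make the idea of cross-modality adversarial learning concrete, the sketch below shows one common way to realize it (this is not the authors' released code): a gradient reversal layer, as in domain-adversarial training, feeds fused image and question features to a modality discriminator, so the encoders are pushed toward modality-invariant representations. All module names, the 512-dimensional shared space, and the batch size are illustrative assumptions.

```python
# Minimal sketch of cross-modality adversarial learning with a
# gradient reversal layer and a modality discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negated gradient flows back into the feature encoders.
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Predicts whether a feature came from the image or the question encoder."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(),
            nn.Linear(dim // 2, 2),  # 2 classes: image vs. text
        )

    def forward(self, feats, lambd: float = 1.0):
        return self.net(GradReverse.apply(feats, lambd))

# Usage: visual and question features projected to a shared 512-d space
# (random tensors stand in for real encoder outputs here).
disc = ModalityDiscriminator(512)
img_feat = torch.randn(32, 512)
txt_feat = torch.randn(32, 512)
feats = torch.cat([img_feat, txt_feat], dim=0)
labels = torch.cat([torch.zeros(32), torch.ones(32)]).long()
adv_loss = F.cross_entropy(disc(feats), labels)
adv_loss.backward()
```

Minimizing this adversarial loss trains the discriminator to tell the modalities apart, while the reversed gradient trains the encoders to fool it, driving the two feature distributions together; the actual CMAN architecture additionally couples this objective with modality-invariant attention learning.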
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 62076100), the National Key Research and Development Program of China (standard knowledge graph for epidemic prevention and production recovery intelligent service platform and its applications), the Fundamental Research Funds for the Central Universities, SCUT (No. D2201300, D2210010), the Science and Technology Programs of Guangzhou (201902010046), and the Science and Technology Planning Project of Guangdong Province (No. 2020B0101100002).