Abstract
Visual Question Answering (VQA) is a representative multimodal task with significant development prospects for web applications. To answer a question about a given image, a VQA model must efficiently exploit information from different modalities. Although multimodal fusion methods such as attention mechanisms have contributed significantly to VQA, they attempt to co-learn multimodal features directly, ignoring the large gap between modalities and therefore aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning to learn modality-invariant features, yielding better semantic alignment and higher answer prediction accuracy. Our model achieves 70.81% accuracy on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
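To make the idea of cross-modality adversarial learning concrete, the sketch below shows one common way to realize it (this is not the authors' released code): a gradient reversal layer, as in domain-adversarial training, feeds fused image and question features to a modality discriminator, so the encoders are pushed toward modality-invariant representations. All module names, the 512-dimensional shared space, and the batch size are illustrative assumptions.

```python
# Minimal sketch of cross-modality adversarial learning with a
# gradient reversal layer and a modality discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negated gradient flows back into the feature encoders.
        return -ctx.lambd * grad_output, None

class ModalityDiscriminator(nn.Module):
    """Predicts whether a feature came from the image or the question encoder."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.ReLU(),
            nn.Linear(dim // 2, 2),  # 2 classes: image vs. text
        )

    def forward(self, feats, lambd: float = 1.0):
        return self.net(GradReverse.apply(feats, lambd))

# Usage: visual and question features projected to a shared 512-d space
# (random tensors stand in for real encoder outputs here).
disc = ModalityDiscriminator(512)
img_feat = torch.randn(32, 512)
txt_feat = torch.randn(32, 512)
feats = torch.cat([img_feat, txt_feat], dim=0)
labels = torch.cat([torch.zeros(32), torch.ones(32)]).long()
adv_loss = F.cross_entropy(disc(feats), labels)
adv_loss.backward()
```

Minimizing this adversarial loss trains the discriminator to tell the modalities apart, while the reversed gradient trains the encoders to fool it, driving the two feature distributions together; the actual CMAN architecture additionally couples this objective with modality-invariant attention learning.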
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 62076100), the National Key Research and Development Program of China (standard knowledge graph for epidemic prevention and production recovery intelligent service platform and its applications), the Fundamental Research Funds for the Central Universities, SCUT (No. D2201300, D2210010), the Science and Technology Programs of Guangzhou (201902010046), and the Science and Technology Planning Project of Guangdong Province (No. 2020B0101100002).