Abstract
In recent years, Visual Question Answering (VQA) has attracted considerable research attention due to its numerous real-world applications, and visual attention mechanisms are widely used to assist answer prediction by selecting important regions. Nevertheless, few works consider how a model progressively selects informative regions. To simulate the dynamic reasoning process of human beings, the existing method AiR-M decomposes answer prediction into a sequence of reasoning steps, each consisting of a reasoning operation and a corresponding attention map. However, AiR-M neglects that the number of reasoning steps varies across questions and pads the reasoning step sequence with invalid steps, which introduces inaccurate information into answer prediction and thus limits model performance. In this paper, we propose a Dynamic Alternative Attention model (\(\textrm{DA}^{2}\)) to address this problem. Specifically, \(\textrm{DA}^{2}\) consists of a feature extraction module, \(\textrm{DA}^{2}\)-f, and a training module, \(\textrm{DA}^{2}\)-t. \(\textrm{DA}^{2}\)-f provides the answer prediction process with more accurate visual information by adaptively filtering out the visual regions of invalid steps, while \(\textrm{DA}^{2}\)-t improves model training by masking out the attention maps corresponding to invalid steps in the objective function. Experimental results on the GQA dataset verify the effectiveness of the proposed method.
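To make the two mechanisms concrete, below is a minimal PyTorch sketch of the ideas the abstract describes: zeroing out the visual features of invalid (padded) reasoning steps, and averaging the attention-supervision loss over valid steps only. All tensor shapes, function names, and the squared-error form of the attention loss are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the two ideas in the abstract:
# filtering visual features of invalid reasoning steps (in the spirit of DA^2-f)
# and masking their attention maps in the training loss (in the spirit of DA^2-t).
# All shapes and names below are illustrative assumptions.
import torch

def filter_step_features(step_attn, region_feats, valid_mask):
    """Assumed DA^2-f-style filtering.

    step_attn:    (B, T, R) attention over R regions for T reasoning steps
    region_feats: (B, R, D) visual region features
    valid_mask:   (B, T)    1 for valid steps, 0 for padded/invalid steps
    Returns (B, T, D) per-step visual features with invalid steps zeroed.
    """
    step_feats = torch.bmm(step_attn, region_feats)   # (B, T, D)
    return step_feats * valid_mask.unsqueeze(-1)      # drop invalid steps

def masked_attention_loss(pred_attn, target_attn, valid_mask, eps=1e-8):
    """Assumed DA^2-t-style objective: per-step attention loss averaged
    over valid steps only, so padded steps contribute nothing."""
    per_step = ((pred_attn - target_attn) ** 2).mean(dim=-1)  # (B, T)
    per_step = per_step * valid_mask
    return per_step.sum() / (valid_mask.sum() + eps)

# Toy usage with random tensors (two questions with 3 and 5 valid steps):
B, T, R, D = 2, 5, 36, 512
attn = torch.softmax(torch.randn(B, T, R), dim=-1)
feats = torch.randn(B, R, D)
mask = (torch.arange(T).expand(B, T) < torch.tensor([[3], [5]])).float()
print(filter_step_features(attn, feats, mask).shape)  # torch.Size([2, 5, 512])
print(masked_attention_loss(attn, torch.softmax(torch.randn(B, T, R), -1), mask))
```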
References
Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: overcoming priors for visual question answering. In: CVPR, pp. 4971–4980. Computer Vision Foundation / IEEE Computer Society (2018)
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086. Computer Vision Foundation / IEEE Computer Society (2018)
Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433. IEEE Computer Society (2015)
Ben-younes, H., Cadène, R., Cord, M., Thome, N.: MUTAN: multimodal Tucker fusion for visual question answering. In: ICCV, pp. 2631–2639. IEEE Computer Society (2017)
Bigham, J.P., et al.: VizWiz: nearly real-time answers to visual questions. In: UIST, pp. 333–342. ACM (2010)
Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS, pp. 2787–2795 (2013)
Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. In: ICLR (2014)
Chen, S., Jiang, M., Yang, J., Zhao, Q.: AiR: attention with reasoning capability. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 91–107. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_6
Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. In: EMNLP, pp. 103–111. Association for Computational Linguistics (2014)
Das, A., Agrawal, H., Zitnick, L., Parikh, D., Batra, D.: Human attention in visual question answering: do humans and deep networks look at the same regions? Comput. Vis. Image Underst. 163, 90–100 (2017)
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP, pp. 457–468. The Association for Computational Linguistics (2016)
Gao, P., et al.: Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In: CVPR, pp. 6639–6648. Computer Vision Foundation / IEEE (2019)
Gao, P., You, H., Zhang, Z., Wang, X., Li, H.: Multi-modality latent interaction network for visual question answering. In: ICCV, pp. 5824–5834. IEEE (2019)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR, pp. 6325–6334. IEEE Computer Society (2017)
Gui, L., Wang, B., Huang, Q., Hauptmann, A., Bisk, Y., Gao, J.: KAT: a knowledge augmented transformer for vision-and-language. In: NAACL (2022)
Guo, Q., et al.: Constructing Chinese historical literature knowledge graph based on BERT. In: Xing, C., Fu, X., Zhang, Y., Zhang, G., Borjigin, C. (eds.) WISA 2021. LNCS, vol. 12999, pp. 323–334. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87571-8_28
Haurilet, M., Roitberg, A., Stiefelhagen, R.: It’s not about the journey; it’s about the destination: following soft paths under question-guidance for visual reasoning. In: CVPR, pp. 1930–1939. Computer Vision Foundation / IEEE (2019)
Huang, P., Huang, J., Guo, Y., Qiao, M., Zhu, Y.: Multi-grained attention with object-level grounding for visual question answering. In: ACL, pp. 3595–3600. Association for Computational Linguistics (2019)
Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: CVPR, pp. 6700–6709. Computer Vision Foundation / IEEE (2019)
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 1988–1997. IEEE Computer Society (2017)
Kafle, K., Price, B.L., Cohen, S., Kanan, C.: DVQA: understanding data visualizations via question answering. In: CVPR, pp. 5648–5656. Computer Vision Foundation / IEEE Computer Society (2018)
Kahou, S.E., Michalski, V., Atkinson, A., Kádár, Á., Trischler, A., Bengio, Y.: FigureQA: an annotated figure dataset for visual reasoning. In: ICLR. OpenReview.net (2018)
Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: ICCV, pp. 10312–10321. IEEE (2019)
Lin, X., Parikh, D.: Leveraging visual question answering for image-caption ranking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 261–277. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_17
Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: NIPS, pp. 289–297 (2016)
Patro, B.N., Anupriy, Namboodiri, V.P.: Explanation vs attention: a two-player game to obtain attention for VQA. In: AAAI, pp. 11848–11855. AAAI Press (2020)
Qiao, T., Dong, J., Xu, D.: Exploring human-like attention supervision in visual question answering. In: AAAI, pp. 7300–7307. AAAI Press (2018)
Shi, J., Zhang, H., Li, J.: Explainable and explicit visual reasoning over scene graphs. In: CVPR, pp. 8376–8384. Computer Vision Foundation / IEEE (2019)
Shih, K.J., Singh, S., Hoiem, D.: Where to look: focus regions for visual question answering. In: CVPR, pp. 4613–4621. IEEE Computer Society (2016)
Tang, K., Zhang, H., Wu, B., Luo, W., Liu, W.: Learning to compose dynamic tree structures for visual contexts. In: CVPR, pp. 6619–6628. Computer Vision Foundation / IEEE (2019)
Vo, N., et al.: Composing text and image for image retrieval - an empirical odyssey. In: CVPR (2019)
Wang, P., Wu, Q., Shen, C., Dick, A.R., van den Hengel, A.: Explicit knowledge-based reasoning for visual question answering. In: IJCAI, pp. 1290–1296. ijcai.org (2017)
Wang, P., Wu, Q., Shen, C., Dick, A.R., van den Hengel, A.: FVQA: fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2413–2427 (2018)
Wu, F., Jing, X., Wei, P., Lan, C., Ji, Y., Jiang, G., Huang, Q.: Semi-supervised multi-view graph convolutional networks with application to webpage classification. Inf. Sci. 591, 142–154 (2022)
Wu, J., Hu, Z., Mooney, R.J.: Generating question relevant captions to aid visual question answering. In: ACL, pp. 3585–3594. Association for Computational Linguistics (2019)
Xu, H., Saenko, K.: Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 451–466. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_28
Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29. IEEE Computer Society (2016)
Yu, Z., Yu, J., Fan, J., Tao, D.: Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: ICCV, pp. 1821–1830. IEEE Computer Society (2017)
Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., Parikh, D.: Yin and yang: balancing and answering binary visual questions. In: CVPR, pp. 5014–5022. IEEE Computer Society (2016)
Zhang, Y., Niebles, J.C., Soto, A.: Interpretable visual question answering by visual grounding from attention supervision mining. In: WACV, pp. 349–357. IEEE (2019)
Acknowledgements
This research is supported by the NSFC-Xinjiang Joint Fund (No. U1903128), and the Fundamental Research Funds for the Central Universities (No. 63223046).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, X., Guo, W., Zhang, Y., Zhang, Y. (2022). Dynamic Alternative Attention for Visual Question Answering. In: Zhao, X., Yang, S., Wang, X., Li, J. (eds) Web Information Systems and Applications. WISA 2022. Lecture Notes in Computer Science, vol 13579. Springer, Cham. https://doi.org/10.1007/978-3-031-20309-1_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20308-4
Online ISBN: 978-3-031-20309-1