Abstract
In recent years, Visual Question Answering (VQA) has attracted considerable research attention due to its numerous real-world applications, and visual attention mechanisms are widely used to assist answer prediction by selecting important regions. Nevertheless, few works consider how a model progressively selects informative regions. To simulate the dynamic reasoning process of human beings, the existing method AiR-M decomposes answer prediction into a sequence of reasoning steps, each consisting of a reasoning operation and a corresponding attention map. However, AiR-M neglects that the number of reasoning steps varies across questions and pads the reasoning step sequence with invalid steps, which introduces inaccurate information into answer prediction and thus limits model performance. In this paper, we propose a Dynamic Alternative Attention model (\(\textrm{DA}^{2}\)) to address this problem. Specifically, \(\textrm{DA}^{2}\) consists of a feature extraction module, \(\textrm{DA}^{2}\)-f, and a training module, \(\textrm{DA}^{2}\)-t. \(\textrm{DA}^{2}\)-f provides the answer prediction process with more accurate visual information by adaptively filtering out the visual regions of invalid steps, while \(\textrm{DA}^{2}\)-t improves model training by masking out the attention maps corresponding to invalid steps in the objective function. Experimental results on the GQA dataset verify the effectiveness of the proposed method.
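To make the two mechanisms concrete, below is a minimal PyTorch sketch of the ideas the abstract describes: zeroing out the visual features of invalid (padded) reasoning steps, and averaging the attention-supervision loss over valid steps only. All tensor shapes, function names, and the squared-error form of the attention loss are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the two ideas in the abstract:
# filtering visual features of invalid reasoning steps (in the spirit of DA^2-f)
# and masking their attention maps in the training loss (in the spirit of DA^2-t).
# All shapes and names below are illustrative assumptions.
import torch

def filter_step_features(step_attn, region_feats, valid_mask):
    """Assumed DA^2-f-style filtering.

    step_attn:    (B, T, R) attention over R regions for T reasoning steps
    region_feats: (B, R, D) visual region features
    valid_mask:   (B, T)    1 for valid steps, 0 for padded/invalid steps
    Returns (B, T, D) per-step visual features with invalid steps zeroed.
    """
    step_feats = torch.bmm(step_attn, region_feats)   # (B, T, D)
    return step_feats * valid_mask.unsqueeze(-1)      # drop invalid steps

def masked_attention_loss(pred_attn, target_attn, valid_mask, eps=1e-8):
    """Assumed DA^2-t-style objective: per-step attention loss averaged
    over valid steps only, so padded steps contribute nothing."""
    per_step = ((pred_attn - target_attn) ** 2).mean(dim=-1)  # (B, T)
    per_step = per_step * valid_mask
    return per_step.sum() / (valid_mask.sum() + eps)

# Toy usage with random tensors (two questions with 3 and 5 valid steps):
B, T, R, D = 2, 5, 36, 512
attn = torch.softmax(torch.randn(B, T, R), dim=-1)
feats = torch.randn(B, R, D)
mask = (torch.arange(T).expand(B, T) < torch.tensor([[3], [5]])).float()
print(filter_step_features(attn, feats, mask).shape)  # torch.Size([2, 5, 512])
print(masked_attention_loss(attn, torch.softmax(torch.randn(B, T, R), -1), mask))
```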
References
Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: overcoming priors for visual question answering. In: CVPR, pp. 4971–4980. Computer Vision Foundation / IEEE Computer Society (2018)
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086. Computer Vision Foundation / IEEE Computer Society (2018)
Antol, S., et al.: VQA: visual question answering. In: ICCV, pp. 2425–2433. IEEE Computer Society (2015)
Ben-younes, H., Cadène, R., Cord, M., Thome, N.: MUTAN: multimodal Tucker fusion for visual question answering. In: ICCV, pp. 2631–2639. IEEE Computer Society (2017)
Bigham, J.P., et al.: VizWiz: nearly real-time answers to visual questions. In: UIST, pp. 333–342. ACM (2010)
Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS, pp. 2787–2795 (2013)
Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. In: ICLR (2014)
Chen, S., Jiang, M., Yang, J., Zhao, Q.: AiR: attention with reasoning capability. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 91–107. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_6
Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: encoder-decoder approaches. In: EMNLP, pp. 103–111. Association for Computational Linguistics (2014)
Das, A., Agrawal, H., Zitnick, L., Parikh, D., Batra, D.: Human attention in visual question answering: do humans and deep networks look at the same regions? Comput. Vis. Image Underst. 163, 90–100 (2017)
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP, pp. 457–468. The Association for Computational Linguistics (2016)
Gao, P., et al.: Dynamic fusion with intra- and inter-modality attention flow for visual question answering. In: CVPR, pp. 6639–6648. Computer Vision Foundation / IEEE (2019)
Gao, P., You, H., Zhang, Z., Wang, X., Li, H.: Multi-modality latent interaction network for visual question answering. In: ICCV, pp. 5824–5834. IEEE (2019)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR, pp. 6325–6334. IEEE Computer Society (2017)
Gui, L., Wang, B., Huang, Q., Hauptmann, A., Bisk, Y., Gao, J.: KAT: a knowledge augmented transformer for vision-and-language. In: NAACL (2022)
Guo, Q., et al.: Constructing Chinese historical literature knowledge graph based on BERT. In: Xing, C., Fu, X., Zhang, Y., Zhang, G., Borjigin, C. (eds.) WISA 2021. LNCS, vol. 12999, pp. 323–334. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87571-8_28
Haurilet, M., Roitberg, A., Stiefelhagen, R.: It’s not about the journey; it’s about the destination: following soft paths under question-guidance for visual reasoning. In: CVPR, pp. 1930–1939. Computer Vision Foundation / IEEE (2019)
Huang, P., Huang, J., Guo, Y., Qiao, M., Zhu, Y.: Multi-grained attention with object-level grounding for visual question answering. In: ACL, pp. 3595–3600. Association for Computational Linguistics (2019)
Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: CVPR, pp. 6700–6709. Computer Vision Foundation / IEEE (2019)
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 1988–1997. IEEE Computer Society (2017)
Kafle, K., Price, B.L., Cohen, S., Kanan, C.: DVQA: understanding data visualizations via question answering. In: CVPR, pp. 5648–5656. Computer Vision Foundation / IEEE Computer Society (2018)
Kahou, S.E., Michalski, V., Atkinson, A., Kádár, Á., Trischler, A., Bengio, Y.: FigureQA: an annotated figure dataset for visual reasoning. In: ICLR. OpenReview.net (2018)
Li, L., Gan, Z., Cheng, Y., Liu, J.: Relation-aware graph attention network for visual question answering. In: ICCV, pp. 10312–10321. IEEE (2019)
Lin, X., Parikh, D.: Leveraging visual question answering for image-caption ranking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 261–277. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_17
Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: NIPS, pp. 289–297 (2016)
Patro, B.N., Anupriy, Namboodiri, V.P.: Explanation vs attention: a two-player game to obtain attention for VQA. In: AAAI, pp. 11848–11855. AAAI Press (2020)
Qiao, T., Dong, J., Xu, D.: Exploring human-like attention supervision in visual question answering. In: AAAI, pp. 7300–7307. AAAI Press (2018)
Shi, J., Zhang, H., Li, J.: Explainable and explicit visual reasoning over scene graphs. In: CVPR, pp. 8376–8384. Computer Vision Foundation / IEEE (2019)
Shih, K.J., Singh, S., Hoiem, D.: Where to look: focus regions for visual question answering. In: CVPR, pp. 4613–4621. IEEE Computer Society (2016)
Tang, K., Zhang, H., Wu, B., Luo, W., Liu, W.: Learning to compose dynamic tree structures for visual contexts. In: CVPR, pp. 6619–6628. Computer Vision Foundation / IEEE (2019)
Vo, N., et al.: Composing text and image for image retrieval - an empirical odyssey. In: CVPR (2019)
Wang, P., Wu, Q., Shen, C., Dick, A.R., van den Hengel, A.: Explicit knowledge-based reasoning for visual question answering. In: IJCAI, pp. 1290–1296. ijcai.org (2017)
Wang, P., Wu, Q., Shen, C., Dick, A.R., van den Hengel, A.: FVQA: fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 40(10), 2413–2427 (2018)
Wu, F., Jing, X., Wei, P., Lan, C., Ji, Y., Jiang, G., Huang, Q.: Semi-supervised multi-view graph convolutional networks with application to webpage classification. Inf. Sci. 591, 142–154 (2022)
Wu, J., Hu, Z., Mooney, R.J.: Generating question relevant captions to aid visual question answering. In: ACL, pp. 3585–3594. Association for Computational Linguistics (2019)
Xu, H., Saenko, K.: Ask, attend and answer: exploring question-guided spatial attention for visual question answering. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 451–466. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_28
Yang, Z., He, X., Gao, J., Deng, L., Smola, A.J.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29. IEEE Computer Society (2016)
Yu, Z., Yu, J., Fan, J., Tao, D.: Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In: ICCV, pp. 1821–1830. IEEE Computer Society (2017)
Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., Parikh, D.: Yin and yang: balancing and answering binary visual questions. In: CVPR, pp. 5014–5022. IEEE Computer Society (2016)
Zhang, Y., Niebles, J.C., Soto, A.: Interpretable visual question answering by visual grounding from attention supervision mining. In: WACV, pp. 349–357. IEEE (2019)
Acknowledgements
This research is supported by the NSFC-Xinjiang Joint Fund (No. U1903128), and the Fundamental Research Funds for the Central Universities (No. 63223046).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, X., Guo, W., Zhang, Y., Zhang, Y. (2022). Dynamic Alternative Attention for Visual Question Answering. In: Zhao, X., Yang, S., Wang, X., Li, J. (eds) Web Information Systems and Applications. WISA 2022. Lecture Notes in Computer Science, vol 13579. Springer, Cham. https://doi.org/10.1007/978-3-031-20309-1_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20308-4
Online ISBN: 978-3-031-20309-1