Abstract
Figure Question Answering (FQA) is an emerging multimodal task closely related to Visual Question Answering (VQA): it aims to answer questions about scientifically styled charts. In this study, we propose a novel model, the Multi-view Attention Relation Network (MVARN), which exploits key image characteristics and multi-view relational reasoning to address this challenge. To strengthen the representational power of the image features, we introduce a Contextual Transformer (CoT) block and perform relational reasoning over both the pixel and channel views. Experiments on the FigureQA and DVQA datasets demonstrate that MVARN outperforms other state-of-the-art methods and yields balanced results across the different question classes, confirming its effectiveness and robustness.
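The paper itself specifies the exact architecture; as a rough illustration of what "relational reasoning over both pixel and channel views" can mean, here is a minimal NumPy sketch. The function names and the self-attention-style affinity matrices are our assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_view_relation(feat):
    # feat: (C, H, W). Relate the H*W pixel vectors (each of dim C)
    # via a pairwise affinity, then aggregate.
    c, h, w = feat.shape
    pixels = feat.reshape(c, h * w).T            # (N, C), N = H*W
    attn = softmax(pixels @ pixels.T, axis=-1)   # (N, N) pixel-pixel affinity
    return (attn @ pixels).T.reshape(c, h, w)    # aggregate, restore shape

def channel_view_relation(feat):
    # feat: (C, H, W). Relate the C channel maps (each of dim H*W).
    c, h, w = feat.shape
    chans = feat.reshape(c, h * w)               # (C, N)
    attn = softmax(chans @ chans.T, axis=-1)     # (C, C) channel-channel affinity
    return (attn @ chans).reshape(c, h, w)

# Toy feature map standing in for a CNN's output on a chart image.
feat = np.random.rand(8, 4, 4).astype(np.float32)
fused = pixel_view_relation(feat) + channel_view_relation(feat)
print(fused.shape)  # -> (8, 4, 4)
```

The two views are complementary: the pixel view captures spatial relations (e.g. between bars and axis labels), while the channel view captures relations between feature types; a real model would learn projection weights rather than use raw dot products as here.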
References
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)
Kafle, K., Price, B., Cohen, S., Kanan, C.: DVQA: understanding data visualizations via question answering. In: CVPR, pp. 5648–5656 (2018)
Kahou, S.E., et al.: FigureQA: an annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300 (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L.: CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 1988–1997 (2017)
Kafle, K., Kanan, C.: Answer-type prediction for visual question answering. In: CVPR, pp. 4976–4984 (2016)
Reddy, R., Ramesh, R., Deshpande, A., Khapra, M.M.: FigureNet: a deep learning model for question answering on scientific plots. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2019)
Zhu, J., Wu, G., Xue, T., Wu, Q.F.: An affinity-driven relation network for figure question answering. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2020)
Chaudhry, R., Shekhar, S., Gupta, U., Maneriker, P., Bansal, P., Joshi, A.: LEAF-QA: locate, encode and attend for figure question answering. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 3501–3510 (2020)
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 20, 1254–1259 (1998)
Rensink, R.A.: The dynamic representation of scenes. Visual Cognition 7 (2000)
Acknowledgments
This work was supported by the Key Project of the National Key R&D Program (No. 2017YFC1703303); the Industry-University-Research Cooperation Project of Fujian Science and Technology Planning (No. 2022H6012); the Industry-University-Research Cooperation Project of Ningde City and Xiamen University (No. 2020C001); and the Natural Science Foundation of Fujian Province of China (No. 2021J011169, No. 2020J01435).
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, Y., Wu, Q., Lin, W., Ma, L., Li, Y. (2023). MVARN: Multi-view Attention Relation Network for Figure Question Answering. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, A.M., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science, vol. 14119. Springer, Cham. https://doi.org/10.1007/978-3-031-40289-0_3
Print ISBN: 978-3-031-40288-3
Online ISBN: 978-3-031-40289-0