MVARN: Multi-view Attention Relation Network for Figure Question Answering

Conference paper in: Knowledge Science, Engineering and Management (KSEM 2023)

Abstract

Figure Question Answering (FQA) is an emerging multimodal task that shares similarities with Visual Question Answering (VQA): it aims to answer questions about scientifically designed charts. In this study, we propose a novel model, the Multi-view Attention Relation Network (MVARN), which exploits key image features and multi-view relational reasoning to address this challenge. To enhance the expressive power of the image features, we introduce a Contextual Transformer (CoT) block and perform relational reasoning over both the pixel and channel views of the resulting feature maps. Experimental evaluation on the FigureQA and DVQA datasets demonstrates that MVARN outperforms other state-of-the-art methods and yields balanced results across different classes of questions, which confirms its effectiveness and robustness.
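A minimal sketch can make the multi-view idea concrete. The PyTorch code below treats the H*W pixel vectors and the C channel vectors of a CNN/CoT feature map as two sets of "objects", scores every ordered object pair together with the question embedding (in the style of Santoro et al.'s Relation Networks), and fuses the two views into a yes/no answer as in FigureQA. This is an illustration under assumed shapes and layer sizes, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class MultiViewRelation(nn.Module):
        # Hypothetical module: all dimensions below are assumptions.
        def __init__(self, channels=64, spatial=8 * 8, q_dim=128, hidden=256):
            super().__init__()
            # g_* scores one (object_i, object_j, question) triple per view.
            self.g_pixel = nn.Sequential(
                nn.Linear(2 * channels + q_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            self.g_chan = nn.Sequential(
                nn.Linear(2 * spatial + q_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            # f fuses the two aggregated relation vectors; FigureQA answers
            # are yes/no, hence the assumed two-way output head.
            self.f = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2))

        @staticmethod
        def _relate(objs, q, g):
            # objs: (B, N, D); q: (B, Q). Build all N*N ordered pairs,
            # append the question to each, score with g, sum over pairs.
            B, N, D = objs.shape
            oi = objs.unsqueeze(2).expand(B, N, N, D)
            oj = objs.unsqueeze(1).expand(B, N, N, D)
            qq = q.unsqueeze(1).unsqueeze(1).expand(B, N, N, q.shape[-1])
            return g(torch.cat([oi, oj, qq], dim=-1)).sum(dim=(1, 2))

        def forward(self, feat, q):
            # feat: (B, C, H, W) feature map from the CNN/CoT backbone;
            # q: (B, Q) question encoding (e.g. from an LSTM).
            pixel_objs = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
            chan_objs = feat.flatten(2)                   # (B, C, H*W)
            r_pix = self._relate(pixel_objs, q, self.g_pixel)
            r_chan = self._relate(chan_objs, q, self.g_chan)
            return self.f(torch.cat([r_pix, r_chan], dim=-1))

The same pairwise machinery serves both views: the pixel view relates H*W objects of dimension C, the channel view relates C objects of dimension H*W, and only the input width of the scoring MLP changes.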

References

  1. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)

  2. Kafle, K., Price, B., Cohen, S., Kanan, C.: DVQA: understanding data visualizations via question answering. In: CVPR, pp. 5648–5656 (2018)

  3. Kahou, S.E., et al.: FigureQA: an annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300 (2017)

  4. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)

  5. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 1988–1997 (2017)

  6. Kafle, K., Kanan, C.: Answer-type prediction for visual question answering. In: CVPR, pp. 4976–4984 (2016)

  7. Reddy, R., Ramesh, R., Deshpande, A., Khapra, M.M.: FigureNet: a deep learning model for question answering on scientific plots. In: IJCNN, pp. 1–8 (2019)

  8. Zhu, J., Wu, G., Xue, T., Wu, Q.F.: An affinity-driven relation network for figure question answering. In: ICME, pp. 1–6 (2020)

  9. Chaudhry, R., Shekhar, S., Gupta, U., Maneriker, P., Bansal, P., Joshi, A.: LEAF-QA: locate, encode and attend for figure question answering. In: WACV, pp. 3501–3510 (2020)

  10. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)

  11. Rensink, R.A.: The dynamic representation of scenes. Vis. Cogn. 7(1–3), 17–42 (2000)

Acknowledgments

This work was supported by the Key Project of the National Key R&D Project (No. 2017YFC1703303), the Industry-University-Research Cooperation Project of Fujian Science and Technology Planning (No. 2022H6012), the Industry-University-Research Cooperation Project of Ningde City and Xiamen University (No. 2020C001), and the Natural Science Foundation of Fujian Province of China (Nos. 2021J011169 and 2020J01435).

Author information

Corresponding author

Correspondence to Qingfeng Wu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Y., Wu, Q., Lin, W., Ma, L., Li, Y. (2023). MVARN: Multi-view Attention Relation Network for Figure Question Answering. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, A.M., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science, vol. 14119. Springer, Cham. https://doi.org/10.1007/978-3-031-40289-0_3

  • DOI: https://doi.org/10.1007/978-3-031-40289-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40288-3

  • Online ISBN: 978-3-031-40289-0

  • eBook Packages: Computer Science, Computer Science (R0)
