MVARN: Multi-view Attention Relation Network for Figure Question Answering

Conference paper in: Knowledge Science, Engineering and Management (KSEM 2023)

Abstract

Figure Question Answering (FQA) is an emerging multimodal task that shares similarities with Visual Question Answering (VQA): it aims to answer questions about scientifically designed charts. In this study, we propose a novel model, the Multi-view Attention Relation Network (MVARN), which exploits key image features and multi-view relational reasoning to address this challenge. To enhance the expressive power of the image features, we introduce a Contextual Transformer (CoT) block and perform relational reasoning over both the pixel and channel views of the resulting feature maps. Experimental evaluation on the FigureQA and DVQA datasets demonstrates that MVARN outperforms other state-of-the-art methods and yields balanced results across different classes of questions, which confirms its effectiveness and robustness.
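A minimal sketch can make the multi-view idea concrete. The PyTorch code below treats the H*W pixel vectors and the C channel vectors of a CNN/CoT feature map as two sets of "objects", scores every ordered object pair together with the question embedding (in the style of Santoro et al.'s Relation Networks), and fuses the two views into a yes/no answer as in FigureQA. This is an illustration under assumed shapes and layer sizes, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class MultiViewRelation(nn.Module):
        # Hypothetical module: all dimensions below are assumptions.
        def __init__(self, channels=64, spatial=8 * 8, q_dim=128, hidden=256):
            super().__init__()
            # g_* scores one (object_i, object_j, question) triple per view.
            self.g_pixel = nn.Sequential(
                nn.Linear(2 * channels + q_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            self.g_chan = nn.Sequential(
                nn.Linear(2 * spatial + q_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            # f fuses the two aggregated relation vectors; FigureQA answers
            # are yes/no, hence the assumed two-way output head.
            self.f = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2))

        @staticmethod
        def _relate(objs, q, g):
            # objs: (B, N, D); q: (B, Q). Build all N*N ordered pairs,
            # append the question to each, score with g, sum over pairs.
            B, N, D = objs.shape
            oi = objs.unsqueeze(2).expand(B, N, N, D)
            oj = objs.unsqueeze(1).expand(B, N, N, D)
            qq = q.unsqueeze(1).unsqueeze(1).expand(B, N, N, q.shape[-1])
            return g(torch.cat([oi, oj, qq], dim=-1)).sum(dim=(1, 2))

        def forward(self, feat, q):
            # feat: (B, C, H, W) feature map from the CNN/CoT backbone;
            # q: (B, Q) question encoding (e.g. from an LSTM).
            pixel_objs = feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
            chan_objs = feat.flatten(2)                   # (B, C, H*W)
            r_pix = self._relate(pixel_objs, q, self.g_pixel)
            r_chan = self._relate(chan_objs, q, self.g_chan)
            return self.f(torch.cat([r_pix, r_chan], dim=-1))

The same pairwise machinery serves both views: the pixel view relates H*W objects of dimension C, the channel view relates C objects of dimension H*W, and only the input width of the scoring MLP changes.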

References

  1. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)

  2. Kafle, K., Price, B., Cohen, S., Kanan, C.: DVQA: understanding data visualizations via question answering. In: CVPR, pp. 5648–5656 (2018)

  3. Kahou, S.E., et al.: FigureQA: an annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300 (2017)

  4. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)

  5. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 1988–1997 (2017)

  6. Kafle, K., Kanan, C.: Answer-type prediction for visual question answering. In: CVPR, pp. 4976–4984 (2016)

  7. Reddy, R., Ramesh, R., Deshpande, A., Khapra, M.M.: FigureNet: a deep learning model for question answering on scientific plots. In: IJCNN, pp. 1–8 (2019)

  8. Zhu, J., Wu, G., Xue, T., Wu, Q.F.: An affinity-driven relation network for figure question answering. In: ICME, pp. 1–6 (2020)

  9. Chaudhry, R., Shekhar, S., Gupta, U., Maneriker, P., Bansal, P., Joshi, A.: LEAF-QA: locate, encode and attend for figure question answering. In: WACV, pp. 3501–3510 (2020)

  10. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)

  11. Rensink, R.A.: The dynamic representation of scenes. Vis. Cogn. 7(1–3), 17–42 (2000)

Acknowledgments

This work was supported by the Key Project of the National Key R&D Project (No. 2017YFC1703303), the Industry-University-Research Cooperation Project of Fujian Science and Technology Planning (No. 2022H6012), the Industry-University-Research Cooperation Project of Ningde City and Xiamen University (No. 2020C001), and the Natural Science Foundation of Fujian Province of China (Nos. 2021J011169 and 2020J01435).

Author information

Corresponding author

Correspondence to Qingfeng Wu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, Y., Wu, Q., Lin, W., Ma, L., Li, Y. (2023). MVARN: Multi-view Attention Relation Network for Figure Question Answering. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, A.M., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science, vol. 14119. Springer, Cham. https://doi.org/10.1007/978-3-031-40289-0_3

  • DOI: https://doi.org/10.1007/978-3-031-40289-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40288-3

  • Online ISBN: 978-3-031-40289-0

  • eBook Packages: Computer Science, Computer Science (R0)
