DOI: 10.1145/3474085.3475255
Research article

Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer

Published: 17 October 2021

Abstract

Diagram question answering (DQA) is an effective way to evaluate the reasoning ability required for diagram semantic understanding, a challenging task that remains largely understudied compared with question answering over natural images. Existing two-stage methods for DQA, which treat structural parsing and answering separately, are limited by ineffective feedback mechanisms. To address this problem, we propose a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model for diagram question answering, built on a multi-modal transformer framework. In the proposed multi-task learning paradigm, the two tasks of diagram structural parsing and question answering lie at different semantic levels and are equipped with different transformer blocks, which constitutes a hierarchical architecture. The structural parsing module encodes the constituents in a diagram and the relationships among them, while the diagram question answering module decodes these structural signals and combines them with question-answer pairs to infer the correct answer. Visual diagrams and textual question-answer pairs interact within the multi-modal transformer, enabling cross-modal semantic comprehension and reasoning. Extensive experiments on the benchmark AI2D and FOODWEBS datasets demonstrate the effectiveness of the proposed HMTL over other state-of-the-art methods.
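
The abstract describes two transformer blocks at different semantic levels trained under one multi-task objective. The sketch below is only an illustration of that idea in PyTorch; the class name, dimensions, mean pooling, and fusion by concatenation are assumptions made for clarity, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalDQA(nn.Module):
    """Illustrative two-level multi-task model: parsing block feeds a QA block."""

    def __init__(self, d_model=256, n_heads=4, n_relations=10):
        super().__init__()
        # Lower-level block: structural parsing over diagram constituents.
        parse_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.parse_block = nn.TransformerEncoder(parse_layer, num_layers=2)
        self.relation_head = nn.Linear(d_model, n_relations)  # auxiliary parsing task

        # Higher-level block: cross-modal QA over structural signals + QA tokens.
        qa_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.qa_block = nn.TransformerEncoder(qa_layer, num_layers=2)
        self.answer_head = nn.Linear(d_model, 1)  # score for one candidate answer

    def forward(self, diagram_feats, qa_feats):
        # diagram_feats: (B, N_d, d_model) embeddings of diagram constituents
        # qa_feats:      (B, N_q, d_model) embeddings of question + one candidate answer
        structure = self.parse_block(diagram_feats)        # structural signals
        relation_logits = self.relation_head(structure)    # supervision for parsing
        fused = self.qa_block(torch.cat([structure, qa_feats], dim=1))
        score = self.answer_head(fused.mean(dim=1))        # (B, 1) candidate score
        return relation_logits, score
```

In such a setup, training would combine a question-answering loss over the candidate scores with the auxiliary relation-classification loss from the parsing block, so structural supervision feeds back into answering rather than remaining a separate first stage.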

Supplementary Material

MP4 File (mm21-fp0510.mp4)
Presentation video for the ACM MM 2021 oral paper "Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer." The paper proposes a multi-modal transformer-based hierarchical multi-task learning model for the diagram question answering task. In the proposed multi-task learning paradigm, the two tasks of diagram structural parsing and question answering lie at different semantic levels and are equipped with different transformer blocks. Experiments on AI2D and FOODWEBS show the effectiveness of the method.




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. diagram question answering
  2. multi-modal
  3. multi-task learning
  4. transformer

Qualifiers

  • Research-article


Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 50
  • Downloads (last 6 weeks): 6
Reflects downloads up to 16 Feb 2025


Cited By

  • (2024) Alignment Relation is What You Need for Diagram Parsing. IEEE Transactions on Image Processing, 33, 2131-2144. DOI: 10.1109/TIP.2024.3374511. Online publication date: 2024.
  • (2024) CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13969-13979. DOI: 10.1109/CVPR52733.2024.01325. Online publication date: 16-Jun-2024.
  • (2023) GVQA: Learning to Answer Questions about Graphs with Visualizations via Knowledge Base. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-16. DOI: 10.1145/3544548.3581067. Online publication date: 19-Apr-2023.
  • (2023) DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering. IEEE Transactions on Image Processing, 32, 4812-4827. DOI: 10.1109/TIP.2023.3306910. Online publication date: 1-Jan-2023.
  • (2023) Diagram Perception Networks for Textbook Question Answering via Joint Optimization. International Journal of Computer Vision, 132(5), 1578-1591. DOI: 10.1007/s11263-023-01954-z. Online publication date: 30-Nov-2023.
