DOI: 10.1145/3474085.3475255
Research article

Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer

Published: 17 October 2021

Abstract

Diagram question answering (DQA) is an effective way to evaluate the reasoning ability required for diagram semantic understanding, a challenging task that remains largely understudied compared with question answering over natural images. Existing two-stage methods for DQA, which treat structural parsing and answering separately, are limited by ineffective feedback mechanisms. To address this problem, we propose a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model for diagram question answering, built on a multi-modal transformer framework. In the proposed multi-task learning paradigm, the two tasks of diagram structural parsing and question answering lie at different semantic levels and are equipped with different transformer blocks, which constitutes a hierarchical architecture. The structural parsing module encodes the constituents in a diagram and the relationships among them, while the diagram question answering module decodes these structural signals and combines them with question-answer pairs to infer the correct answer. Visual diagrams and textual question-answer pairs interact within the multi-modal transformer, enabling cross-modal semantic comprehension and reasoning. Extensive experiments on the benchmark AI2D and FOODWEBS datasets demonstrate the effectiveness of the proposed HMTL over other state-of-the-art methods.
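
The abstract describes two transformer blocks at different semantic levels trained under one multi-task objective. The sketch below is only an illustration of that idea in PyTorch; the class name, dimensions, mean pooling, and fusion by concatenation are assumptions made for clarity, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalDQA(nn.Module):
    """Illustrative two-level multi-task model: parsing block feeds a QA block."""

    def __init__(self, d_model=256, n_heads=4, n_relations=10):
        super().__init__()
        # Lower-level block: structural parsing over diagram constituents.
        parse_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.parse_block = nn.TransformerEncoder(parse_layer, num_layers=2)
        self.relation_head = nn.Linear(d_model, n_relations)  # auxiliary parsing task

        # Higher-level block: cross-modal QA over structural signals + QA tokens.
        qa_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.qa_block = nn.TransformerEncoder(qa_layer, num_layers=2)
        self.answer_head = nn.Linear(d_model, 1)  # score for one candidate answer

    def forward(self, diagram_feats, qa_feats):
        # diagram_feats: (B, N_d, d_model) embeddings of diagram constituents
        # qa_feats:      (B, N_q, d_model) embeddings of question + one candidate answer
        structure = self.parse_block(diagram_feats)        # structural signals
        relation_logits = self.relation_head(structure)    # supervision for parsing
        fused = self.qa_block(torch.cat([structure, qa_feats], dim=1))
        score = self.answer_head(fused.mean(dim=1))        # (B, 1) candidate score
        return relation_logits, score
```

In such a setup, training would combine a question-answering loss over the candidate scores with the auxiliary relation-classification loss from the parsing block, so structural supervision feeds back into answering rather than remaining a separate first stage.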

Supplementary Material

MP4 File (mm21-fp0510.mp4)
Presentation video for the ACM MM 2021 oral paper "Hierarchical Multi-Task Learning for Diagram Question Answering with Multi-Modal Transformer." The paper proposes a multi-modal transformer-based hierarchical multi-task learning model for the diagram question answering task. In the proposed multi-task learning paradigm, the two tasks of diagram structural parsing and question answering lie at different semantic levels and are equipped with different transformer blocks. Experiments on AI2D and FOODWEBS show the effectiveness of the method.




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. diagram question answering
  2. multi-modal
  3. multi-task learning
  4. transformer

Qualifiers

  • Research-article


Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 50
  • Downloads (last 6 weeks): 6
Reflects downloads up to 16 Feb 2025


Cited By

  • (2024) Alignment Relation is What You Need for Diagram Parsing. IEEE Transactions on Image Processing, 33, 2131-2144. DOI: 10.1109/TIP.2024.3374511. Online publication date: 2024.
  • (2024) CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13969-13979. DOI: 10.1109/CVPR52733.2024.01325. Online publication date: 16-Jun-2024.
  • (2023) GVQA: Learning to Answer Questions about Graphs with Visualizations via Knowledge Base. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1-16. DOI: 10.1145/3544548.3581067. Online publication date: 19-Apr-2023.
  • (2023) DisAVR: Disentangled Adaptive Visual Reasoning Network for Diagram Question Answering. IEEE Transactions on Image Processing, 32, 4812-4827. DOI: 10.1109/TIP.2023.3306910. Online publication date: 1-Jan-2023.
  • (2023) Diagram Perception Networks for Textbook Question Answering via Joint Optimization. International Journal of Computer Vision, 132(5), 1578-1591. DOI: 10.1007/s11263-023-01954-z. Online publication date: 30-Nov-2023.
