DOI: 10.1145/3503161.3548172
Research Article

A Region-based Document VQA

Published: 10 October 2022

ABSTRACT

Practical Document Visual Question Answering (DocVQA) requires not only recognizing and extracting document contents, but also reasoning over them to answer questions. However, previous DocVQA datasets mainly focus on in-line questions, whose answers can be extracted directly after locating keywords in the document, and which therefore require little reasoning. This paper builds a large-scale dataset named Region-based Document VQA (RDVQA), which includes more practical questions for DocVQA. We then propose a novel Reason-over-In-region-Question-answering (ReIQ) model to address the problem. It is a pre-training-based model in which a Spatial-Token Pre-trained Model (STPM) serves as the backbone. Two novel pre-training tasks, Masked Text Box Regression and Shuffled Triplet Reconstruction, are proposed to learn the entailment relationship between text blocks and tokens, and contextual information, respectively. Moreover, a DocVQA State Tracking Module (DocST) is proposed to track the DocVQA state during fine-tuning. Experimental results show that our model significantly improves performance on RDVQA, although RDVQA also indicates that more work remains before DocVQA is practical.
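The abstract does not spell out how the Masked Text Box Regression pre-training task is implemented. For intuition only, below is a minimal sketch of what a masked-box regression objective of this general kind might look like; the module name, tensor shapes, and the Smooth L1 loss choice are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of a masked text box regression objective.
# All names, shapes, and the loss choice are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn

class MaskedTextBoxRegression(nn.Module):
    """Predicts the (masked-out) bounding box of each token from its
    contextual embedding, trained with a Smooth L1 regression loss."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Regress normalized box coordinates (x0, y0, x1, y1).
        self.box_head = nn.Linear(hidden_size, 4)
        self.loss_fn = nn.SmoothL1Loss()

    def forward(self, token_embeddings, target_boxes, box_mask):
        # token_embeddings: (batch, seq_len, hidden) from the backbone
        # target_boxes:     (batch, seq_len, 4) normalized ground-truth boxes
        # box_mask:         (batch, seq_len) True where the box was masked
        pred = self.box_head(token_embeddings)
        # Compute the loss only at positions whose boxes were masked out,
        # so the model must infer layout from textual/spatial context.
        return self.loss_fn(pred[box_mask], target_boxes[box_mask])
```

The intended effect of such an objective is that the backbone learns to tie each token's content to its spatial position, which is consistent with the abstract's stated goal of learning block-token relationships.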


Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
