DOI: 10.1145/3503161.3548172
Research Article

A Region-based Document VQA

Published: 10 October 2022

ABSTRACT

Practical Document Visual Question Answering (DocVQA) requires not only recognizing and extracting document contents, but also reasoning over them to answer questions. However, previous DocVQA datasets mainly focus on in-line questions, whose answers can be extracted directly after locating keywords in the document, and which therefore require little reasoning. This paper builds a large-scale dataset named Region-based Document VQA (RDVQA), which includes more practical questions for DocVQA. We then propose a novel Reason-over-In-region-Question-answering (ReIQ) model to address the problem. It is a pre-training-based model in which a Spatial-Token Pre-trained Model (STPM) serves as the backbone. Two novel pre-training tasks, Masked Text Box Regression and Shuffled Triplet Reconstruction, are proposed to learn the entailment relationship between text blocks and tokens, and contextual information, respectively. Moreover, a DocVQA State Tracking Module (DocST) is proposed to track the DocVQA state during fine-tuning. Experimental results show that our model significantly improves performance on RDVQA, although RDVQA also indicates that more work remains before DocVQA is practical.
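The abstract does not spell out how the Masked Text Box Regression pre-training task is implemented. For intuition only, below is a minimal sketch of what a masked-box regression objective of this general kind might look like; the module name, tensor shapes, and the Smooth L1 loss choice are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of a masked text box regression objective.
# All names, shapes, and the loss choice are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn

class MaskedTextBoxRegression(nn.Module):
    """Predicts the (masked-out) bounding box of each token from its
    contextual embedding, trained with a Smooth L1 regression loss."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Regress normalized box coordinates (x0, y0, x1, y1).
        self.box_head = nn.Linear(hidden_size, 4)
        self.loss_fn = nn.SmoothL1Loss()

    def forward(self, token_embeddings, target_boxes, box_mask):
        # token_embeddings: (batch, seq_len, hidden) from the backbone
        # target_boxes:     (batch, seq_len, 4) normalized ground-truth boxes
        # box_mask:         (batch, seq_len) True where the box was masked
        pred = self.box_head(token_embeddings)
        # Compute the loss only at positions whose boxes were masked out,
        # so the model must infer layout from textual/spatial context.
        return self.loss_fn(pred[box_mask], target_boxes[box_mask])
```

The intended effect of such an objective is that the backbone learns to tie each token's content to its spatial position, which is consistent with the abstract's stated goal of learning block-token relationships.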


Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022, 7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
