DOI: 10.1145/3539618.3591875
research-article

BizGraphQA: A Dataset for Image-based Inference over Graph-structured Diagrams from Business Domains

Published: 18 July 2023

ABSTRACT

Graph-structured diagrams, such as enterprise ownership charts or management hierarchies, are a challenging medium for deep learning models: they require not only the capacity to model language and spatial relations but also the topology of links between entities and the varying semantics of what those links represent. Question answering models that automatically process and understand such diagrams have vast applications across enterprise domains and can push the state of the art in multimodal document understanding to a new frontier. Curating real-world datasets to train these models is difficult, owing to the scarcity and confidentiality of the documents in which such diagrams appear, while recently released synthetic datasets are often prone to repetitive structures that can be memorized or solved with heuristics. In this paper, we present a collection of 10,000 synthetic graphs that faithfully reflect the properties of real graphs in four business domains and are realistically rendered within PDF documents with varying styles and layouts. In addition, we have generated over 130,000 question instances that target complex graphical relationships specific to each domain. We hope this challenge will encourage the development of models capable of robust reasoning about graph-structured images, which are ubiquitous across business sectors and scientific disciplines.
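To make the underlying task concrete, the following is a minimal, hypothetical sketch of the kind of instance the abstract describes: an enterprise ownership chart modeled as a directed graph, and a multi-hop question about its topology. The entity names, ownership percentages, and question template are illustrative inventions, not drawn from the dataset (where the graph would additionally be rendered as a diagram inside a PDF).

```python
from collections import deque

# Directed ownership edges: parent -> list of (subsidiary, ownership %).
# All names and figures below are made up for illustration.
ownership = {
    "Acme Holdings": [("Acme Finance", 100), ("Acme Retail", 60)],
    "Acme Finance": [("Acme Leasing", 80)],
    "Acme Retail": [],
    "Acme Leasing": [],
}

def subsidiaries(graph, root):
    """Return every entity reachable from `root`, i.e. its direct
    and indirect subsidiaries, via breadth-first traversal."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child, _pct in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# A question targeting multi-hop graph structure, e.g.:
# "Which companies are *indirect* subsidiaries of Acme Holdings?"
direct = {child for child, _ in ownership["Acme Holdings"]}
indirect = subsidiaries(ownership, "Acme Holdings") - direct
print(sorted(indirect))  # -> ['Acme Leasing']
```

Answering such a question from an image requires a model to recover this link topology and edge semantics from rendered pixels before it can perform the traversal that a symbolic solver does trivially.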


Published in

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023, 3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Copyright © 2023 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 792 of 3,983 submissions, 20%