DOI: 10.1145/3570991.3571037
short-paper

BUDDI Table Factory: A toolbox for generating synthetic documents with annotated tables and cells

Published: 04 January 2023

ABSTRACT

Tables are the most convenient way to represent structured information in a document. Understanding the table structure is critical to understanding its contents. Several deep learning-based approaches from the literature have shown promising results in understanding table structures, but they require large amounts of annotated data. However, annotated datasets for training these methods are very limited, and creating them is expensive and laborious. Moreover, human-annotated data suffers from inconsistencies in table and cell annotations. We propose BUDDI Table Factory (BTF) for synthetically generating annotated documents with a wide range of variations in table structures. We propose a heuristics-based method to generate a variety of table structures, from which we generate synthetic documents using LaTeX. We propose a computer vision-based approach to localize table and cell regions and automatically generate annotations in the PASCAL VOC challenge format. We empirically show that adding synthetic BTF documents to a limited set of original documents during model training can significantly improve the TEDS and IoU performance of table structure recognition on public and real-world healthcare datasets.
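
To make the annotation format and the IoU metric mentioned in the abstract concrete, here is a minimal Python sketch. It is not BTF's implementation: the function names `voc_annotation` and `iou`, the `table` and `cell` labels, and the example coordinates are all assumptions chosen for illustration, and only a small subset of the PASCAL VOC XML fields is written.

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, boxes):
    """Build a PASCAL VOC style annotation tree (illustrative subset of fields).

    boxes is a list of (label, xmin, ymin, xmax, ymax) tuples in pixels.
    The labels used below ("table", "cell") are assumptions for this sketch.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"),
                              (xmin, ymin, xmax, ymax)):
            ET.SubElement(bndbox, tag).text = str(value)
    return root


def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical page with one table containing two cells.
    tree = voc_annotation(
        "page_0001.png", 1240, 1754,
        [("table", 100, 200, 900, 600),
         ("cell", 100, 200, 500, 400),
         ("cell", 500, 200, 900, 400)],
    )
    print(ET.tostring(tree, encoding="unicode"))
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

Keeping tables and cells as separate `object` entries with distinct labels lets a detector trained on such files be scored per class with the same `iou` function.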

Published in

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN: 9781450397971
DOI: 10.1145/3570991

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• short-paper
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 197 of 680 submissions, 29%