short-paper

BUDDI Table Factory: A toolbox for generating synthetic documents with annotated tables and cells

Authors:
Bharath Sripathy, BUDDI AI, India (ORCID 0000-0002-3094-7954)
Harinath Krishnamoorthy, BUDDI AI, India (ORCID 0000-0002-9095-6529)
Sudarsun Santhiappan, BUDDI AI, India (ORCID 0000-0001-5769-2405)

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD), January 2023, Pages 218–222. https://doi.org/10.1145/3570991.3571037

Published: 04 January 2023

ABSTRACT

Tables are the most convenient way to represent structured information in a document. Understanding the table structure is critical to understanding its contents. Several deep learning-based approaches from the literature have shown promising results in understanding table structures, but they require large amounts of annotated data. However, annotated datasets for training these methods are expensive and laborious to produce, and their availability is very limited. Moreover, human-annotated data suffers from inconsistencies in table and cell annotations. We propose BUDDI Table Factory (BTF) for synthetically generating annotated documents with a wide range of variations in table structures. BTF uses a heuristics-based method to generate a variety of table structures, from which it generates synthetic documents using LaTeX. It then applies a computer vision-based approach to localize table and cell regions and automatically generate annotations in the PASCAL VOC challenge format. We empirically show that adding synthetic BTF documents to a limited set of original documents during model training can significantly improve TEDS and IoU performance on table structure recognition tasks for public and real-world healthcare datasets.
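
To make the annotation format and the IoU metric mentioned above concrete, the following is a minimal sketch and not the BTF implementation. It writes table and cell bounding boxes as a PASCAL VOC-style XML annotation and computes Intersection over Union (IoU) between two boxes. The label names `table` and `cell`, the file name `page_001.png`, the fixed image depth of 3, and the function names `voc_annotation` and `iou` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def voc_annotation(filename, width, height, boxes):
    """Return a PASCAL VOC-style XML string.

    boxes is a list of (label, xmin, ymin, xmax, ymax) tuples in pixels.
    The labels used here ("table", "cell") are illustrative assumptions.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"  # assumed RGB page image
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"),
                              (xmin, ymin, xmax, ymax)):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


if __name__ == "__main__":
    # Hypothetical page with one table containing one annotated cell.
    boxes = [("table", 50, 100, 550, 400), ("cell", 50, 100, 200, 140)]
    print(voc_annotation("page_001.png", 612, 792, boxes))
    # Compare a predicted cell box against the annotated one.
    print(iou((50, 100, 200, 140), (60, 100, 210, 140)))
```

In a pipeline like the one the abstract describes, the boxes would come from the table and cell localization step rather than being written by hand, and IoU would be computed between predicted and annotated regions.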

Index Terms

BUDDI Table Factory: A toolbox for generating synthetic documents with annotated tables and cells
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Annotation
      2. Document scripting languages
2. Computing methodologies

Published in

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN: 9781450397971
DOI: 10.1145/3570991

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Automated Annotation Extraction
Cell Detection
Computer Vision
Deep Learning
Synthetic Document Generation
Table Detection
Table Structure Recognition
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 197 of 680 submissions, 29%