ABSTRACT
We propose a PDF document wrapper system specifically targeted at table processing applications. We (i) review the PDF specifications and identify particular challenges from the table processing point of view, and (ii) specify a table-oriented document model containing the atomic elements required for table extraction and understanding applications. Our evaluation showed that the wrapper detected important features such as page columns, bullets, and numbering with over 90% accuracy on all measures, leading to better table locating and segmenting.