DOI:10.1145/1815330.1815339
research-article

Table detection in heterogeneous documents

Published: 09 June 2010

Abstract

Detecting tables in document images is important for two reasons: tables often carry key information, and most layout analysis methods fail when a table is present in the document image. Existing approaches to table detection mainly focus on tables in single columns of text and do not work reliably on documents with varying layouts. This paper presents a practical algorithm for table detection that works with high accuracy on documents with varying layouts (company reports, newspaper articles, magazine pages, ...). An open source implementation of the algorithm is provided as part of the Tesseract OCR engine. Evaluation of the algorithm on document images from the publicly available UNLV dataset shows competitive performance in comparison to the table detection module of a commercial OCR system.



Information

Published In

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
June 2010
490 pages
ISBN:9781605587738
DOI:10.1145/1815330
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document analysis
  2. page segmentation
  3. table detection

Qualifiers

  • Research-article

Conference

DAS '10

Article Metrics

  • Downloads (Last 12 months): 33
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 02 Mar 2025

Cited By

  • (2024) Deep Learning for Table Detection and Structure Recognition: A Survey. ACM Computing Surveys 56:12 (1-41). DOI:10.1145/3657281. Online publication date: 10-Apr-2024
  • (2024) An Overview of Data Extraction From Invoices. IEEE Access 12 (19872-19886). DOI:10.1109/ACCESS.2024.3360528. Online publication date: 2024
  • (2024) Robust page object detection network for heterogeneous document images. International Journal on Document Analysis and Recognition (IJDAR). DOI:10.1007/s10032-024-00498-3. Online publication date: 16-Aug-2024
  • (2023) A Two-stage Approach for Tables Extraction in Invoices. 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI) (10-15). DOI:10.1109/ICTAI59109.2023.00010. Online publication date: 6-Nov-2023
  • (2023) Document Region Segmentation. Document Layout Analysis (31-42). DOI:10.1007/978-981-99-4277-0_3. Online publication date: 1-Aug-2023
  • (2023) Tabular Data Extraction From Documents. Proceedings of International Conference on Recent Trends in Computing (429-439). DOI:10.1007/978-981-19-8825-7_37. Online publication date: 21-Mar-2023
  • (2022) Toward Semi-Supervised Graphical Object Detection in Document Images. Future Internet 14:6 (176). DOI:10.3390/fi14060176. Online publication date: 8-Jun-2022
  • (2022) Document image analysis and recognition: a survey. Computer Optics 46:4 (567-589). DOI:10.18287/2412-6179-CO-1020. Online publication date: Aug-2022
  • (2022) Semiautomated Generation of Logic Rules for Tabular Information in Building Codes to Support Automated Code Compliance Checking. Journal of Computing in Civil Engineering 36:1. DOI:10.1061/(ASCE)CP.1943-5487.0001000. Online publication date: Jan-2022
  • (2022) Automatic recognition system for document digitization in nuclear power plants. Nuclear Engineering and Design 398 (111975). DOI:10.1016/j.nucengdes.2022.111975. Online publication date: Nov-2022
