DOI: 10.1145/2908446.2908473
research-article

A Divide-and-Merge Approach for Deep Segmentation of Document Tables

Published: 09 May 2016

ABSTRACT

Document tables are a rich source of implicit semantics hidden in documents, which could be exploited for better searching and ranking. A central problem in document table processing systems, however, is the segmentation of impure and incomplete table segments. Many existing methods handle only basic table segments with limited layout complexity. They stop at very low-level segmentation and therefore miss many physical and logical table structures, which produces erroneous results even though these structures have a significant impact on overall table interpretation. To make table information fully reusable, a deep segmentation with a layout-independent representation of tables is needed. This paper proposes an algorithm that performs divide and merge tasks in different phases of segmentation, using a matrix as an intermediate model. The algorithm covers complex table layouts and extracts the table schema, the data, and the set of reading paths, which are then represented in a layout-independent notation. We evaluated our results using standard measures and open data sets.
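The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of using a matrix as an intermediate model, not the authors' method. It assumes a hypothetical input format in which each cell is a tuple `(text, row, col, rowspan, colspan)`; the function names `to_matrix` and `reading_paths` and the sample table are illustrative assumptions. Spanning cells are first divided across the grid slots they cover, and the header labels above each data cell are then merged into one reading path.

```python
from itertools import product


def to_matrix(cells, n_rows, n_cols):
    """Divide step (illustrative): copy each spanning cell into every grid slot it covers."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for text, row, col, rowspan, colspan in cells:
        for r, c in product(range(row, row + rowspan), range(col, col + colspan)):
            grid[r][c] = text
    return grid


def reading_paths(grid, header_rows):
    """Merge step (illustrative): join the header labels above a data cell into one path.

    Assumes the first column holds row headers and the first `header_rows`
    rows hold column headers; both are assumptions of this sketch.
    """
    paths = []
    for r in range(header_rows, len(grid)):
        for c in range(1, len(grid[r])):
            labels = []
            for h in range(header_rows):
                # A header spanning several rows appears once in the path.
                if grid[h][c] not in labels:
                    labels.append(grid[h][c])
            paths.append((grid[r][0], tuple(labels), grid[r][c]))
    return paths


# Hypothetical table: "Sales" spans two columns, "Year" spans two rows.
cells = [
    ("Year", 0, 0, 2, 1),
    ("Sales", 0, 1, 1, 2),
    ("Q1", 1, 1, 1, 1),
    ("Q2", 1, 2, 1, 1),
    ("2015", 2, 0, 1, 1),
    ("10", 2, 1, 1, 1),
    ("12", 2, 2, 1, 1),
]
grid = to_matrix(cells, n_rows=3, n_cols=3)
paths = reading_paths(grid, header_rows=2)
```

Each resulting path pairs a row header and a chain of column headers with one data value, which is independent of how the table was laid out on the page.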


  • Published in

    INFOS '16: Proceedings of the 10th International Conference on Informatics and Systems
    May 2016
    347 pages
    ISBN:9781450340625
    DOI:10.1145/2908446

    Copyright © 2016 ACM


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited
