DOI: 10.1145/1815330.1815341
research-article

Analysis and taxonomy of column header categories for web tables

Published: 09 June 2010

Abstract

We describe a component of a document analysis system for constructing ontologies for domain-specific web tables imported into Excel. This component automates extraction of the Wang notation for the column header of a table. Using column-header-specific rules for XY cutting, we convert the geometric structure of the column header to a linear string denoting cell attributes and directions of cuts. The string representation is parsed by a context-free grammar, and the parse tree is further processed to produce an abstract data-type representation (the Wang notation tree) of each column category. Experiments were carried out to evaluate this scheme on the original and edited column headers of Excel tables drawn from a collection of 200 used in our earlier work. The edited headers were obtained by editing the original column headers to conform to the format targeted by our grammar. Forty-four original headers and their reformatted versions were submitted as input to our software system. Our grammar was able to parse, and extract the Wang notation tree for, all the edited headers, but only four of the original headers. We suggest extensions to our table grammar that would enable processing a larger fraction of headers without manual editing.
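
The abstract gives no details of the cutting rules, the string encoding, or the grammar, so the following is only a minimal Python sketch of the general idea of recursive XY cutting on a column header. It is not the authors' implementation. The `Cell` type, the `xy_cut` function, and the example header are hypothetical, and the sketch builds a category tree directly rather than going through an intermediate string and a context-free grammar as the paper does. It assumes a well-formed header in which any region that cannot be cut vertically has a single top cell spanning its full width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A header cell on a grid (hypothetical representation)."""
    text: str
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


def xy_cut(cells):
    """Return a list of (label, children) trees for the given header cells."""
    if not cells:
        return []
    left = min(c.col for c in cells)
    right = max(c.col + c.colspan for c in cells)
    # A vertical cut is possible at a column boundary that no cell crosses.
    cuts = [x for x in range(left + 1, right)
            if all(c.col >= x or c.col + c.colspan <= x for c in cells)]
    if cuts:
        bounds = [left, *cuts, right]
        trees = []
        for lo, hi in zip(bounds, bounds[1:]):
            trees += xy_cut([c for c in cells if lo <= c.col < hi])
        return trees
    # No vertical cut: cut horizontally below the topmost cell, which is
    # assumed to span the whole region and to label everything beneath it.
    top = min(cells, key=lambda c: c.row)
    rest = [c for c in cells if c is not top]
    return [(top.text, xy_cut(rest))]


# Hypothetical two-row header: "Year" spans both rows,
# "Sales" spans two columns above "Domestic" and "Export".
header = [
    Cell("Year", row=0, col=0, rowspan=2),
    Cell("Sales", row=0, col=1, colspan=2),
    Cell("Domestic", row=1, col=1),
    Cell("Export", row=1, col=2),
]
tree = xy_cut(header)
print(tree)
# [('Year', []), ('Sales', [('Domestic', []), ('Export', [])])]
```

Headers that violate the stated assumption (for example, a region whose top row is split but whose lower rows merge across the split) are exactly the kind of input that a fixed rule set would fail on without manual editing.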

    Published In

    DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
    June 2010
    490 pages
    ISBN:9781605587738
    DOI:10.1145/1815330
    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Wang notation
    2. column-header grammar
    3. conversion
    4. parsing
    5. table ontology
    6. web tables

    Qualifiers

    • Research-article

    Conference

    DAS '10


    Cited By

    • (2023) Aligning Benchmark Datasets for Table Structure Recognition. Document Analysis and Recognition - ICDAR 2023, 371-386. DOI: 10.1007/978-3-031-41734-4_23. Online publication date: 19-Aug-2023
    • (2022) Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery, 13:1. DOI: 10.1002/widm.1482. Online publication date: 21-Nov-2022
    • (2016) A Divide-and-Merge Approach for Deep Segmentation of Document Tables. Proceedings of the 10th International Conference on Informatics and Systems, 43-49. DOI: 10.1145/2908446.2908473. Online publication date: 9-May-2016
    • (2016) Automated Table Understanding Using Stub Patterns. Database Systems for Advanced Applications, 533-548. DOI: 10.1007/978-3-319-32025-0_33. Online publication date: 25-Mar-2016
    • (2015) TEXUS. Proceedings of the 2015 ACM Symposium on Document Engineering, 25-34. DOI: 10.1145/2682571.2797069. Online publication date: 8-Sep-2015
    • (2014) Transforming Web Tables to a Relational Database. Proceedings of the 2014 22nd International Conference on Pattern Recognition, 2781-2786. DOI: 10.1109/ICPR.2014.479. Online publication date: 24-Aug-2014
    • (2014) Recognition of Tables and Forms. Handbook of Document Image Processing and Recognition, 647-677. DOI: 10.1007/978-0-85729-859-1_20. Online publication date: 30-Apr-2014
    • (2012) Automatic transformation of multi-dimensional web tables into data cubes. Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery, 81-92. DOI: 10.1007/978-3-642-32584-7_7. Online publication date: 3-Sep-2012
    • (2010) Interactive Conversion of Web Tables. Graphics Recognition. Achievements, Challenges, and Evolution, 25-36. DOI: 10.1007/978-3-642-13728-0_3. Online publication date: 2010
    • (2009) Interactive conversion of web tables. Proceedings of the 8th international conference on Graphics recognition: achievements, challenges, and evolution, 25-36. DOI: 10.5555/1875532.1875535. Online publication date: 22-Jul-2009