Poster, DOI: 10.1145/2509558.2509576

Unsupervised discovery and extraction of semi-structured regions in text via self-information

Published: 27 October 2013

Abstract

We describe a general method for identifying and extracting information from semi-structured regions of text embedded within a natural language document. These regions encode information according to ad hoc schemas and visual cues rather than the grammatical and presentational conventions of normal sentential language. Examples include tables, key-value listings, and repeated enumerations of properties. Because they are generally non-sentential, these regions can present problems for standard information extraction algorithms. Unlike previous work in table extraction, which relies on a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of structure types. Our approach to identifying semi-structured regions is unsupervised and is based on scoring unusual regularity inside the document. Because content in semi-structured regions is governed by a schema, the features that occur there, which encompass textual content and visual appearance, are unusual compared to those seen in sentential language. Regularity refers to the repetition of these unusual features, since semi-structured regions commonly encode more than a single row or group of information. To score this, we present a measure based on expected self-information, derived from statistics over patterns of textual categories and visual layout. We describe the results of an initial study assessing the ability of this measure to detect semi-structured text in a corpus culled from the web, and show that it outperforms baseline methods on an average precision measure. We also present initial work that uses these significant patterns to generate extraction rules, and conclude with a discussion of future directions.
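
The idea of scoring "unusual regularity" can be illustrated with a small sketch. This is not the authors' measure: the abstract does not give its definition, so the code below substitutes a simpler stand-in, namely the self-information of a line pattern under a background sample of sentential text, multiplied by how often the pattern repeats in the document. The token categories (NUM, KEY, CAP, WORD, PUNCT), the INDENT layout cue, the add-one smoothing, and the function names `line_pattern`, `self_information`, and `score_patterns` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def line_pattern(line: str) -> str:
    """Map a line to a coarse pattern of token categories plus one layout cue.

    The categories are hypothetical; they only illustrate the notion of
    "patterns of textual categories and visual layout".
    """
    parts = []
    if line.startswith((" ", "\t")):
        parts.append("INDENT")  # crude visual-layout feature
    for tok in line.split():
        if re.fullmatch(r"\d[\d.,]*", tok):
            parts.append("NUM")
        elif not any(c.isalnum() for c in tok):
            parts.append("PUNCT")
        elif tok.endswith(":"):
            parts.append("KEY")
        elif tok[0].isupper():
            parts.append("CAP")
        else:
            parts.append("WORD")
    return " ".join(parts)


def self_information(pattern: str, background: Counter, vocab_size: int) -> float:
    """Return -log2 p(pattern), with add-one smoothing over the background counts."""
    total = sum(background.values())
    p = (background[pattern] + 1) / (total + vocab_size)
    return -math.log2(p)


def score_patterns(doc_lines, background_lines):
    """Score repeated patterns: repetition count times self-information.

    Patterns seen only once are skipped, since regularity requires repetition.
    """
    background = Counter(line_pattern(l) for l in background_lines if l.strip())
    doc = Counter(line_pattern(l) for l in doc_lines if l.strip())
    vocab_size = len(set(background) | set(doc))
    return {
        pattern: count * self_information(pattern, background, vocab_size)
        for pattern, count in doc.items()
        if count > 1
    }


if __name__ == "__main__":
    background = ["The report lists two people.", "It was written last year."]
    document = [
        "The report lists two people.",
        "Name: Alice",
        "Age: 30",
        "Name: Bob",
        "Age: 25",
    ]
    for pattern, score in sorted(score_patterns(document, background).items()):
        print(f"{pattern}: {score:.2f}")
```

In this toy setup the key-value lines share patterns that never occur in the background sentences and that repeat within the document, so they receive scores, while the single sentential line is ignored. A realistic version would need a much larger background sample and richer layout features.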

Cited By

  • (2018) Large-scale predictive modeling and analytics through regression queries in data management systems. International Journal of Data Science and Analytics 9:1 (17-55). DOI: 10.1007/s41060-018-0163-5. Online publication date: 27-Dec-2018

    Published In

    AKBC '13: Proceedings of the 2013 workshop on Automated knowledge base construction
    October 2013
    124 pages
    ISBN:9781450324113
    DOI:10.1145/2509558
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tag

    1. semi-structured information extraction

    Qualifiers

    • Poster

    Conference

    CIKM'13
    Acceptance Rates

    AKBC '13 Paper Acceptance Rate 9 of 19 submissions, 47%;
    Overall Acceptance Rate 9 of 19 submissions, 47%
