poster

Unsupervised discovery and extraction of semi-structured regions in text via self-information

Authors:

Eric Yeh,

John Niekrasz,

Dayne Freitag

AKBC '13: Proceedings of the 2013 workshop on Automated knowledge base construction

Pages 103 - 108

https://doi.org/10.1145/2509558.2509576

Published: 27 October 2013


Abstract

We describe a general method for identifying and extracting information from semi-structured regions of text embedded within a natural language document. These regions encode information according to ad hoc schemas and visual cues, instead of using the grammatical and presentational conventions of normal sentential language. Examples include tables, key-value listings, and repeated enumerations of properties. Because of their generally non-sentential nature, these regions can present problems for standard information extraction algorithms. Unlike previous work in table extraction, which relies on a relatively noiseless two-dimensional layout, our aim is to accommodate a wide variety of structure types. Our approach to identifying semi-structured regions is unsupervised and is based on scoring unusual regularity within the document. Because content in semi-structured regions is governed by a schema, the features that occur there, which encompass both textual content and visual appearance, are unusual compared to those seen in sentential language. Regularity refers to the repetition of these unusual features, as semi-structured regions commonly encode more than a single row or group of information. To score this, we present a measure based on expected self-information, derived from statistics over patterns of textual categories and visual layout. We describe the results of an initial study assessing the ability of these measures to detect semi-structured text in a corpus culled from the web, and show that this measure outperforms baseline methods on average precision. We present initial work that uses these significant patterns to generate extraction rules, and conclude with a discussion of future directions.
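The abstract does not give the scoring formula, so the following is only a minimal Python sketch of the general idea of "unusual regularity", not the authors' measure. It assumes a hypothetical token-category scheme (`line_pattern`), an add-one-smoothed background model built from sentential lines (`build_background`), and a simplified window score (`region_score`) that averages the self-information, -log2 P(pattern), of line patterns that repeat within the window. All names, categories, and the smoothing choice are illustrative assumptions.

```python
import math
import re
from collections import Counter


def line_pattern(line):
    """Map a line to a tuple of coarse categories (hypothetical scheme)."""
    cats = []
    for tok in re.findall(r"\s{2,}|\w+|[^\w\s]", line):
        if tok.isspace():
            cats.append("GAP")    # run of 2+ whitespace chars: a layout cue
        elif tok.isdigit():
            cats.append("NUM")
        elif tok[0].isupper():
            cats.append("CAP")
        elif tok[0].isalnum() or tok[0] == "_":
            cats.append("WORD")
        else:
            cats.append(tok)      # keep punctuation such as ':' or '|'
    return tuple(cats)


def build_background(lines):
    """Count line patterns in text assumed to be ordinary sentential prose."""
    return Counter(line_pattern(l) for l in lines if l.strip())


def self_information(pattern, background):
    """-log2 of the add-one-smoothed background probability of a pattern."""
    total = sum(background.values())
    vocab = len(background)
    p = (background[pattern] + 1) / (total + vocab + 1)
    return -math.log2(p)


def region_score(lines, background):
    """Average self-information of patterns that repeat inside the window.

    Patterns seen only once contribute nothing, so a window scores highly
    only when it is both unusual (rare under the background) and regular
    (the same pattern recurs). This is an assumed simplification.
    """
    counts = Counter(line_pattern(l) for l in lines if l.strip())
    n = sum(counts.values())
    if n == 0:
        return 0.0
    repeated = sum(
        c * self_information(p, background)
        for p, c in counts.items()
        if c > 1
    )
    return repeated / n


background = build_background([
    "The cat sat on the mat.",
    "It was a sunny day.",
    "She went to the market.",
])
key_value = ["Name:  Alice", "Age:  30", "Name:  Bob", "Age:  41"]
prose = ["We describe a general method.",
         "These regions can present problems for algorithms."]
print(region_score(key_value, background), region_score(prose, background))
```

In this toy setup the key-value window scores higher than the prose window because its line patterns are both rare under the background model and repeated; a real system would need richer features and a much larger background corpus.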


Cited By


C. Anagnostopoulos and P. Triantafillou (2018). Large-scale predictive modeling and analytics through regression queries in data management systems. International Journal of Data Science and Analytics, 9:1 (17-55). Online publication date: 27-Dec-2018.
https://doi.org/10.1007/s41060-018-0163-5

Index Terms

Unsupervised discovery and extraction of semi-structured regions in text via self-information
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources



Information

Published In

AKBC '13: Proceedings of the 2013 workshop on Automated knowledge base construction

October 2013

124 pages

ISBN:9781450324113

DOI:10.1145/2509558

Program Chairs:
Fabian M. Suchanek, Max Planck Institute for Informatics, Germany
Sebastian Riedel, University College London, UK
Sameer Singh, University of Massachusetts Amherst, USA
Partha Pratim Talukdar, Carnegie Mellon University, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tag

semi-structured information extraction

Qualifiers

Poster

Conference

CIKM'13: 22nd ACM International Conference on Information and Knowledge Management

October 27 - 28, 2013

San Francisco, California, USA

Acceptance Rates

AKBC '13 Paper Acceptance Rate 9 of 19 submissions, 47%;

Overall Acceptance Rate 9 of 19 submissions, 47%


Bibliometrics

Total Citations: 1
Total Downloads: 156
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 17 Feb 2025
