poster

Collective extraction from heterogeneous web lists

Authors:
Ashwin Machanavajjhala, Yahoo! Research, Santa Clara, CA, USA
Arun Shankar Iyer, Yahoo! Research, Bangalore, India
Philip Bohannon, Yahoo! Research, Santa Clara, CA, USA
Srujana Merugu, Yahoo! Research, Santa Clara, CA, USA

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining, February 2011, Pages 445–454. https://doi.org/10.1145/1935826.1935894

Published: 09 February 2011

ABSTRACT

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites.

We present a novel solution to this extraction problem that is based on (i) collective extraction from multiple lists simultaneously and (ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real-world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web.
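
The two ideas named in the abstract, a small seed database and layout homogeneity within a list, can be illustrated with a toy sketch. This is not the authors' algorithm: it swaps their model for a plain majority vote, assumes each list item has already been split into fields, and every name in it (`infer_column_labels`, `extract_records`, the seed values, the example lists) is hypothetical.

```python
from collections import Counter


def infer_column_labels(rows, seeds):
    """Vote on one attribute label per column position using seed matches.

    rows  : list of records, each a list of field strings (assumed pre-split)
    seeds : dict mapping attribute name -> set of known lowercase values
    """
    votes = {}
    for row in rows:
        for pos, field in enumerate(row):
            key = field.strip().lower()
            for attr, values in seeds.items():
                if key in values:
                    votes.setdefault(pos, Counter())[attr] += 1
    # Layout homogeneity assumption: one label per column for the whole list.
    return {pos: counter.most_common(1)[0][0] for pos, counter in votes.items()}


def extract_records(rows, labels):
    """Apply the per-column labels to every row, including rows with no seed hit."""
    return [
        {labels[pos]: field.strip() for pos, field in enumerate(row) if pos in labels}
        for row in rows
    ]


# Hypothetical seed database and two lists with different attribute orderings.
seeds = {
    "name": {"blue bottle", "chez panisse"},
    "city": {"oakland", "berkeley"},
}
list_a = [
    ["Blue Bottle", "Oakland"],
    ["Chez Panisse", "Berkeley"],
    ["Zuni Cafe", "San Francisco"],
]
list_b = [
    ["Berkeley", "Chez Panisse", "$$$"],
    ["Palo Alto", "Evvia", "$$"],
]

for rows in (list_a, list_b):
    labels = infer_column_labels(rows, seeds)
    print(extract_records(rows, labels))
```

In this sketch, rows that match no seed entity still receive labels because the column assignment is shared across the list, and columns that never match a seed (the price field in `list_b`) are simply left out. The seed matches shared by the two lists loosely stand in for content redundancy across sources. A vote of this kind ignores the missing attributes and inconsistent formatting that the abstract highlights.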

Index Terms

Collective extraction from heterogeneous web lists
1. Information systems
  1. Information systems applications
    1. Data mining

Published in
WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
870 pages
ISBN: 9781450304931
DOI: 10.1145/1935826
General Chair: Irwin King (CUHK, Hong Kong)
Program Chairs: Wolfgang Nejdl (L3S and University of Hannover, Germany), Hang Li (Microsoft Research Asia, China)
Copyright © 2011 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
collective bayesian models
hidden markov models
incremental
information extraction
Qualifiers
- poster
Acceptance Rates
WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%