ABSTRACT
Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites.
We present a novel solution to this extraction problem that is based on (i) collective extraction from multiple lists simultaneously and (ii) careful exploitation of a small database of seed entities. Our approach accounts for the layout homogeneity within individual lists, the content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real-world data sets and show that our approach is a promising direction for extraction from noisy lists. It requires only mild, and thus inexpensive, supervision, which makes it suitable for extraction from the tail of the web.
Index Terms
- Collective extraction from heterogeneous web lists