DOI: 10.1145/1935826.1935894
poster

Collective extraction from heterogeneous web lists

Published: 09 February 2011

ABSTRACT

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites.

We present a novel solution to this extraction problem that is based on (i) collective extraction from multiple lists simultaneously and (ii) careful exploitation of a small database of seed entities. Our approach accounts for the layout homogeneity within individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real-world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring only mild and thus inexpensive supervision, which makes it suitable for extraction from the tail of the web.
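The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of how a small seed database and within-list layout homogeneity could be combined. It is not the authors' method. It assumes each list item has already been split into fields, assumes attributes appear in the same position in every row (so it ignores missing attributes and cross-list redundancy), and uses hypothetical names (`label_list`, `seeds`, the sample rows) chosen for illustration.

```python
from collections import Counter


def label_list(rows, seeds):
    """Label the fields of one list using seed matches and a per-column vote.

    rows  : list of rows, each a list of field strings (pre-split; an assumption)
    seeds : dict mapping attribute name -> set of known lowercase values
    Returns one list of (label, value) pairs per row; label is None when no
    seed value ever matched that column.
    """
    # Step 1: collect votes. A field that exactly matches a seed value
    # votes for that attribute at its column position.
    votes = {}
    for row in rows:
        for col, field in enumerate(row):
            key = field.strip().lower()
            for attr, values in seeds.items():
                if key in values:
                    votes.setdefault(col, Counter())[attr] += 1

    # Step 2: layout homogeneity. Each column takes its majority attribute.
    column_label = {col: c.most_common(1)[0][0] for col, c in votes.items()}

    # Step 3: propagate the column label to every row, including rows
    # whose values are not in the seed database.
    return [
        [(column_label.get(col), field.strip()) for col, field in enumerate(row)]
        for row in rows
    ]


# Hypothetical seed database and list, for illustration only.
seeds = {"name": {"blue moon cafe"}, "city": {"seattle", "portland"}}
rows = [
    ["Blue Moon Cafe", "Seattle", "555-0100"],
    ["Green Leaf Diner", "Portland", "555-0101"],
    ["Red Door Bistro", "Tacoma", "555-0102"],
]
labeled = label_list(rows, seeds)
```

In this toy run the third row contains no seed values, yet its first two fields still receive the `name` and `city` labels because the other rows establish the column layout; the phone-number column stays unlabeled since no seed attribute covers it.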


      Published in

      WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
      February 2011
      870 pages
      ISBN: 9781450304931
      DOI: 10.1145/1935826

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • poster

      Acceptance Rates

      WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
      Overall Acceptance Rate: 498 of 2,863 submissions, 17%
