Extracting Web Data Using Instance-Based Learning

Zhai, Yanhong; Liu, Bing

doi:10.1007/11581062_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

This paper studies structured data extraction from Web pages, e.g., online product description pages. Existing approaches to data extraction include wrapper induction and automatic methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance (or page) to be extracted with labeled instances (or pages). The key advantage of our method is that it does not need an initial set of labeled pages to learn extraction rules, as wrapper induction does. Instead, the algorithm is able to start extraction from a single labeled instance (or page). A new page needs labeling only when it cannot be extracted. This avoids unnecessary page labeling and addresses a major problem with inductive learning (or wrapper induction), namely that the set of labeled pages may not be representative of all other pages. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates, and pages of the same template can usually be extracted using a single page instance of the template. The key issue is the similarity or distance measure. Traditional measures based on the Euclidean distance or text similarity are not easily applicable in this context because items to be extracted from different pages can be entirely different. This paper proposes a novel similarity measure for the purpose, which is suitable for templated Web pages. Experimental results with product data extraction from 1200 pages in 24 diverse Web sites show that the approach is surprisingly effective. It significantly outperforms existing state-of-the-art systems.
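The label-on-failure loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's similarity measure, so the sketch substitutes a much simpler stand-in: each labeled value is remembered by the run of HTML tags immediately before and after it, and a new page is extracted only if those tag runs match exactly. All names here (`tokenize`, `make_instance`, `extract_field`, `extract_all`, the parameter `k`, and the sample pages) are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")
Instance = Dict[str, Tuple[List[str], List[str]]]  # field -> (prefix tags, suffix tags)

def tokenize(html: str) -> List[str]:
    """Split a page into tag tokens and whitespace-delimited text tokens."""
    return TOKEN_RE.findall(html)

def is_tag(tok: str) -> bool:
    return tok.startswith("<")

def make_instance(html: str, labels: Dict[str, str], k: int = 2) -> Instance:
    """Record up to k tags directly before and after each labeled value.
    Assumption: every labeled value is delimited by tags on both sides."""
    toks = tokenize(html)
    inst: Instance = {}
    for field, value in labels.items():
        val = value.split()
        start = next(i for i in range(len(toks)) if toks[i:i + len(val)] == val)
        end = start + len(val)
        prefix: List[str] = []
        i = start - 1
        while i >= 0 and is_tag(toks[i]) and len(prefix) < k:
            prefix.insert(0, toks[i])
            i -= 1
        suffix: List[str] = []
        j = end
        while j < len(toks) and is_tag(toks[j]) and len(suffix) < k:
            suffix.append(toks[j])
            j += 1
        inst[field] = (prefix, suffix)
    return inst

def extract_field(toks: List[str], prefix: List[str], suffix: List[str]) -> Optional[str]:
    """Return the text found between a matching prefix and suffix, else None."""
    n = len(prefix)
    for i in range(n, len(toks)):
        if toks[i - n:i] != prefix or is_tag(toks[i]):
            continue
        j = i
        while j < len(toks) and not is_tag(toks[j]):
            j += 1
        if toks[j:j + len(suffix)] == suffix:
            return " ".join(toks[i:j])
    return None

def extract_all(pages: List[str], first_labels: Dict[str, str],
                label_page: Callable[[str], Dict[str, str]]):
    """Start from one labeled page; ask for a label only when extraction fails."""
    instances = [make_instance(pages[0], first_labels)]
    results = [dict(first_labels)]
    labeled = 1
    for html in pages[1:]:
        toks = tokenize(html)
        record = None
        for inst in instances:
            cand = {f: extract_field(toks, p, s) for f, (p, s) in inst.items()}
            if all(v is not None for v in cand.values()):
                record = cand
                break
        if record is None:  # no stored instance fits: this page needs labeling
            record = label_page(html)
            instances.append(make_instance(html, record))
            labeled += 1
        results.append(record)
    return results, labeled

# Hypothetical sample data: two templates, two pages each.
pages = [
    '<html><h1 class="t">Red Kettle</h1><p><b>Price</b><span class="p">$19.99</span></p></html>',
    '<html><h1 class="t">Blue Teapot Deluxe</h1><p><b>Price</b><span class="p">$24.50</span></p></html>',
    '<div><table><tr><td>Green Mug</td><td>$7.00</td></tr></table></div>',
    '<div><table><tr><td>Steel Pan</td><td>$31.00</td></tr></table></div>',
]
manual_labels = {pages[2]: {"name": "Green Mug", "price": "$7.00"}}
results, labeled = extract_all(
    pages, {"name": "Red Kettle", "price": "$19.99"}, lambda html: manual_labels[html]
)
```

In this toy run only the first page of each template is labeled by hand; the second page of each template is extracted from the stored instance. Exact tag matching is far more brittle than a similarity measure designed for templated pages, which is the part the paper itself contributes.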

Author information

Authors and Affiliations

Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607, USA
Yanhong Zhai & Bing Liu

Editor information

Editors and Affiliations

Texas State University, San Marcos, TX, USA
Anne H. H. Ngu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan
Masaru Kitsuregawa
University of Vienna, Vienna, Austria
Erich J. Neuhold
IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA
Jen-Yao Chung
School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
Quan Z. Sheng

About this paper

Cite this paper

Zhai, Y., Liu, B. (2005). Extracting Web Data Using Instance-Based Learning. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_24

DOI: https://doi.org/10.1007/11581062_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer Science, Computer Science (R0)
