Wrapper Maintenance for Web-Data Extraction Based on Pages Features

  • Conference paper

Part of the book series: Advances in Soft Computing (AINSC, volume 35)

Abstract

Extracting data from Web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interest. There are two main issues relevant to Web-data extraction, namely wrapper generation and wrapper maintenance. In this paper, we propose a novel approach to automatic wrapper maintenance. It is based on the observation that despite various page changes, many important features of the pages are preserved, such as text pattern features, annotations, and hyperlinks. Our approach uses these preserved features to identify the locations of the desired values in the changed pages, and repairs wrappers correspondingly. Experiments over several real-world Web sites show that the proposed automatic approach can effectively maintain wrappers to extract desired data with high accuracy.
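
The idea of relocating a value through preserved features can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes, purely for illustration, that a wrapper is a tag path, that the text pattern feature is a regular expression, and that the annotation is the label text immediately preceding the value. The names `text_nodes` and `relocate` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (path, text) in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return all (tag_path, text) pairs of a page in document order."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def relocate(html, pattern, annotation):
    """Find the first text node that fully matches `pattern` (assumed text
    pattern feature) and directly follows a node containing `annotation`
    (assumed annotation feature). Returns (tag_path, value) or None."""
    nodes = text_nodes(html)
    for i in range(1, len(nodes)):
        path, text = nodes[i]
        if re.fullmatch(pattern, text) and annotation in nodes[i - 1][1]:
            return path, text
    return None


# Hypothetical pages: the layout changes, the label and price format do not.
OLD_PAGE = "<table><tr><td>Price:</td><td>$12.50</td></tr></table>"
NEW_PAGE = (
    '<div><span class="lbl">Price:</span><b>$9.99</b></div>'
    "<div><span>Shipping:</span><b>$3.00</b></div>"
)
PRICE = r"\$\d+\.\d{2}"

old_path, _ = relocate(OLD_PAGE, PRICE, "Price:")   # path used by the old wrapper
new_path, value = relocate(NEW_PAGE, PRICE, "Price:")  # repaired path and value
```

In this toy setting the old wrapper path `table/tr/td` no longer exists in the changed page, but the preserved label and price format still identify the value, so the wrapper could be repaired by replacing the old path with `new_path`. A realistic system would need to combine several features and handle ambiguous matches.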

Copyright information

© 2006 Springer

About this paper

Cite this paper

Zhou, S., Lin, Y., Wang, J., Yang, X. (2006). Wrapper Maintenance for Web-Data Extraction Based on Pages Features. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol 35. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33521-8_31

  • DOI: https://doi.org/10.1007/3-540-33521-8_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33520-7

  • Online ISBN: 978-3-540-33521-4

  • eBook Packages: Engineering, Engineering (R0)
