Robust Web Data Extraction: A Novel Approach Based on Minimum Cost Script Edit Model

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7529)

Abstract

On script-generated websites, many documents share a common HTML tree structure, which allows wrappers to extract the information of interest from deep web pages effectively. Because the tree structure evolves over time, wrappers break frequently and must be re-learned. In this paper, we explore the problem of constructing robust wrappers for deep web information extraction. To keep extraction robust when a web page changes, we propose a minimum cost script edit model based on machine learning techniques. The model considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting node labels. First, we use machine learning to estimate, for each HTML label, the change frequency of each of the three edit operations from page changes observed in real web data. Then, we compute the corresponding edit costs for the three operations from these change frequencies and the minimum cost model. Finally, we apply the minimum cost edit script to select the most suitable data from which to extract the information of interest. Experimental results show that the proposed approach achieves robust web extraction with high accuracy.
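The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it simplifies the HTML tree to a root-to-node label path, assumes that an edit cost is the negative logarithm of the change frequency (so frequent changes are cheap), and uses invented frequency values. The names `FREQ`, `DEFAULT_FREQ`, `cost`, `min_edit_cost`, and `pick_target` are all hypothetical.

```python
import math

# Hypothetical per-label change frequencies for the three edit operations.
# In the paper these are learned from real web data; the numbers here are invented.
FREQ = {
    "insert": {"div": 0.30, "span": 0.20, "td": 0.05},
    "delete": {"div": 0.25, "span": 0.20, "td": 0.05},
    "substitute": {"div": 0.10, "span": 0.10, "td": 0.02},
}
DEFAULT_FREQ = 0.01  # assumed frequency for labels without an estimate


def cost(op, label):
    """Assumed cost mapping: negative log of the change frequency."""
    return -math.log(FREQ[op].get(label, DEFAULT_FREQ))


def min_edit_cost(old_path, new_path):
    """Cost of the cheapest edit script turning old_path into new_path.

    Paths are lists of HTML labels from the root to a node, a simplification
    of the full tree model described in the abstract.
    """
    m, n = len(old_path), len(new_path)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost("delete", old_path[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost("insert", new_path[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = old_path[i - 1] == new_path[j - 1]
            sub = 0.0 if same else cost("substitute", old_path[i - 1])
            d[i][j] = min(
                d[i - 1][j] + cost("delete", old_path[i - 1]),  # delete a node
                d[i][j - 1] + cost("insert", new_path[j - 1]),  # insert a node
                d[i - 1][j - 1] + sub,                          # keep or relabel
            )
    return d[m][n]


def pick_target(old_path, candidates):
    """Select the candidate path reachable from old_path at minimum cost."""
    return min(candidates, key=lambda path: min_edit_cost(old_path, path))


old = ["html", "body", "div", "table", "tr", "td"]
candidates = [
    ["html", "body", "div", "div", "table", "tr", "td"],  # one extra div wrapper
    ["html", "body", "ul", "li"],                         # unrelated structure
]
best = pick_target(old, candidates)
```

Under these assumed frequencies, `best` is the first candidate, because a single inserted `div` is a cheap and plausible change, whereas reaching the second candidate requires several costly deletions and substitutions.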

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, D., Wang, X., Yan, Z., Li, Q. (2012). Robust Web Data Extraction: A Novel Approach Based on Minimum Cost Script Edit Model. In: Wang, F.L., Lei, J., Gong, Z., Luo, X. (eds) Web Information Systems and Mining. WISM 2012. Lecture Notes in Computer Science, vol 7529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33469-6_62

  • DOI: https://doi.org/10.1007/978-3-642-33469-6_62

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33468-9

  • Online ISBN: 978-3-642-33469-6

  • eBook Packages: Computer Science, Computer Science (R0)
