Robust Web Data Extraction: A Novel Approach Based on Minimum Cost Script Edit Model

Liu, Donglan; Wang, Xinjun; Yan, Zhongmin; Li, Qiuyan

doi:10.1007/978-3-642-33469-6_62

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7529)

Abstract

On script-generated websites, many documents share a common HTML tree structure, which allows wrappers to extract the information of interest from deep web pages effectively. Because the tree structure evolves over time, wrappers break frequently and must be re-learned. In this paper, we explore the problem of constructing robust wrappers for deep web information extraction. To keep extraction robust when web pages change, we propose a minimum cost script edit model based on machine learning techniques. The method considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting node labels. First, we use a machine learning method to obtain, for each HTML label, the change frequency of each of the three edit operations from the page changes observed on real web data. Then, we compute the corresponding edit cost of each operation from these change frequencies and the minimum cost model. Finally, we apply the minimum cost edit script to select the most suitable data from which to extract the information of interest. Experimental results show that the proposed approach achieves robust web extraction with high accuracy.
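The three steps in the abstract (change frequencies, edit costs, minimum cost script) can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the cost formula, so the negative-log cost below is an assumption, the frequency values are invented for illustration, and the tree edit problem is simplified to a weighted edit distance over root-to-node label paths. The names `edit_cost`, `min_script_cost`, and `best_candidate` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical per-label change frequencies in (0, 1]. The values are
# illustrative only and are not taken from the paper.
INSERT_FREQ: Dict[str, float] = {"div": 0.5, "span": 0.4}
DELETE_FREQ: Dict[str, float] = {"div": 0.3}
SUBSTITUTE_FREQ: Dict[str, float] = {"b": 0.2}
DEFAULT_FREQ = 0.01  # assumed frequency for labels with no observations


def edit_cost(freq: Dict[str, float], label: str) -> float:
    """Assumed cost model: cost = -log(frequency), so frequent changes are cheap."""
    return -math.log(freq.get(label, DEFAULT_FREQ))


def min_script_cost(old: Sequence[str], new: Sequence[str]) -> float:
    """Cost of the cheapest insert/delete/substitute script turning old into new.

    Simplification: both arguments are root-to-node label paths, so this is a
    weighted sequence edit distance rather than a full tree edit distance.
    """
    n, m = len(old), len(new)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # delete every label of the old path
        dp[i][0] = dp[i - 1][0] + edit_cost(DELETE_FREQ, old[i - 1])
    for j in range(1, m + 1):  # insert every label of the new path
        dp[0][j] = dp[0][j - 1] + edit_cost(INSERT_FREQ, new[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if old[i - 1] == new[j - 1]:
                substitute = 0.0  # identical labels need no edit
            else:
                substitute = edit_cost(SUBSTITUTE_FREQ, old[i - 1])
            dp[i][j] = min(
                dp[i - 1][j] + edit_cost(DELETE_FREQ, old[i - 1]),
                dp[i][j - 1] + edit_cost(INSERT_FREQ, new[j - 1]),
                dp[i - 1][j - 1] + substitute,
            )
    return dp[n][m]


def best_candidate(old: Sequence[str], candidates: List[List[str]]) -> List[str]:
    """Pick the candidate path in the changed page with the cheapest edit script."""
    return min(candidates, key=lambda path: min_script_cost(old, path))


if __name__ == "__main__":
    old_path = ["html", "body", "div", "table", "tr", "td"]
    candidates = [
        ["html", "body", "ul", "li"],
        ["html", "body", "div", "div", "table", "tr", "td"],
    ]
    print(best_candidate(old_path, candidates))
```

Under these assumed frequencies, the path with one extra `div` wrapper is selected, because a single cheap insertion costs less than the deletions and substitutions needed to reach the `ul`/`li` path.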

Author information

Authors and Affiliations

School of Computer Science and Technology, Shandong University, 1500 Shunhua Road, Jinan, 250101, P.R. China
Donglan Liu, Xinjun Wang & Zhongmin Yan
Shandong Provincial Key Laboratory of Software Engineering, 1500 Shunhua Road, Jinan, 250101, P.R. China
Donglan Liu, Xinjun Wang & Zhongmin Yan
Changchun Institute of Engineering Technology, Changchun, P.R. China
Qiuyan Li

Editor information

Editors and Affiliations

Department of Business Administration, Caritas Institute of Higher Education, 18 Chui Ling Road, Tseung Kwan O, Hong Kong, China
Fu Lee Wang
School of Computer and Information Engineering, Shanghai University of Electric Power, 200090, Shanghai, China
Jingsheng Lei
Department of Computer and Information Science, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Zhiguo Gong
School of Computer, Shanghai University, 200444, Shanghai, China
Xiangfeng Luo

About this paper

Cite this paper

Liu, D., Wang, X., Yan, Z., Li, Q. (2012). Robust Web Data Extraction: A Novel Approach Based on Minimum Cost Script Edit Model. In: Wang, F.L., Lei, J., Gong, Z., Luo, X. (eds) Web Information Systems and Mining. WISM 2012. Lecture Notes in Computer Science, vol 7529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33469-6_62

DOI: https://doi.org/10.1007/978-3-642-33469-6_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33468-9
Online ISBN: 978-3-642-33469-6
eBook Packages: Computer Science, Computer Science (R0)
