CCWrapper: Adaptive Predefined Schema Guided Web Extraction

Gao, Jun; Yang, Dongqing; Wang, Tengjiao

doi:10.1007/11775300_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4016)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

In this paper, we propose CCWrapper (Classification-Cluster), a method for extracting target data items from web pages under the guidance of a predefined schema. CCWrapper combines several features of HTML nodes, including style, structure, thesaurus, and data-type attributes, into one unified model, and generates extraction rules with Bayes classification in the training step. When a new HTML page is processed, CCWrapper computes, for each HTML node, the probability that the node is a target element, and then clusters the HTML nodes for extraction based on their intra-document relationships in the HTML document tree. Preliminary experimental results on real-life web sites demonstrate that CCWrapper is a promising extraction method.
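
The two steps named in the abstract (Bayes classification over node features, then clustering by position in the document tree) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names (style, tag, keyword, dtype), the labels, the toy training set, the parent-path grouping rule, and the 0.5 threshold are all assumptions chosen for illustration, and a plain categorical naive Bayes with add-one smoothing stands in for whatever Bayes model the paper uses.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, samples):
        # samples: list of (feature_dict, label) pairs
        self.label_counts = Counter(label for _, label in samples)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> values seen in training
        for features, label in samples:
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def predict_proba(self, features):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            for name, value in features.items():
                seen = self.value_counts[(label, name)][value]
                vocab = len(self.values[name]) + 1  # one extra slot for unseen values
                score += math.log((seen + 1) / (count + vocab))
            log_scores[label] = score
        # Normalize the log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exp.values())
        return {k: v / norm for k, v in exp.items()}


def cluster_by_parent(nodes):
    """Group nodes that share the same parent path (assumed grouping rule)."""
    clusters = defaultdict(list)
    for node in nodes:
        parent = node["path"].rsplit("/", 1)[0]
        clusters[parent].append(node)
    return dict(clusters)


def extract(model, nodes, threshold=0.5):
    """Build one record per cluster from nodes classified as schema elements."""
    records = []
    for group in cluster_by_parent(nodes).values():
        record = {}
        for node in group:
            proba = model.predict_proba(node["features"])
            label = max(proba, key=proba.get)
            if label != "other" and proba[label] >= threshold:
                record[label] = node["text"]
        if record:
            records.append(record)
    return records


def feats(style, tag, keyword, dtype):
    # Hypothetical feature set: style, structure (tag), thesaurus hit, data type.
    return {"style": style, "tag": tag, "keyword": keyword, "dtype": dtype}


# Toy labeled nodes for a hypothetical schema with "title" and "price" elements.
TRAIN = [
    (feats("bold", "td", "title", "text"), "title"),
    (feats("bold", "a", "title", "text"), "title"),
    (feats("plain", "td", "price", "number"), "price"),
    (feats("plain", "span", "price", "number"), "price"),
    (feats("plain", "td", "none", "text"), "other"),
    (feats("plain", "div", "none", "text"), "other"),
]

# Nodes of a new page, each with its path in the HTML document tree.
PAGE = [
    {"path": "/html/body/table/tr[1]/td[1]", "text": "Database Systems",
     "features": feats("bold", "td", "title", "text")},
    {"path": "/html/body/table/tr[1]/td[2]", "text": "59.90",
     "features": feats("plain", "td", "price", "number")},
    {"path": "/html/body/table/tr[2]/td[1]", "text": "Web Mining",
     "features": feats("bold", "td", "title", "text")},
    {"path": "/html/body/table/tr[2]/td[2]", "text": "42.00",
     "features": feats("plain", "td", "price", "number")},
    {"path": "/html/body/div[1]", "text": "Copyright notice",
     "features": feats("plain", "div", "none", "text")},
]

model = NaiveBayes().fit(TRAIN)
records = extract(model, PAGE)
```

In this sketch each table row becomes one record because its cells share a parent path, while the footer node is classified as "other" and dropped. A real page would likely need a richer notion of tree proximity than exact parent equality.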

Supported by NSFC (Project 60503037) and the Beijing Natural Science Foundation (Project 4062018).

Author information

Authors and Affiliations

Department of Computer Science and Technology, Peking University, Beijing, China
Jun Gao, Dongqing Yang & Tengjiao Wang

Editor information

Editors and Affiliations

Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
Department of Computing, Hong Kong Polytechnic University, Hong Kong
Hong Va Leong

About this paper

Cite this paper

Gao, J., Yang, D., Wang, T. (2006). CCWrapper: Adaptive Predefined Schema Guided Web Extraction. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_24

DOI: https://doi.org/10.1007/11775300_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35225-9
Online ISBN: 978-3-540-35226-6
eBook Packages: Computer Science, Computer Science (R0)
