CCWrapper: Adaptive Predefined Schema Guided Web Extraction

  • Conference paper
Advances in Web-Age Information Management (WAIM 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4016)

In this paper, we propose CCWrapper (Classification-Cluster), a method for extracting target data items from web pages under the guidance of a predefined schema. CCWrapper combines several features of HTML nodes, including style, structure, thesaurus, and data type attributes, into one unified model, and in the training step generates extraction rules using Bayes classification. When a new HTML page is processed, CCWrapper computes, for each HTML node, the probability that the node is a target element, and then clusters the nodes for extraction based on their intra-document relationships in the HTML document tree. Preliminary experimental results on real-life web sites indicate that CCWrapper is a promising extraction method.
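The two-stage idea in the abstract (score each node with a Bayes classifier, then group nodes using the document tree) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the feature encoding, the classifier details, or the clustering criterion. The `Node` fields, the Laplace-smoothed naive Bayes model, the `threshold` parameter, and grouping by shared parent path are all assumptions chosen only to keep the example short.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical flattened view of one HTML node."""
    path: str     # position in the document tree, e.g. "p/li[1]/span[2]"
    style: str    # style feature, e.g. "bold"
    dtype: str    # data type feature, e.g. "currency"
    keyword: str  # thesaurus hit for the node, or "none"


def features(node):
    return {"style": node.style, "dtype": node.dtype, "keyword": node.keyword}


class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (assumed model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)           # feature -> values seen in training

    def fit(self, nodes, labels):
        for node, label in zip(nodes, labels):
            self.label_counts[label] += 1
            for name, value in features(node).items():
                self.feat_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def predict_proba(self, node):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            for name, value in features(node).items():
                seen = self.feat_counts[(label, name)]
                vocab = len(self.values[name] | {value})
                score += math.log((seen[value] + self.alpha)
                                  / (count + self.alpha * vocab))
            log_scores[label] = score
        # Normalise in a numerically stable way.
        peak = max(log_scores.values())
        exp = {lab: math.exp(s - peak) for lab, s in log_scores.items()}
        norm = sum(exp.values())
        return {lab: v / norm for lab, v in exp.items()}


def parent(path):
    return path.rsplit("/", 1)[0]


def extract(model, nodes, fields, threshold=0.5):
    """Group confident nodes by parent path; keep the best node per field."""
    best = defaultdict(dict)
    for node in nodes:
        probs = model.predict_proba(node)
        label = max(probs, key=probs.get)
        if label in fields and probs[label] >= threshold:
            current = best[parent(node.path)].get(label)
            if current is None or probs[label] > current[1]:
                best[parent(node.path)][label] = (node.path, probs[label])
    return {p: {f: hit[0] for f, hit in rec.items()} for p, rec in best.items()}


# Toy training nodes labelled with schema fields ("other" = not a target).
train = [
    (Node("r/tr[1]/td[1]", "bold", "text", "title"), "title"),
    (Node("r/tr[1]/td[2]", "plain", "currency", "price"), "price"),
    (Node("r/tr[2]/td[1]", "bold", "text", "title"), "title"),
    (Node("r/tr[2]/td[2]", "plain", "currency", "price"), "price"),
    (Node("r/div[1]", "plain", "text", "none"), "other"),
    (Node("r/div[2]", "italic", "text", "none"), "other"),
]
model = NaiveBayes().fit([n for n, _ in train], [lab for _, lab in train])

# A new page with a different layout but similar node features.
page = [
    Node("p/li[1]/span[1]", "bold", "text", "title"),
    Node("p/li[1]/span[2]", "plain", "currency", "price"),
    Node("p/li[2]/span[1]", "bold", "text", "title"),
    Node("p/li[2]/span[2]", "plain", "currency", "none"),
    Node("p/footer", "italic", "text", "none"),
]
records = extract(model, page, fields={"title", "price"})
```

In this toy run, `records` maps each `li` element to the paths of its title and price nodes, and the footer is dropped because its most likely label is `other`. Grouping by parent path is only the simplest stand-in for the intra-document relationships mentioned in the abstract; a real page would need a more tolerant notion of tree proximity.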

Supported by NSFC (Project 60503037) and the Beijing Natural Science Foundation (Project 4062018).


© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gao, J., Yang, D., Wang, T. (2006). CCWrapper: Adaptive Predefined Schema Guided Web Extraction. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg.

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35225-9

  • Online ISBN: 978-3-540-35226-6

  • eBook Packages: Computer Science, Computer Science (R0)
