Abstract
In this paper, we propose a method called CCWrapper (Classification-Cluster) that extracts target data items from web pages under the guidance of a predefined schema. CCWrapper combines several features of HTML nodes, including style, structure, thesaurus, and data type attributes, into one unified model, and in the training step generates extraction rules using Bayesian classification. When a new HTML page is processed, CCWrapper computes, for each HTML node, the probability that it is a target element, and then clusters the HTML nodes for extraction based on their intra-document relationships in the HTML document tree. Preliminary experimental results on real-life web sites suggest that CCWrapper is a promising extraction method.
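The two stages described above, per-node classification followed by grouping in the document tree, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a categorical naive Bayes classifier with Laplace smoothing, and it stands in for the clustering step by simply grouping nodes that share a parent path. The feature names (`tag`, `style`, `dtype`, `keyword`), the labels, the training samples, and the node paths are all hypothetical.

```python
from collections import Counter, defaultdict
import math

# Hypothetical labelled nodes: (features, schema label). All values are illustrative.
TRAIN = [
    ({"tag": "span", "style": "bold", "dtype": "currency", "keyword": "price"}, "price"),
    ({"tag": "span", "style": "bold", "dtype": "currency", "keyword": "none"}, "price"),
    ({"tag": "h2", "style": "large", "dtype": "text", "keyword": "none"}, "title"),
    ({"tag": "a", "style": "large", "dtype": "text", "keyword": "none"}, "title"),
    ({"tag": "div", "style": "plain", "dtype": "text", "keyword": "none"}, "other"),
    ({"tag": "p", "style": "plain", "dtype": "text", "keyword": "none"}, "other"),
]

def train(samples):
    """Count labels and per-label feature values (the 'training step')."""
    label_counts = Counter(label for _, label in samples)
    feat_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values = defaultdict(set)           # feature name -> values seen in training
    for feats, label in samples:
        for name, value in feats.items():
            feat_counts[(label, name)][value] += 1
            values[name].add(value)
    return label_counts, feat_counts, values

def posterior(model, feats):
    """Return P(label | features) under naive Bayes with Laplace smoothing."""
    label_counts, feat_counts, values = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)
        for name, value in feats.items():
            k = len(values[name]) + 1  # one extra slot for unseen values
            score += math.log((feat_counts[(label, name)][value] + 1) / (n + k))
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: e / z for label, e in exp.items()}

def cluster_by_parent(nodes, model, threshold=0.5):
    """Group confident target nodes by parent path (a simplified stand-in for
    clustering on intra-document relationships)."""
    groups = defaultdict(dict)
    for path, feats in nodes:
        probs = posterior(model, feats)
        label = max(probs, key=probs.get)
        if label == "other" or probs[label] < threshold:
            continue
        parent = path.rsplit("/", 1)[0]
        best = groups[parent].get(label)
        if best is None or probs[label] > best[1]:
            groups[parent][label] = (path, probs[label])
    return {parent: {label: path for label, (path, _) in items.items()}
            for parent, items in groups.items()}

# Hypothetical nodes from an unseen page, as (path in the tree, features).
PAGE = [
    ("/html/body/div[1]/h2", {"tag": "h2", "style": "large", "dtype": "text", "keyword": "none"}),
    ("/html/body/div[1]/span", {"tag": "span", "style": "bold", "dtype": "currency", "keyword": "price"}),
    ("/html/body/div[1]/p", {"tag": "p", "style": "plain", "dtype": "text", "keyword": "none"}),
    ("/html/body/div[2]/h2", {"tag": "h2", "style": "large", "dtype": "text", "keyword": "none"}),
    ("/html/body/div[2]/span", {"tag": "span", "style": "bold", "dtype": "currency", "keyword": "none"}),
]

model = train(TRAIN)
records = cluster_by_parent(PAGE, model)
```

Here each entry of `records` maps a parent path to the nodes chosen for the schema fields found under it, so one entry corresponds roughly to one extracted record. A real system would need richer features and a more careful notion of node relatedness than a shared parent.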
Supported by NSFC (Project 60503037) and the Beijing Natural Science Foundation (Project 4062018).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gao, J., Yang, D., Wang, T. (2006). CCWrapper: Adaptive Predefined Schema Guided Web Extraction. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_24
DOI: https://doi.org/10.1007/11775300_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35225-9
Online ISBN: 978-3-540-35226-6
eBook Packages: Computer Science, Computer Science (R0)