Abstract
We propose a general hierarchical vertical classification framework that automatically discovers the inherent hierarchical structure of relationships among verticals from flat datasets and then builds a hierarchical classifier. To evaluate it, we conducted comparison experiments on flat versus hierarchical structures of relationships, and across different feature selection and classification methods. The experimental results show that hierarchical classifiers built with the proposed framework substantially outperform flat classifiers when classifying unseen web pages. Among the configurations tested, the Support Vector Machine using Odds Ratio to select discriminative features performs best.
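As an illustration of the Odds Ratio feature selection mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes binary term presence per page, one vertical treated as the positive class against all others, and additive smoothing of 0.5 to avoid division by zero. The function names `odds_ratio` and `select_features` and the sample terms are hypothetical.

```python
import math


def odds_ratio(term, pos_docs, neg_docs, smoothing=0.5):
    """Log odds ratio of `term` for the positive class.

    Each document is a set of terms. The smoothing constant is an
    assumption of this sketch, not a value taken from the paper.
    """
    pos_hits = sum(1 for doc in pos_docs if term in doc)
    neg_hits = sum(1 for doc in neg_docs if term in doc)
    # Smoothed estimates of P(term | positive) and P(term | negative).
    p_pos = (pos_hits + smoothing) / (len(pos_docs) + 2 * smoothing)
    p_neg = (neg_hits + smoothing) / (len(neg_docs) + 2 * smoothing)
    return math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))


def select_features(pos_docs, neg_docs, k):
    """Return the k terms with the highest odds ratio (ties broken alphabetically)."""
    vocab = set().union(*pos_docs, *neg_docs)
    ranked = sorted(vocab, key=lambda t: (-odds_ratio(t, pos_docs, neg_docs), t))
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: two pages from a hypothetical "shopping" vertical vs. two others.
    shopping = [{"price", "cart"}, {"price", "review"}]
    other = [{"news", "review"}, {"news", "sports"}]
    print(select_features(shopping, other, 2))
```

Terms that occur mostly in the positive vertical receive a positive score, terms spread evenly across both sides score near zero, and the selected terms would then serve as the input features of a classifier such as an SVM.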
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, L., Song, D., Liao, L. (2012). Vertical Classification of Web Pages for Structured Data Extraction. In: Hou, Y., Nie, JY., Sun, L., Wang, B., Zhang, P. (eds) Information Retrieval Technology. AIRS 2012. Lecture Notes in Computer Science, vol 7675. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35341-3_44
DOI: https://doi.org/10.1007/978-3-642-35341-3_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35340-6
Online ISBN: 978-3-642-35341-3
eBook Packages: Computer Science (R0)