Abstract
Web page classification is challenging because the Web is a huge and highly diverse repository of information. Web pages vary in both content and quality, as PageRank suggests. Typical machine learning algorithms train a classifier from positive and negative examples, but they neglect that each instance can carry a different weight, which may be predefined by the user. This paper presents an effective algorithm based on the Cost-Sensitive Support Vector Machine (CS-SVM) to improve classification accuracy. During CS-SVM training, different cost factors are attached to the training errors to produce an optimized hyperplane. Our experiments show that CS-SVM outperforms SVM on the standard ODP data set. Web pages with relatively high PageRank values contribute most to the classifier, and training on them can outperform training on a random sample.
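The idea of attaching a cost factor to each training error can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so the code below assumes a common per-instance weighted hinge loss, 0.5 * ||w||^2 + sum_i c_i * max(0, 1 - y_i * <w, x_i>), with a linear hyperplane through the origin (no bias term, no kernel) and plain subgradient descent in place of a quadratic programming solver. All function names, the toy data, and the cost values (standing in for normalized PageRank scores) are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cs_svm_objective(w, X, y, costs) -> float:
    """Assumed objective: 0.5 * ||w||^2 + sum_i c_i * max(0, 1 - y_i * <w, x_i>)."""
    hinge = sum(
        c * max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi, c in zip(X, y, costs)
    )
    return 0.5 * dot(w, w) + hinge


def train_cs_svm(X, y, costs, lr: float = 0.01, epochs: int = 500) -> List[float]:
    """Minimize the objective above by full-batch subgradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(w)  # gradient of the 0.5 * ||w||^2 term
        for xi, yi, c in zip(X, y, costs):
            if yi * dot(w, xi) < 1.0:  # margin violated, so the hinge term is active
                for j, xij in enumerate(xi):
                    grad[j] -= c * yi * xij  # errors on costly instances pull harder
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, x) -> int:
    """Classify by the side of the hyperplane <w, x> = 0."""
    return 1 if dot(w, x) >= 0.0 else -1


# Toy, made-up data: two features per page, labels in {+1, -1}.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
# Hypothetical per-page importance (e.g. a normalized PageRank value).
costs = [1.0, 0.5, 0.2, 0.8, 0.4, 0.1]

w = train_cs_svm(X, y, costs)
predictions = [predict(w, x) for x in X]
```

Setting every cost to the same constant recovers an ordinary soft-margin linear SVM under these assumptions, so the per-instance costs are the only change relative to the baseline.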
This work was conducted while the author was an intern at Microsoft Research Asia.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, W., Xue, Gr., Yu, Y., Zeng, Hj. (2005). Importance-Based Web Page Classification Using Cost-Sensitive SVM. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_12
DOI: https://doi.org/10.1007/11563952_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29227-2
Online ISBN: 978-3-540-32087-6
eBook Packages: Computer Science, Computer Science (R0)