Importance-Based Web Page Classification Using Cost-Sensitive SVM

Liu, Wei; Xue, Gui-rong; Yu, Yong; Zeng, Hua-jun

doi:10.1007/11563952_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3739)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Web page classification faces great challenges because the Web is a huge and highly diverse repository of information. Web pages vary in both content and quality, as PageRank suggests. Typical machine learning algorithms use positive and negative examples to train a classifier; however, they neglect the fact that each instance can carry a different weight, which may be pre-defined by the user. This paper presents an effective algorithm based on the Cost-Sensitive Support Vector Machine (CS-SVM) to improve classification accuracy. During CS-SVM training, different cost factors are attached to the training errors to generate an optimized hyperplane. Our experiments show that CS-SVM outperforms SVM on the standard ODP data set. Web pages with relatively high PageRank values contribute most to the classifier, and training on them outperforms random sampling.
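
The abstract does not state the exact CS-SVM formulation, so the following is only a minimal sketch of the general idea of attaching a cost factor to each training error. It assumes a linear soft-margin SVM whose hinge loss for example i is multiplied by a cost c_i (which could, for instance, be derived from a page's PageRank), trained by plain subgradient descent. The function names, hyperparameters, and toy data are hypothetical and are not taken from the paper.

```python
# Sketch of a cost-sensitive linear SVM (an assumption, not the paper's code).
# Objective: 0.5*||w||^2 + C * sum_i cost_i * max(0, 1 - y_i*(w.x_i + b))

def train_cs_svm(xs, ys, costs, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on the cost-weighted soft-margin objective."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regularizer 0.5*||w||^2
        grad_b = 0.0
        for x, y, c in zip(xs, ys, costs):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # margin violated: the hinge term is active
                for j in range(dim):
                    grad_w[j] -= C * c * y * x[j]
                grad_b -= C * c * y
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical 1-D toy data: two clean examples plus two conflicting
# examples at x = 0, one labeled +1 and one labeled -1.
xs = [[-2.0], [2.0], [0.0], [0.0]]
ys = [-1, 1, 1, -1]

# A higher cost on the positive example at x = 0 ...
w_pos, b_pos = train_cs_svm(xs, ys, costs=[1.0, 1.0, 5.0, 1.0])
# ... versus a higher cost on the negative example at x = 0.
w_neg, b_neg = train_cs_svm(xs, ys, costs=[1.0, 1.0, 1.0, 5.0])

print(predict(w_pos, b_pos, [0.0]), predict(w_neg, b_neg, [0.0]))
```

On this toy data the two conflicting examples cannot both be classified correctly, so the hyperplane shifts toward satisfying whichever one carries the larger cost factor. That is the sense in which important pages can contribute more to the classifier than unimportant ones.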

This work was conducted while the author was an intern at Microsoft Research Asia.

Author information

Authors and Affiliations

Shanghai Jiao Tong University, No.800, Dongchuan Road, Min Hang Shanghai, 200240, China
Wei Liu & Gui-rong Xue
Computer Science Department, Shanghai Jiao Tong University, Shanghai, 200030, China
Yong Yu
Microsoft Research Asia, 5/F, Beijing Sigma Center, No.49, Zhichun Road, Hai Dian District, Beijing, 100080, China
Hua-jun Zeng

Editor information

Editors and Affiliations

University of Edinburgh & Bell Laboratories
Wenfei Fan
College of Computer Science, Zhejiang University, 310027, Hangzhou, Zhejiang, China
Zhaohui Wu
Dept. of E. I. E, Huazhong University of Science and Technology, Wuhan, China
Jun Yang

About this paper

Cite this paper

Liu, W., Xue, Gr., Yu, Y., Zeng, Hj. (2005). Importance-Based Web Page Classification Using Cost-Sensitive SVM. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_12

DOI: https://doi.org/10.1007/11563952_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29227-2
Online ISBN: 978-3-540-32087-6
eBook Packages: Computer Science, Computer Science (R0)
