Importance-Based Web Page Classification Using Cost-Sensitive SVM

  • Conference paper
Advances in Web-Age Information Management (WAIM 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3739)

Abstract

Web page classification faces great challenges because the Web is a huge and diverse repository of information. Web pages vary in both content and quality, as PageRank suggests. Typical machine learning algorithms train a classifier from positive and negative examples; however, they neglect that each instance can carry a different weight, which may be pre-defined by the user. This paper presents an effective algorithm based on the Cost-Sensitive Support Vector Machine (CS-SVM) to improve classification accuracy. During CS-SVM training, different cost factors are attached to the training errors to generate an optimized hyperplane. Our experiments show that CS-SVM outperforms SVM on the standard ODP data set. Web pages with relatively high PageRank values contribute most to the classifier, and training on them outperforms training on a random sample.
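
The idea of attaching a cost factor to each training error can be illustrated with a minimal sketch. The abstract does not give the exact CS-SVM formulation or solver, so everything below is an assumption: a linear SVM whose hinge loss is multiplied by a per-instance cost, trained with plain full-batch subgradient descent. The function names, the hyperparameters (`lam`, `lr`, `epochs`), and the toy data are hypothetical; the cost of 10 stands in for a page with a high importance score such as PageRank.

```python
def train_cost_sensitive_svm(X, y, costs, lam=0.01, lr=0.05, epochs=500):
    """Minimise lam/2 * ||w||^2 + (1/n) * sum_i costs[i] * max(0, 1 - y[i] * (w.x_i + b))
    by full-batch subgradient descent (an assumed solver, not the paper's).
    Returns the weight vector w and the bias b."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regulariser
        grad_b = 0.0
        for xi, yi, ci in zip(X, y, costs):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:              # instance violates the margin
                for j in range(d):
                    grad_w[j] -= ci * yi * xi[j] / n   # error scaled by its cost
                grad_b -= ci * yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical 1-D toy data: one "important" positive page shares its
# feature value (x = 1) with two noisy negative pages.
X = [[-1.0], [-1.0], [-1.0], [1.0], [1.0], [1.0]]
y = [-1, -1, -1, 1, -1, -1]
uniform = [1.0] * 6                           # ordinary SVM: every error costs the same
weighted = [1.0, 1.0, 1.0, 10.0, 1.0, 1.0]    # assumed importance-based costs

w_u, b_u = train_cost_sensitive_svm(X, y, uniform)
w_c, b_c = train_cost_sensitive_svm(X, y, weighted)
print(predict(w_u, b_u, [1.0]), predict(w_c, b_c, [1.0]))  # prints: -1 1
```

With uniform costs the two noisy negatives outvote the single positive at x = 1; with the larger cost on the positive instance, the hyperplane moves so that the important page is classified correctly. This only demonstrates the weighting mechanism and says nothing about the accuracy figures reported in the paper.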

This work was conducted while the author was doing an internship at Microsoft Research Asia.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, W., Xue, G.-R., Yu, Y., Zeng, H.-J. (2005). Importance-Based Web Page Classification Using Cost-Sensitive SVM. In: Fan, W., Wu, Z., Yang, J. (eds) Advances in Web-Age Information Management. WAIM 2005. Lecture Notes in Computer Science, vol 3739. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563952_12

  • DOI: https://doi.org/10.1007/11563952_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29227-2

  • Online ISBN: 978-3-540-32087-6

  • eBook Packages: Computer Science, Computer Science (R0)
