Abstract
Webpage categorization has become an important topic in recent years. Because text is usually the main content of a webpage, automatic text categorization (ATC) is the key technique for this task. For Chinese text categorization as well as Chinese webpage categorization, one of the most basic and urgent problems is the construction of a good benchmark corpus. This study presents a machine learning approach to refining a corpus for Chinese webpage categorization, in which the AdaBoost algorithm is used to identify outliers in the corpus. A webpage categorization system is built with the standard k-nearest-neighbor (kNN) algorithm under a vector space model (VSM). Simulation results, together with manual inspection of the identified outliers, indicate that the presented method works well.
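The abstract does not say how the AdaBoost output is turned into an outlier decision, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes a common heuristic: samples that keep receiving large boosting weights are hard to fit and are therefore outlier candidates. The decision-stump weak learner, the toy one-dimensional data, the averaging of weights over rounds, and the `factor` threshold are all hypothetical choices made for illustration.

```python
import math

def stump_predict(x, j, thr, polarity):
    """Decision stump: predict `polarity` if feature j >= thr, else -polarity."""
    return polarity if x[j] >= thr else -polarity

def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for polarity in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict(x, j, thr, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def average_boosting_weights(X, y, rounds=3):
    """Run discrete AdaBoost and return each sample's weight averaged over rounds."""
    n = len(X)
    w = [1.0 / n] * n
    totals = [0.0] * n
    done = 0
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        if err >= 0.5:          # weak learner no better than chance: stop
            break
        err = max(err, 1e-10)   # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Misclassified samples are up-weighted, correct ones down-weighted.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, j, thr, pol))
             for x, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
        totals = [t + wi for t, wi in zip(totals, w)]
        done += 1
    if done == 0:
        return w
    return [t / done for t in totals]

def flag_outlier_candidates(avg_weights, factor=2.0):
    """Assumed rule: flag samples whose average weight exceeds factor / n."""
    n = len(avg_weights)
    return [i for i, a in enumerate(avg_weights) if a > factor / n]

# Toy corpus stand-in: one feature, labels in {-1, +1}; the sample at
# index 1 carries a deliberately wrong label.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0], [10.0]]
y = [-1, 1, -1, -1, -1, 1, 1, 1, 1, 1]

avg = average_boosting_weights(X, y, rounds=3)
print(flag_outlier_candidates(avg, factor=2.0))
```

In a real corpus-refinement setting the feature vectors would come from the VSM representation of each webpage, and the flagged items would be candidates for the kind of manual inspection the abstract mentions rather than being removed automatically.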
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Luo, D., Wang, X., Wu, X., Chi, H. (2005). Learning Outliers to Refine a Corpus for Chinese Webpage Categorization. In: Wang, L., Chen, K., Ong, Y.S. (eds) Advances in Natural Computation. ICNC 2005. Lecture Notes in Computer Science, vol 3610. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11539087_19
DOI: https://doi.org/10.1007/11539087_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28323-2
Online ISBN: 978-3-540-31853-8
eBook Packages: Computer Science, Computer Science (R0)