Learning Outliers to Refine a Corpus for Chinese Webpage Categorization

Luo, Dingsheng; Wang, Xinhao; Wu, Xihong; Chi, Huisheng

doi:10.1007/11539087_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3610)

Included in the following conference series:

International Conference on Natural Computation

Abstract

Webpage categorization has become an important topic in recent years. Text is usually the main content of a webpage, so automatic text categorization (ATC) is the key technique for this task. For Chinese text categorization, and likewise for Chinese webpage categorization, one basic and urgent problem is the construction of a good benchmark corpus. This study presents a machine learning approach to refining a corpus for Chinese webpage categorization, in which the AdaBoost algorithm is adopted to identify outliers in the corpus. The standard k-nearest-neighbor (kNN) algorithm under a vector space model (VSM) is adopted to construct the webpage categorization system. Simulation results, together with manual inspection of the identified outliers, indicate that the presented method works well.
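The abstract does not say how AdaBoost is used to pick out outliers, so the following is only a minimal sketch of one common heuristic, not the authors' procedure. It relies on the fact that AdaBoost raises the weight of samples the weak learners keep misclassifying, and it flags samples whose weight, averaged over the boosting rounds, is well above the uniform weight 1/n. The decision-stump weak learner, the averaging criterion, the `factor` threshold, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def best_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                # The stump predicts `pol` when the feature is >= thr, else `-pol`.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[f] >= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] >= thr else -pol

def mean_boosting_weights(X, y, rounds=3):
    """Run discrete AdaBoost (labels in {-1, +1}) and return each sample's
    weight averaged over the completed rounds."""
    n = len(X)
    w = [1.0 / n] * n
    totals = [0.0] * n
    done = 0
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = stump[0]
        if err >= 0.5:          # weak learner no better than chance: stop
            break
        err = max(err, 1e-10)   # avoid division by zero on a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Misclassified samples (y * h(x) = -1) are scaled up, the rest down.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
        totals = [t + wi for t, wi in zip(totals, w)]
        done += 1
    if done == 0:
        return w
    return [t / done for t in totals]

def flag_outliers(mean_w, factor=2.0):
    """Indices whose mean weight exceeds `factor` times the uniform weight."""
    n = len(mean_w)
    return [i for i, m in enumerate(mean_w) if m > factor / n]

# Hypothetical toy corpus: one numeric feature per document; documents with a
# value below 5 belong to class -1, except index 1, which is mislabelled as +1.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [-1, +1, -1, -1, -1, +1, +1, +1, +1, +1]

print(flag_outliers(mean_boosting_weights(X, y, rounds=3)))
```

In a corpus-refinement setting, the flagged indices would be candidates for manual inspection rather than automatic deletion, which matches the abstract's mention of manually investigating the identified outliers.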

Author information

Authors and Affiliations

National Laboratory on Machine Perception, School of Electronics Engineering & Computer Science, Peking University, Beijing, 100871, China
Dingsheng Luo, Xinhao Wang, Xihong Wu & Huisheng Chi

Editor information

Editors and Affiliations

School of Electrical and Electronic Engineering, Nanyang Technological University, Block S1, Nanyang Avenue, 639798, Singapore
Lipo Wang
School of Software, Sun Yat-Sen University, 510275, Guangzhou, China
Ke Chen
School of Computer Engineering, Nanyang Technological University, BLK N4, 2b-39, Nanyang Avenue, 639798, Singapore
Yew Soon Ong

About this paper

Cite this paper

Luo, D., Wang, X., Wu, X., Chi, H. (2005). Learning Outliers to Refine a Corpus for Chinese Webpage Categorization. In: Wang, L., Chen, K., Ong, Y.S. (eds) Advances in Natural Computation. ICNC 2005. Lecture Notes in Computer Science, vol 3610. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11539087_19

DOI: https://doi.org/10.1007/11539087_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28323-2
Online ISBN: 978-3-540-31853-8
eBook Packages: Computer Science, Computer Science (R0)
