Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure

Zhou, Shuigeng; Fan, Ye; Hu, Jiangtao; Yu, Fang; Hu, Yunfa

doi:10.1007/3-540-45151-X_20


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1846)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

This paper reports a system that hierarchically classifies Chinese Web documents without dictionary support or a segmentation procedure. In our classifier, Web documents are represented by N-grams (N≤4), which are easy to extract. A boosting machine learning approach is applied to classify Chinese Web documents that share a topic hierarchy. The open, modular system architecture makes our classifier extensible. Experimental results show that our system can classify Chinese Web documents effectively and efficiently.
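
To make the N-gram representation concrete, the following is a minimal sketch of dictionary-free feature extraction. It is an illustration under stated assumptions, not the authors' implementation: the function names `is_cjk` and `char_ngrams` are hypothetical, N-grams are taken over runs of consecutive Chinese characters, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is treated as Chinese text. The abstract does not say how the system handles punctuation or mixed-language text.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block.

    Assumption: only U+4E00..U+9FFF counts as Chinese text here.
    """
    return "\u4e00" <= ch <= "\u9fff"


def char_ngrams(text: str, max_n: int = 4) -> Counter:
    """Count character N-grams (1 <= N <= max_n) within runs of Chinese characters.

    No dictionary lookup or word segmentation is involved: every substring
    of length 1..max_n inside a run of Chinese characters is a feature.
    """
    counts = Counter()
    run = []
    # The trailing newline acts as a sentinel so the last run is flushed.
    for ch in text + "\n":
        if is_cjk(ch):
            run.append(ch)
            continue
        # A non-Chinese character ends the current run: emit its N-grams.
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                counts["".join(run[i:i + n])] += 1
        run = []
    return counts
```

Calling `char_ngrams("中文网页分类")` yields one count for each of the 18 substrings of length 1 to 4. Because N-grams never cross punctuation in this sketch, `char_ngrams("中文，中文")` counts "中文" twice and does not produce "文中". The resulting counts could serve as a document vector for a downstream classifier.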

This work is supported by the 973 High-Tech Projects Foundation of China and partially supported by a grant (No. 69933010) from NSFC.

Author information

Authors and Affiliations

Computer Science Department, Fudan University, Shanghai, 200433, China
Shuigeng Zhou, Ye Fan, Jiangtao Hu, Fang Yu & Yunfa Hu

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Hongjun Lu
Department of Computer Science, Fudan University, 220 Handan Road, Shanghai, China
Aoying Zhou

About this paper

Cite this paper

Zhou, S., Fan, Y., Hu, J., Yu, F., Hu, Y. (2000). Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_20

DOI: https://doi.org/10.1007/3-540-45151-X_20
Published: 07 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67627-0
Online ISBN: 978-3-540-45151-8
eBook Packages: Springer Book Archive
