Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure

  • Conference paper
Web-Age Information Management (WAIM 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1846)

Abstract

This paper reports a system that hierarchically classifies Chinese Web documents without dictionary support or a segmentation procedure. In our classifier, Web documents are represented by N-grams (N ≤ 4), which are easy to extract. A boosting machine learning approach is applied to classify Chinese Web documents that share a topic hierarchy. The open, modular system architecture makes the classifier extensible. Experimental results show that the system classifies Chinese Web documents effectively and efficiently.
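As an illustration of the N-gram representation described in the abstract, the following is a minimal sketch of character N-gram extraction for N ≤ 4 without a dictionary or word segmenter. It is not the authors' implementation: the function name `char_ngrams`, the Unicode range used to recognize Chinese characters, and the choice to break N-grams at punctuation and other non-Chinese characters are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def char_ngrams(text, max_n=4):
    """Count every character N-gram (1 <= N <= max_n) in the Chinese runs of text.

    No dictionary or segmentation is used: each run of consecutive Chinese
    characters is scanned with a sliding window of every length up to max_n.
    """
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for n in range(1, max_n + 1):
            # A run of length L contains L - n + 1 windows of length n.
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts


# Hypothetical input: two runs of Chinese text separated by a comma.
features = char_ngrams("中文网页分类，中文信息")
```

The resulting counts can serve as a sparse feature vector for a classifier. Because N-grams never span the comma, no feature mixes characters from the two runs.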

This work is supported by the 973 High-Tech Projects Foundation of China and partially supported by a grant (No. 69933010) from NSFC.

Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhou, S., Fan, Y., Hu, J., Yu, F., Hu, Y. (2000). Hierarchically Classifying Chinese Web Documents without Dictionary Support and Segmentation Procedure. In: Lu, H., Zhou, A. (eds) Web-Age Information Management. WAIM 2000. Lecture Notes in Computer Science, vol 1846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45151-X_20

  • DOI: https://doi.org/10.1007/3-540-45151-X_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67627-0

  • Online ISBN: 978-3-540-45151-8

  • eBook Packages: Springer Book Archive
