Experimental Study on Representing Units in Chinese Text Categorization

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2588)

Abstract

This paper presents a comparative study of representing units in Chinese text categorization. Several kinds of representing units were investigated: byte 3-grams, Chinese characters, Chinese words, and Chinese words with part-of-speech tags. Empirical evidence shows that when the training data is large enough, higher-level representations, or those with larger feature spaces, perform better than lower-level representations or those with smaller feature spaces, whereas when the training data is limited the conclusion may be reversed. In general, higher-level representations or those with larger feature spaces need more training data to reach their best performance. For any specific representation, however, the size of the training data and the categorization performance are not always positively correlated.
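
To make the four representing units named in the abstract concrete, the following is a minimal sketch of how each could be turned into feature counts. It is an illustration under stated assumptions, not the authors' implementation: the GB2312 encoding, the function names, and the hand-tagged sample are all hypothetical, and word segmentation and part-of-speech tagging are assumed to have been done beforehand by some external tool the abstract does not name.

```python
from collections import Counter


def byte_ngrams(text, n=3, encoding="gb2312"):
    """Count overlapping byte n-grams of the encoded text.

    The encoding is an assumption; the abstract does not state which
    byte encoding the experiments used.
    """
    data = text.encode(encoding)
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def char_units(text):
    """Count individual characters, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


def word_units(tagged_tokens, with_pos=False):
    """Count words, optionally joined with their part-of-speech tag.

    tagged_tokens is a list of (word, pos) pairs produced by an
    external segmenter and tagger (assumed, not shown here).
    """
    if with_pos:
        return Counter(f"{word}/{pos}" for word, pos in tagged_tokens)
    return Counter(word for word, _ in tagged_tokens)


# Hypothetical sample: "Chinese text categorization", tagged by hand.
sample = "中文文本分类"
tagged = [("中文", "n"), ("文本", "n"), ("分类", "v")]

byte_features = byte_ngrams(sample)
char_features = char_units(sample)
word_features = word_units(tagged)
word_pos_features = word_units(tagged, with_pos=True)
```

In this sketch the byte 3-gram and character representations need no linguistic preprocessing, whereas the word-level representations depend on segmentation and tagging quality, which is one reason the choice of unit interacts with the amount of training data available.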

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baoli, L., Yuzhong, C., Xiaojing, B., Shiwen, Y. (2003). Experimental Study on Representing Units in Chinese Text Categorization. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_67

  • DOI: https://doi.org/10.1007/3-540-36456-0_67

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00532-2

  • Online ISBN: 978-3-540-36456-6

  • eBook Packages: Springer Book Archive
