A Comparative Study on Representing Units in Chinese Text Clustering

Hongjun, Wang; Shiwen, Yu; Xueqiang, Lv; Shuicai, Shi; Shibin, Xiao

doi:10.1007/11811220_39

Wang Hongjun, Yu Shiwen, Lv Xueqiang, Shi Shuicai & Xiao Shibin

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4092)

Abstract

Words and n-grams are commonly used units for representing Chinese text, and they have proved to be good features for Chinese text categorization and information retrieval. Their effectiveness for Chinese text clustering, however, remains unexplored. This paper presents a comparative study of representing units in Chinese text clustering. Using the K-means algorithm, we evaluated several representing units, including Chinese character n-gram features, word features, and their combinations. In our experiments, Chinese word features, character unigram features, and character bigram features were the most effective, and combining features did not improve the results. Detailed experimental results on several public Chinese text categorization datasets are provided in the paper.
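As an illustration of the setup the abstract describes, the following is a minimal sketch of clustering short Chinese texts with character unigram and bigram features. It is not the authors' implementation: the function names, the term-frequency weighting, the cosine-similarity (spherical) variant of K-means, the first-k seeding, and the four sample sentences are all assumptions made for the example. Word features would additionally require a word segmenter, which is omitted here.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def normalize(weights):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {g: w / norm for g, w in weights.items()}


def vectorize(text, orders=(1, 2)):
    """Unit-length term-frequency vector over the chosen n-gram orders."""
    return normalize(Counter(g for n in orders for g in char_ngrams(text, n)))


def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def kmeans(vectors, k, iterations=10):
    """Cluster unit vectors by cosine similarity; returns one label per vector."""
    # Assumption: seed with the first k documents so the sketch is deterministic.
    centroids = [dict(v) for v in vectors[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda j: dot(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    total.update(v)
            if total:
                centroids[j] = normalize(total)
    return labels


# Hypothetical sample sentences: two about football, two about the stock market.
docs = ["足球比赛今晚开始", "股票市场今天上涨", "足球比赛结果公布", "股票市场明天开盘"]
labels = kmeans([vectorize(d) for d in docs], k=2)
print(labels)  # [0, 1, 0, 1]
```

Passing `orders=(1,)` or `orders=(2,)` to `vectorize` restricts the representation to unigrams or bigrams alone, which is one way to compare representing units on the same data.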

Author information

Authors and Affiliations

Institute of Computing Linguistics, Peking University, Beijing, 100080
Wang Hongjun & Yu Shiwen
Chinese Information Processing Center, Beijing Information Technology Institute, Beijing, 100101
Wang Hongjun, Lv Xueqiang, Shi Shuicai & Xiao Shibin

Editor information

Editors and Affiliations

IRIT, UPS, F-31062, Toulouse Cédex 9, France
Jérôme Lang
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Fangzhen Lin
Guangxi Normal University, Guilin, China
Ju Wang

About this paper

Cite this paper

Hongjun, W., Shiwen, Y., Xueqiang, L., Shuicai, S., Shibin, X. (2006). A Comparative Study on Representing Units in Chinese Text Clustering. In: Lang, J., Lin, F., Wang, J. (eds) Knowledge Science, Engineering and Management. KSEM 2006. Lecture Notes in Computer Science, vol 4092. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11811220_39

DOI: https://doi.org/10.1007/11811220_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37033-8
Online ISBN: 978-3-540-37035-2
eBook Packages: Computer Science, Computer Science (R0)
