Abstract
This paper evaluates four unsupervised Chinese word clustering methods: maximum mutual information (MMI), function word (FW), high-frequency word (HFW), and word cluster (WC). Two evaluation measures are employed: part-of-speech (POS) precision and semantic precision. Test results show that MMI performs best, reaching 79.09% POS precision and 49.75% semantic precision, while the other three methods exceed 51.09% and 29.78%, respectively. Applying the word clusters generated by these methods to alignment-based automatic Chinese syntactic induction further improves its performance.
This work is supported by the National Natural Science Foundation of China (No. 60473138, No. 60675035) and the National Social Science Foundation of China (No. 05BYY043).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, B., Wang, H. (2006). A Comparative Study on Chinese Word Clustering. In: Matsumoto, Y., Sproat, R.W., Wong, KF., Zhang, M. (eds) Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead. ICCPOL 2006. Lecture Notes in Computer Science, vol 4285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11940098_16
DOI: https://doi.org/10.1007/11940098_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49667-0
Online ISBN: 978-3-540-49668-7
eBook Packages: Computer Science, Computer Science (R0)