Word Extraction Based on Semantic Constraints in Chinese Word-Formation

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)


This paper presents a novel approach to Chinese word extraction based on the semantic information of characters. A thesaurus of Chinese characters is constructed. A Chinese lexicon of 63,738 two-character words, together with this thesaurus, is used to learn the semantic constraints between characters in Chinese word-formation, which are modeled as a semantic-tag-based HMM. The Baum-Welch re-estimation scheme is then used to train the HMM parameters in an unsupervised manner. Various statistical measures for estimating the likelihood that a character string is a word are also tested. Large-scale experiments show promising results: the F-score of this word extraction method reaches 68.5%, whereas its counterpart, the character-based mutual information method, reaches only 47.5%.
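To make the two kinds of scoring named in the abstract concrete, here is a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: `char_bigram_pmi` uses the standard pointwise mutual information over adjacent characters as a stand-in for the character-based baseline, and `hmm_string_likelihood` uses the textbook forward algorithm over hidden semantic tags as a stand-in for scoring with a semantic-tag-based HMM. The function names, the toy probability tables, and the threshold are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_bigram_pmi(corpus):
    """Return {bigram: PMI in bits} for adjacent character pairs.

    Assumption: standard pointwise mutual information,
    log2( P(xy) / (P(x) * P(y)) ), estimated by relative frequency.
    """
    char_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        bigram_counts.update(
            sentence[i:i + 2] for i in range(len(sentence) - 1)
        )
    n_chars = sum(char_counts.values())
    n_bigrams = sum(bigram_counts.values())
    pmi = {}
    for bigram, count in bigram_counts.items():
        p_xy = count / n_bigrams
        p_x = char_counts[bigram[0]] / n_chars
        p_y = char_counts[bigram[1]] / n_chars
        pmi[bigram] = math.log2(p_xy / (p_x * p_y))
    return pmi


def extract_candidates(corpus, threshold):
    """Keep bigrams whose PMI meets a (hypothetical) threshold."""
    pmi = char_bigram_pmi(corpus)
    return sorted(b for b, score in pmi.items() if score >= threshold)


def hmm_string_likelihood(chars, start, trans, emit):
    """Likelihood of a character string under a tag-based HMM.

    Hidden states are semantic tags; characters are the observations.
    start[t]     : P(first tag = t)
    trans[t1][t2]: P(next tag = t2 | current tag = t1)
    emit[t][c]   : P(character = c | tag = t)
    Uses the forward algorithm, summing over all tag sequences.
    """
    tags = list(start)
    # Initialise with the first character.
    alpha = {t: start[t] * emit[t].get(chars[0], 0.0) for t in tags}
    # Fold in each following character.
    for ch in chars[1:]:
        alpha = {
            t2: sum(alpha[t1] * trans[t1].get(t2, 0.0) for t1 in tags)
            * emit[t2].get(ch, 0.0)
            for t2 in tags
        }
    return sum(alpha.values())
```

In this sketch the baseline can only score a character pair it has counted in the corpus, while the tag-based score depends on the characters' semantic tags, so a pair whose tags combine plausibly can receive a non-zero likelihood even if the pair itself was never observed. In practice the HMM tables would come from training (the abstract names Baum-Welch re-estimation) rather than being written by hand.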




Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, M., Luo, S., T’sou, B.K. (2005). Word Extraction Based on Semantic Constraints in Chinese Word-Formation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg.

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24523-0

  • Online ISBN: 978-3-540-30586-6

  • eBook Packages: Computer Science, Computer Science (R0)
