Automatic Recognition of Chinese Separable Words Based on CRFs

  • Conference paper
Chinese Lexical Semantics (CLSW 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11831)

Abstract

Most current approaches to the automatic recognition of separable words are rule-based: they rely on automatic word segmentation results and on lexical patterns generated from common inserted constituents. These approaches suffer from incorrect word segmentation and from inaccurate, limited rules, and they ignore the rich information contained in the context. To address these issues, this paper proposes a CRFs-based method that employs nine features, including character, POS tag, punctuation, word boundary, keyword, and POS sequential rule. Experimental results on real-world datasets show that the approach makes fuller use of contextual information and achieves significant improvements in recognition efficiency over all the baselines.
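As a rough illustration of the kind of per-character features the abstract names, the sketch below expands a segmented, POS-tagged sentence into one feature dictionary per character. It is not the authors' feature template: the function name `char_features`, the POS tags, the `KEYWORDS` set, the S/B/M/E word-boundary coding, and the example sentence are all assumptions made for illustration, and the POS sequential rule feature is omitted because the abstract does not define it.

```python
# Hypothetical punctuation inventory and keyword list; the paper's actual
# lists are not given in this excerpt.
PUNCTUATION = set("，。！？；：、")
KEYWORDS = {"了", "过", "着", "个"}


def char_features(tagged_words):
    """Expand (word, pos) pairs into one feature dict per character."""
    rows = []
    for word, pos in tagged_words:
        for i, ch in enumerate(word):
            # Assumed word-boundary coding: S = single-character word,
            # B = first character, E = last character, M = middle.
            if len(word) == 1:
                boundary = "S"
            elif i == 0:
                boundary = "B"
            elif i == len(word) - 1:
                boundary = "E"
            else:
                boundary = "M"
            rows.append({
                "char": ch,
                "pos": pos,  # POS tag of the containing word
                "is_punct": ch in PUNCTUATION,
                "boundary": boundary,
                "is_keyword": ch in KEYWORDS,
            })
    return rows


# Illustrative input (not from the paper): "He took a shower."
rows = char_features(
    [("他", "r"), ("洗", "v"), ("了", "u"), ("一个", "m"), ("澡", "n"), ("。", "w")]
)
for row in rows:
    print(row)
```

Rows of this shape could then be passed to a CRF toolkit as the observation sequence, with the BIEO marks as the label sequence.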

Notes

  1. As combined cases can be directly recognized by word segmentation.

  2. * refers to characters other than separated morphemes.

  3. More precisely, in some cases, A and B are joined with other morphemes to form different words, as shown in (2). Whether A/B is a word or XA/BX is a word, it has a meaning distinct from that of the separable word “AB”.

  4. All examples are collected from [11]. Positive separated forms are underlined, whereas negative ones are not.

  5. (3) is given by the following Chinese automatic word segmentation tools: jieba (https://pypi.org/project/jieba/), Thulac (THU Lexical Analyzer for Chinese, http://thulac.thunlp.org/), and CUCBst (中国传媒大学文本切分标注系统, the Communication University of China text segmentation and tagging system). All three tools produced the wrong segmentation “面向/v (to face)”, whereas the correct one is “面/n (presence) 向/p (to)”. The former is a single verb; the latter is two words, a noun and a preposition.

  6. Please refer to Sect. 4.1 for details.

  7. BIEO marks: “B” represents the beginning or the left boundary of the separable word; “I” represents the inserted constituents in the middle; “E” represents the end or the right boundary of the separable word; “O” represents the outside or non-separable constituents.

  8. The meanings of words are as follows: 洗澡 to take a shower; 睡觉 to sleep; 放假 to have a vacation; 请假 to ask for leave; 上课 to go to class; 跳舞 to dance; 起床 to get up; 握手 to shake hands; 照相 to take a picture; 下课 to get out of class; 见面 to meet; 看病 to see a doctor; 上学 to go to school.
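The BIEO marks described in Note 7 can be illustrated with a minimal Python sketch. It assumes character-level labeling and a known span for the separated form. The helper name `bieo_tags` and the example sentence 他洗了一个澡 ("He took a shower", a separated form of 洗澡) are constructed for illustration and are not taken from the paper.

```python
def bieo_tags(chars, start, end):
    """Return BIEO labels for a sequence of characters.

    start: index of the left boundary of the separable word  -> "B"
    end:   index of the right boundary of the separable word -> "E"
    Characters strictly between them are inserted constituents -> "I".
    Everything else is outside the separable word -> "O".
    """
    if not 0 <= start < end < len(chars):
        raise ValueError("expected 0 <= start < end < len(chars)")
    tags = ["O"] * len(chars)
    tags[start] = "B"
    tags[end] = "E"
    for i in range(start + 1, end):
        tags[i] = "I"
    return tags


# Illustrative sentence: 洗 ... 澡 is separated by the insertion 了一个.
sentence = "他洗了一个澡"
tags = bieo_tags(list(sentence), start=1, end=5)
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Under this labeling, recognizing a separated form amounts to finding a B, zero or more I, and an E in the predicted label sequence.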

References

  1. Bo, L.: Research on automatic recognition of separable words based on corpus. Master thesis, Hebei University, Baoding (2015)

  2. Haibo, R., Gang, W.: The analysis of modern Chinese separated forms based on corpus. Lang. Sci. 06, 75–87 (2005)

  3. Jiaojiao, Z., Endong, X.: Automatic recognition of separable words based on BCC. J. Chin. Inf. Sci. 31(01), 75–83+93 (2017)

  4. Weihua, Z.: Information processing oriented researches on the semantic collocation between the verbs and objects in modern Chinese language. Master thesis, Central China Normal University, Beijing (2007)

  5. Aiping, F.: Word recognition in source language analysis of Chinese-English machine translation. J. Chin. Inf. Process. 05, 7–13 (1999)

  6. ChunXia, W.: Researches of separable words based on corpus. Master thesis, Beijing University of Language and Culture, Beijing (2001)

  7. Haifeng, W.: The study on separable words’ separated form function of modern Chinese. Master thesis, Beijing Language and Culture University, Beijing (2008)

  8. Silei, H.: Automatic identification of Chinese prepositional phrase based on CRF. Master thesis, Dalian University of Technology, Dalian (2008)

  9. Gaohui, H., Tianfang, Y., Quansheng, L.: Mining Chinese comparative sentences and relations based on CRF algorithm. Appl. Res. Comput. 27(06), 2061–2064 (2010)

  10. Lafferty, J., McCallum, A., Pereira, F., et al.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning 2001, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001)

  11. BCC. http://bcc.blcu.edu.cn/. Accessed 21 May 2019

  12. Office of Examination Center of National Chinese Proficiency Examination Committee: Modern Chinese Dictionary, 5th edn. Economic Science Press, Beijing (2001)

  13. CCRMA. http://www.jubenwei.com/. Accessed 21 May 2019

  14. Institute of Linguistics, CASS: Modern Chinese Dictionary, 6th edn. The Commercial Press, Beijing (2012)

  15. Song, T., Peng, W., Song, J., Guo, D., He, J.: The construction of sentence-based diagrammatic treebank. In: Chinese Lexical Semantics. LNCS (LNAI), vol. 10085, pp. 306–314. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49508-8_29

  16. Daoqin, W., Zhongchu, L.: On grammatical features and distinctive principles of split words. Soc. Sci. J. Xiangtan Polytechnic Univ. 03(03), 47–50 (2001)

  17. Qinghui, Y.: Dictionary of the Usage of Separable Words, 1st edn. Beijing Normal University Press, Beijing (1995)

Acknowledgments

This research is supported by the National Natural Science Foundation of China (No. 61877004) and the Natural Science Foundation for the Higher Education Institutions of Anhui Province of China (No. KJ2019A0592).

Author information

Corresponding author

Correspondence to Weiming Peng.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Dong, N., Peng, W. (2020). Automatic Recognition of Chinese Separable Words Based on CRFs. In: Hong, JF., Zhang, Y., Liu, P. (eds) Chinese Lexical Semantics. CLSW 2019. Lecture Notes in Computer Science, vol 11831. Springer, Cham. https://doi.org/10.1007/978-3-030-38189-9_47

  • DOI: https://doi.org/10.1007/978-3-030-38189-9_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-38188-2

  • Online ISBN: 978-3-030-38189-9

  • eBook Packages: Computer Science, Computer Science (R0)