Abstract
Currently, most approaches to the automatic recognition of separable words are rule-based: they rely on automatic word segmentation results and on lexical patterns generated from commonly inserted constituents. However, these approaches suffer from incorrect word segmentation and from inaccurate, limited rules, and they ignore the rich information contained in the context. To address these issues, this paper proposes a CRFs-based method that employs nine features, including character, POS tag, punctuation, word boundary, keyword and POS sequential rule. Experimental results on real-world datasets show that our approach makes fuller use of this contextual information and achieves significant improvements in recognition efficiency over all the baselines.
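To make the feature set more concrete, the following minimal sketch builds one feature dictionary per character, the kind of input that CRF toolkits typically consume. It is an illustration under stated assumptions, not the authors' implementation: the function name `char_features`, the feature keys, the B/M/E/S word-boundary labels, the POS tags in the sample and the punctuation test are all hypothetical, and the POS sequential rule feature is omitted because its definition is not given here.

```python
import unicodedata


def char_features(tagged_words, keyword):
    """Build one feature dict per character (hypothetical feature names).

    tagged_words: list of (word, pos) pairs from a segmenter and POS tagger.
    keyword: the candidate separable word, e.g. "洗澡".
    """
    feats = []
    for word, pos in tagged_words:
        for i, ch in enumerate(word):
            # Word-boundary feature: position of the character inside its
            # segmented word (S = single, B = begin, M = middle, E = end).
            # This is an input feature, not the BIEO output label.
            if len(word) == 1:
                boundary = "S"
            elif i == 0:
                boundary = "B"
            elif i == len(word) - 1:
                boundary = "E"
            else:
                boundary = "M"
            feats.append({
                "char": ch,                      # character feature
                "pos": pos,                      # POS tag of the containing word
                "is_punct": unicodedata.category(ch).startswith("P"),
                "boundary": boundary,            # word-boundary feature
                "in_keyword": ch in keyword,     # keyword feature
            })
    return feats


# Hypothetical segmented and tagged sentence: "He took a shower."
sample = [("他", "r"), ("洗", "v"), ("了", "u"), ("一个", "m"), ("澡", "n"), ("。", "w")]
for f in char_features(sample, "洗澡"):
    print(f)
```

Each dictionary describes a single character, so a sequence labeler can combine the character itself with the surrounding segmentation and POS context rather than trusting the segmenter's word boundaries alone.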
Notes
1. Combined cases can be directly recognized by word segmentation.
2. “*” refers to characters other than the separated morphemes.
3. More precisely, in some cases A and B are joined with other morphemes to form different words, as shown in (2). Whether A/B is a word on its own or XA/BX is a word, its meaning is distinct from that of the separable word “AB”.
4. All examples are collected from [11]. Positive separated forms are underlined, whereas negative ones are not.
5. (3) is the output of the following Chinese automatic word segmentation tools: jieba (https://pypi.org/project/jieba/), Thulac (THU Lexical Analyzer for Chinese, http://thulac.thunlp.org/), and CUCBst (中国传媒大学文本切分标注系统, the Communication University of China text segmentation and tagging system). All three tools wrongly produced “面向/v (to face)”, whereas the correct segmentation is “面/n (presence) 向/p (to)”. The former is a single verb; the latter is two words, a noun and a preposition.
6. Please refer to Sect. 4.1 for details.
7. BIEO marks: “B” represents the beginning or the left boundary of the separable word; “I” represents the inserted constituents in the middle; “E” represents the end or the right boundary of the separable word; “O” represents the outside or non-separable constituents.
8. The meanings of the words are as follows: 洗澡 to take a shower; 睡觉 to sleep; 放假 to have a vacation; 请假 to ask for leave; 上课 to go to class; 跳舞 to dance; 起床 to get up; 握手 to shake hands; 照相 to take a picture; 下课 to get out of class; 见面 to meet; 看病 to see a doctor; 上学 to go to school.
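The BIEO marks described in the notes can be illustrated with a short sketch. It assumes character-level labeling and that the positions of the two separated morphemes are already known; the helper name `bieo_tags` and the example sentence are hypothetical and are not taken from the paper's data.

```python
def bieo_tags(chars, start, end):
    """Assign BIEO marks to a character sequence.

    start: index of the first morpheme (left boundary, "B").
    end:   index of the second morpheme (right boundary, "E").
    Characters strictly between them are inserted constituents ("I");
    everything else is outside the separable word ("O").
    """
    if not 0 <= start < end < len(chars):
        raise ValueError("expected 0 <= start < end < len(chars)")
    tags = ["O"] * len(chars)
    tags[start] = "B"
    tags[end] = "E"
    for i in range(start + 1, end):
        tags[i] = "I"
    return tags


# 洗澡 (to take a shower) in a separated form: 他洗了一个澡 ("He took a shower").
sentence = "他洗了一个澡"
tags = bieo_tags(list(sentence), sentence.index("洗"), sentence.index("澡"))
print(list(zip(sentence, tags)))
# [('他', 'O'), ('洗', 'B'), ('了', 'I'), ('一', 'I'), ('个', 'I'), ('澡', 'E')]
```

Under this scheme, recognizing a separated form amounts to predicting a B ... E span, with the inserted constituents marked I in between.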
References
1. Bo, L.: Research on automatic recognition of separable words based on corpus. Master thesis, Hebei University, Baoding (2015)
2. Haibo, R., Gang, W.: The analysis of modern Chinese separated forms based on corpus. Lang. Sci. 06, 75–87 (2005)
3. Jiaojiao, Z., Endong, X.: Automatic recognition of separable words based on BCC. J. Chin. Inf. Sci. 31(01), 75–83+93 (2017)
4. Weihua, Z.: Information processing oriented researches on the semantic collocation between the verbs and objects in modern Chinese language. Master thesis, Central China Normal University, Beijing (2007)
5. Aiping, F.: Word recognition in source language analysis of Chinese-English machine translation. J. Chin. Inf. Process. 05, 7–13 (1999)
6. ChunXia, W.: Researches of separable words based on corpus. Master thesis, Beijing University of Language and Culture, Beijing (2001)
7. Haifeng, W.: The study on separable words’ separated form function of modern Chinese. Master thesis, Beijing Language and Culture University, Beijing (2008)
8. Silei, H.: Automatic identification of Chinese prepositional phrase based on CRF. Master thesis, Dalian University of Technology, Dalian (2008)
9. Gaohui, H., Tianfang, Y., Quansheng, L.: Mining Chinese comparative sentences and relations based on CRF algorithm. Appl. Res. Comput. 27(06), 2061–2064 (2010)
10. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning 2001, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco (2001)
11. BCC. http://bcc.blcu.edu.cn/. Accessed 21 May 2019
12. Office of Examination Center of National Chinese Proficiency Examination Committee: Modern Chinese Dictionary, 5th edn. Economic Science Press, Beijing (2001)
13. CCRMA. http://www.jubenwei.com/. Accessed 21 May 2019
14. Institute of Linguistics, CASS: Modern Chinese Dictionary, 6th edn. The Commercial Press, Beijing (2012)
15. Song, T., Peng, W., Song, J., Guo, D., He, J.: The construction of sentence-based diagrammatic treebank. In: Chinese Lexical Semantics. LNCS (LNAI), vol. 10085, pp. 306–314. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49508-8_29
16. Daoqin, W., Zhongchu, L.: On grammatical features and distinctive principles of split words. Soc. Sci. J. Xiangtan Polytechnic Univ. 03(03), 47–50 (2001)
17. Qinghui, Y.: Dictionary of the Usage of Separable Words, 1st edn. Beijing Normal University Press, Beijing (1995)
Acknowledgments
This research is supported by the National Natural Science Foundation of China (No. 61877004) and the Natural Science Foundation for the Higher Education Institutions of Anhui Province of China (No. KJ2019A0592).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Dong, N., Peng, W. (2020). Automatic Recognition of Chinese Separable Words Based on CRFs. In: Hong, JF., Zhang, Y., Liu, P. (eds) Chinese Lexical Semantics. CLSW 2019. Lecture Notes in Computer Science, vol 11831. Springer, Cham. https://doi.org/10.1007/978-3-030-38189-9_47
DOI: https://doi.org/10.1007/978-3-030-38189-9_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38188-2
Online ISBN: 978-3-030-38189-9
eBook Packages: Computer Science (R0)