Abstract
YES is a simplified stroke-based method for sorting Chinese characters. It requires no stroke counting or grouping, and is thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete List of Chinese Characters Used by the Media in 2013, and (c) all 13,000-plus characters in the latest versions of the Xinhua Dictionary (v11) and the Contemporary Chinese Dictionary (v6). Of the 20,902 Chinese characters in Unicode, 97.23% have a one-to-one relationship with their stroke order codes in YES, compared with 90.69% for the traditional method. With stroke layout and Unicode value added as secondary and tertiary sorting levels, the table guarantees a one-to-one relationship between characters and collation elements. The collation element table has been successfully applied to sorting CC-CEDICT, a Chinese-English dictionary of over 112,000 word entries.
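The three sorting levels described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table `TOY_TABLE`, the functions `collation_element` and `sort_key`, and all codes in the table are hypothetical placeholders (Latin letters stand in for Chinese characters, and the stroke codes are made up rather than real YES codes). The sketch also compares keys character by character, which is a simplification; the Unicode Collation Algorithm compares each level across the whole string.

```python
# Hypothetical collation element table: character -> (stroke-order code, layout code).
# Latin letters stand in for Chinese characters; the codes are made-up
# placeholders, NOT real YES stroke order codes.
TOY_TABLE = {
    "A": ("hs", 0),
    "B": ("hs", 1),  # same stroke-order code as "A", different stroke layout
    "C": ("hp", 0),
    "D": ("s", 0),
}


def collation_element(ch):
    """Return a three-level key: stroke-order code, stroke layout, Unicode value."""
    stroke_code, layout = TOY_TABLE[ch]
    # The Unicode code point is unique per character, so the full
    # three-level key is unique even when the first two levels tie.
    return (stroke_code, layout, ord(ch))


def sort_key(word):
    """Build a word-level key from the collation elements of its characters."""
    return tuple(collation_element(ch) for ch in word)


words = ["B", "D", "A", "C", "AB", "AA"]
sorted_words = sorted(words, key=sort_key)
print(sorted_words)
```

In this toy table, "A" and "B" share a primary code and are separated only at the secondary level, while the tertiary level (the code point) ensures that no two characters ever receive the same collation element.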
References
Davis, M., Whistler, K., Scherer, M.: Unicode Technical Standard #10: Unicode Collation Algorithm, version 8.0 (2015). http://www.unicode.org/reports/tr10/
Kleeman, J., Yu, H. (eds.): The Oxford Chinese Dictionary. Oxford University Press, Oxford (2010)
Linguistic Institute of the Chinese Academy of Social Sciences: Xinhua Dictionary (Xinhua Zidian, 新华字典), 11th edn. The Commercial Press, Beijing (2011)
Linguistic Institute of the Chinese Academy of Social Sciences: Contemporary Chinese Dictionary (Xiandai Hanyu Cidian, 现代汉语词典), 6th edn. The Commercial Press, Beijing (2012)
Mair, V.H.: The need for an alphabetically arranged general usage dictionary of Mandarin Chinese: a review article of some recent dictionaries and current lexicographical projects. Sino-Platonic Papers 1, 1–31 (1986)
MDBG: CC-CEDICT Chinese to English Dictionary (2015). http://www.mdbg.net/chindict/chindict.php?page=cedict (Downloaded on 31 January 2015)
National Language Commission of China (国家语委): Standard Stroke Order of Commonly-Used Characters of Modern Chinese (现代汉语通用字笔顺规范). Language & Culture Press (语文出版社), Beijing (1997)
National Language Commission of China (国家语委): The Standard Stroke Order of the GB13000.1 Character Set (GB13000.1 字符集汉字笔顺规范). Shanghai Education Press (上海教育出版社), Shanghai (1999)
National Language Commission of China (国家语委): The Standard (Stroke-Based) Order of the GB13000.1 Character Set (GB13000.1字符集汉字字序(笔画序)规范). Shanghai Education Press, Shanghai (2000)
National Language Commission of China (国家语委): The Standard Bending Strokes of GB13000.1 Character Set (GB13000.1字符集汉字折笔规范). Language & Culture Press, Beijing (2001)
National Language Commission of China (国家语委): Standard List of Commonly-Used Chinese Characters (通用规范汉字表). Language & Culture Press, Beijing (2013)
National Language Commission of China (国家语委): Language Situation in China: 2014. (中国语言生活状况报告(2014)). Commercial Press, Beijing (2014)
Norman, J.: Chinese. Cambridge University Press, Cambridge (1988)
Su, P.: Essentials of Modern Chinese Characters (现代汉字学纲要), 3rd edn. Commercial Press, Beijing (2014)
Sun, C.: Chinese: A Linguistic Introduction. Cambridge University Press, Cambridge (2006). (Ch. 5 Chinese writing)
The Unicode Consortium: The Unicode Standard, Version 8.0. The Unicode Consortium, Mountain View, CA (2015). (http://www.unicode.org/versions/Unicode8.0.0/)
Yong, H., Luo, Z., Zhang, X.: Chinese Dictionaries: Three Millennia. Shanghai Foreign Language Education Press, Shanghai (2010)
Yuan, B., Church, S.K.: Oxford Chinese Mini Dictionary, 2nd edn. Oxford University Press, New York (2008)
Zhang, X.: Duplicate encoding of Chinese characters (中文的同形异码字问题). J. Chin. Inf. Process. 4(29), 233–240 (2015)
Zhang, X., Li, X.: Handbook of the YES Stroke-Based Sorting Method for Chinese Characters (一二三笔顺检字手册). Language & Culture Press, Beijing (2013)
Zhang, X., Li, X.: Integration and optimization of standard Chinese stroke lists (标准笔形表的整合与优化). In: Li, X., Jia, Y., Xu, J. (eds.) Digital Teaching of Chinese Language 2014 (数字化汉语教学 2014), pp. 200–208. Tsinghua University Press, Beijing (2014)
Zhang, X., Li, X., Lun, C.: The YES-CEDICT Chinese Dictionary (一二三汉英大词典, Trial Edition, Sorted by Simplified Chinese). J. Mod. Chin. Lang. Edu. (中文教学现代化学报), 4(1) June 2015. (http://xuebao.eblcu.com/)
Zhang, X., Li, X., Lun, C.: The YES-CEDICT Chinese Dictionary (一二三漢英大詞典, Trial Edition, Sorted by Traditional Chinese). J. Mod. Chin. Lang. Edu. (中文教学现代化学报), 4(1) June 2015. (http://xuebao.eblcu.com/)
Acknowledgements
The project has been partially supported by a University research fund (Account Code: 4-ZZEW). The authors are also very grateful to the three anonymous reviewers, whose valuable comments played an important role in the revision of the paper.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhang, X., Li, X. (2015). Building a Collation Element Table for a Large Chinese Character Set in YES. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_1
DOI: https://doi.org/10.1007/978-3-319-25816-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25815-7
Online ISBN: 978-3-319-25816-4
eBook Packages: Computer Science, Computer Science (R0)