Building a Collation Element Table for a Large Chinese Character Set in YES

  • Conference paper
Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data (CCL 2015, NLP-NABD 2015)

Abstract

YES is a simplified stroke-based method for sorting Chinese characters. It requires no stroke counting or grouping, and is thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete List of Chinese Characters Used by the Media in 2013, and (c) all 13,000-plus characters in the latest versions of the Xinhua Dictionary (v11) and the Contemporary Chinese Dictionary (v6). Of the 20,902 Chinese characters in Unicode, 97.23% have a one-to-one relationship with their stroke order codes in YES, compared with 90.69% for the traditional method. With the secondary and tertiary sorting levels of stroke layout and Unicode value added, a one-to-one relationship between characters and collation elements is guaranteed. The collation element table has been successfully applied to sorting CC-CEDICT, a Chinese-English dictionary of over 112,000 word entries.
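The three sorting levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' table or implementation: the names `TOY_TABLE`, `collation_element`, `sort_characters` and `unique_primary_fraction` are hypothetical, and the stroke order codes and layout values are invented placeholders rather than real YES data. Only the overall shape (primary stroke order code, secondary stroke layout, tertiary Unicode value) follows the abstract.

```python
from collections import Counter
from typing import NamedTuple


class CollationElement(NamedTuple):
    primary: str    # stroke order code (placeholder letters, NOT real YES codes)
    secondary: int  # stroke layout class (placeholder integers)
    tertiary: int   # Unicode code point, unique for every character


# Hypothetical toy table: character -> (stroke order code, stroke layout).
# All values are invented purely to exercise the three levels.
TOY_TABLE = {
    "一": ("a", 0),
    "丁": ("ab", 1),
    "七": ("ab", 1),
    "三": ("ab", 0),
}


def collation_element(ch: str) -> CollationElement:
    """Build the three-level key for one character."""
    primary, secondary = TOY_TABLE[ch]
    return CollationElement(primary, secondary, ord(ch))


def sort_characters(chars):
    """Sort by primary, then secondary, then tertiary level."""
    return sorted(chars, key=collation_element)


def unique_primary_fraction(table) -> float:
    """Fraction of characters whose primary code is shared with no other."""
    counts = Counter(primary for primary, _ in table.values())
    unique = sum(1 for primary, _ in table.values() if counts[primary] == 1)
    return unique / len(table)


if __name__ == "__main__":
    print(sort_characters(["七", "丁", "三", "一"]))
    print(unique_primary_fraction(TOY_TABLE))
```

Because the tertiary level is the code point itself, no two characters can share a full key, which is the sense in which the one-to-one relationship is guaranteed. With the placeholder values above, 丁 and 七 tie on the first two levels and are separated only by their Unicode values, while `unique_primary_fraction` computes the kind of ratio the abstract reports as 97.23% for the real table.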

Acknowledgements

The project has been partially supported by a University research fund (Account Code: 4-ZZEW). The authors are also very grateful to the three anonymous reviewers, whose valuable comments played an important role in the revision of the paper.

Author information

Corresponding author

Correspondence to Xiaoheng Zhang.

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, X., Li, X. (2015). Building a Collation Element Table for a Large Chinese Character Set in YES. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2015, NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_1

  • DOI: https://doi.org/10.1007/978-3-319-25816-4_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25815-7

  • Online ISBN: 978-3-319-25816-4

  • eBook Packages: Computer Science (R0)
