Abstract
Much work has been done on gender prediction from English names using probabilistic models or traditional machine learning methods. Unlike English and other alphabetic languages, Chinese is written with logosyllabic characters. Previous approaches work quite well for Indo-European languages in general and English in particular; however, their performance deteriorates for Asian languages such as Chinese, Japanese, and Korean. In this work, we focus on Simplified Chinese characters and present a novel approach that incorporates phonetic information (Pinyin) to enhance Chinese word embeddings trained with a BERT model. We compare our method with several previous methods, namely Naive Bayes, GBDT, and Random Forest with fastText word embeddings as features. Quantitative and qualitative experiments demonstrate the superiority of our model, which achieves 93.45% test accuracy. In addition, we release to the community two large-scale gender-labeled datasets used in this study: one with over one million first names and the other with over six million full names.
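To make the idea of adding phonetic information concrete, the following is a minimal sketch of a Naive Bayes baseline that uses both the characters of a given name and their Pinyin as features. It is not the BERT-based model described above, and it uses raw character features rather than fastText embeddings. The names, labels, and the `TOY_PINYIN` lookup table are hypothetical, hand-written illustrations and are not drawn from the released datasets; a real system would obtain Pinyin from a library such as pypinyin.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-written Pinyin lookup covering only the toy names below.
TOY_PINYIN = {
    "伟": "wei", "强": "qiang", "军": "jun", "国": "guo",
    "芳": "fang", "娜": "na", "丽": "li", "秀": "xiu",
}

# Hypothetical toy training data: (given name, label). Not real dataset entries.
TRAIN = [
    ("伟强", "M"), ("国军", "M"), ("军伟", "M"),
    ("丽娜", "F"), ("秀芳", "F"), ("芳丽", "F"),
]


def features(given_name):
    """Return one character feature and, if known, one Pinyin feature per character."""
    feats = []
    for ch in given_name:
        feats.append("char=" + ch)
        if ch in TOY_PINYIN:
            feats.append("pinyin=" + TOY_PINYIN[ch])
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.label_counts[label] += 1
            for f in features(name):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, name):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            # Log prior plus smoothed log likelihood of each known feature.
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            for f in features(name):
                if f not in self.vocab:
                    continue  # skip features never seen in training
                score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit([n for n, _ in TRAIN], [g for _, g in TRAIN])
```

Calling `model.predict("国强")` scores an unseen combination of characters that each appeared in the toy training names. Because Pinyin is shared across homophones, the Pinyin features let evidence from one character carry over to other characters with the same pronunciation, which is the intuition behind adding phonetic information.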
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Jia, J., Zhao, Q. (2019). Gender Prediction Based on Chinese Name. In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11839. Springer, Cham. https://doi.org/10.1007/978-3-030-32236-6_62
DOI: https://doi.org/10.1007/978-3-030-32236-6_62
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32235-9
Online ISBN: 978-3-030-32236-6
eBook Packages: Computer Science, Computer Science (R0)