Abstract
In recent years, Chinese pre-trained language models have achieved significant improvements in fields such as natural language understanding (NLU) and text generation. However, most existing pre-trained language models focus on modern Chinese and ignore the rich semantic information embedded in Chinese characters, especially radical information. To this end, we present RAC-BERT, a language-specific BERT model for ancient Chinese. Specifically, we propose two new radical-based pre-training tasks: (1) replacing masked tokens with random characters sharing the same radical, which mitigates the gap between the pre-training and fine-tuning stages; and (2) predicting the radical of the masked token rather than the original character, which reduces the computational effort. Extensive experiments were conducted on two ancient Chinese NLP datasets. The results show that our model significantly outperforms state-of-the-art models on most tasks, and ablation experiments demonstrate the effectiveness of our approach. The pre-trained model is publicly available at https://github.com/CubeHan/RAC-BERT
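As a rough illustration of the two radical-based pre-training tasks summarized above, the following Python sketch shows one possible masking routine. The radical lookup table, function names, and sampling details are our own assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the radical-based masking idea (illustrative only).
# RADICAL_TABLE, BY_RADICAL, and radical_mask are hypothetical names;
# a real setup would use a full character-to-radical (e.g. Kangxi) table.
import random
from collections import defaultdict

# Toy radical lookup: character -> radical.
RADICAL_TABLE = {"河": "氵", "海": "氵", "江": "氵", "松": "木", "林": "木", "柏": "木"}

# Group characters by radical so we can sample same-radical replacements.
BY_RADICAL = defaultdict(list)
for ch, rad in RADICAL_TABLE.items():
    BY_RADICAL[rad].append(ch)

def radical_mask(tokens, mask_prob=0.15, seed=0):
    """Task 1: replace a selected token with a random character that shares its
    radical (the substitute may coincide with the original character).
    Task 2: the training target for that position is the radical itself."""
    rng = random.Random(seed)
    corrupted, radical_targets = [], []
    for tok in tokens:
        rad = RADICAL_TABLE.get(tok)
        if rad is not None and rng.random() < mask_prob:
            corrupted.append(rng.choice(BY_RADICAL[rad]))  # same-radical substitute
            radical_targets.append(rad)                    # predict the radical only
        else:
            corrupted.append(tok)
            radical_targets.append(None)                   # position not selected: no loss
    return corrupted, radical_targets

if __name__ == "__main__":
    print(radical_mask(list("江河入海"), mask_prob=0.5))
```

Under these assumptions, predicting a radical label instead of the full vocabulary item shrinks the output space of the pre-training head, which is where the claimed reduction in computational effort would come from.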
Acknowledgement
This work is supported by the CAAI-Huawei MindSpore Open Fund (2022037A).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Han, L. et al. (2023). RAC-BERT: Character Radical Enhanced BERT for Ancient Chinese. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol. 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_59
DOI: https://doi.org/10.1007/978-3-031-44696-2_59
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44695-5
Online ISBN: 978-3-031-44696-2
eBook Packages: Computer Science (R0)