Abstract
Classical Chinese poetry has a history of thousands of years and is a precious cultural heritage of humankind. Compared with modern Chinese corpora, it is irrecoverable and specially organized, which makes it difficult for existing pre-trained language models to learn. Moreover, over thousands of years of development, many words in classical Chinese poetry have changed their meanings or fallen out of use, which further limits the ability of existing pre-trained models to learn the semantics of classical Chinese poetry. To address these challenges, we construct a large-scale sememe knowledge graph of classical Chinese poetry (SKG-Poetry), which connects the vocabularies of classical Chinese poetry and modern Chinese. By extracting sememe knowledge from classical Chinese poetry, our model PoetryBERT not only enlarges the irrecoverable pre-training corpus but also enriches the semantics of the vocabulary of classical Chinese poetry, enabling PoetryBERT to be applied successfully to downstream tasks. Specifically, we evaluate our model on two tasks in the field of classical Chinese poetry: poetry theme classification and poetry-to-modern-Chinese translation. Extensive experiments on both tasks demonstrate the effectiveness of the sememe-knowledge-based pre-training model.
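To make the notion of a sememe knowledge graph concrete, the following minimal sketch is an illustration only, not the actual SKG-Poetry schema: the relation name `has_sememe` and all sample entries are assumptions. It shows how words from classical poetry and modern Chinese could be linked through shared sememes.

```python
# Toy sketch of a sememe knowledge graph: words (classical or modern)
# are linked to sememes, and two words are related when their sememe
# sets overlap. Relation name and sample data are illustrative only.
from collections import defaultdict

# (word, relation, sememe) triples; entries are made-up examples.
triples = [
    ("红豆", "has_sememe", "tree"),      # classical poetic word (Ormosia / red bean)
    ("红豆", "has_sememe", "think of"),
    ("相思", "has_sememe", "think of"),  # classical word for longing
    ("思念", "has_sememe", "think of"),  # modern Chinese word for "to miss"
]

sememes_of = defaultdict(set)
for word, _, sememe in triples:
    sememes_of[word].add(sememe)

def related_words(word, graph):
    """Return words that share at least one sememe with `word`."""
    return {
        other for other, sems in graph.items()
        if other != word and sems & graph[word]
    }

print(related_words("红豆", sememes_of))  # expected members: 相思, 思念
```

In this toy graph, the classical word 红豆 reaches the modern word 思念 through the shared sememe "think of", which is the kind of classical-to-modern bridge the abstract describes.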
Notes
- 1. In linguistics, a sememe is defined as the smallest semantic unit of language. Linguists believe that the meanings of all words can be described with a limited set of sememes.
- 2. The word "Ormosia" (红豆, red bean) is annotated with seven sememes: think of, vegetable, part, tree, eat, mean, and embryo; see the sketch after these notes.
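Building on note 2, the sketch below shows one plausible way sememe annotations could be attached to a line of poetry to enrich a pre-training example. The annotation format, the lexicon, and the enrichment step are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch: attach sememe glosses to annotated words in a line of
# classical poetry, so a masked language model sees both the original
# text and the sememes shared with modern Chinese. Format is illustrative.
SEMEME_LEXICON = {
    # From note 2: Ormosia (红豆) carries seven sememes.
    "红豆": ["think of", "vegetable", "part", "tree", "eat", "mean", "embryo"],
}

def enrich_with_sememes(line, lexicon):
    """Append a bracketed sememe gloss after each annotated word."""
    enriched = line
    for word, sememes in lexicon.items():
        if word in enriched:
            gloss = "[" + "|".join(sememes) + "]"
            enriched = enriched.replace(word, word + gloss)
    return enriched

# "红豆生南国" is the opening line of Wang Wei's poem 《相思》.
print(enrich_with_sememes("红豆生南国", SEMEME_LEXICON))
# 红豆[think of|vegetable|part|tree|eat|mean|embryo]生南国
```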