PoetryBERT: Pre-training with Sememe Knowledge for Classical Chinese Poetry

  • Conference paper
  • Data Mining and Big Data (DMBD 2022)

Abstract

Classical Chinese poetry has a history of thousands of years and is a precious cultural heritage of humankind. Compared with the modern Chinese corpus, it is irrecoverable and specially organized, making it difficult for existing pre-trained language models to learn. Moreover, over thousands of years of development, many words in classical Chinese poetry have changed their meanings or fallen out of use, which further limits the capability of existing pre-trained models to learn the semantics of classical Chinese poetry. To address these challenges, we construct a large-scale sememe knowledge graph of classical Chinese poetry (SKG-Poetry), which connects the vocabularies of classical Chinese poetry and modern Chinese. By extracting sememe knowledge from classical Chinese poetry, our model PoetryBERT not only enlarges the irrecoverable pre-training corpus but also enriches the semantics of the vocabulary of classical Chinese poetry, enabling PoetryBERT to be applied successfully to downstream tasks. Specifically, we evaluate our model on two tasks in the field of classical Chinese poetry: poetry theme classification and poetry-to-modern-Chinese translation. Extensive experiments on these two tasks show the effectiveness of the sememe-knowledge-based pre-training model.
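
To make the intuition concrete, here is a minimal sketch (not the authors' released code) of one way sememe knowledge could be fused into the input layer of a BERT-style encoder: each token embedding is combined with the mean embedding of the sememes attached to that word. The word-to-sememe mapping, the SememeFusionEmbedding class, and the additive fusion rule are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: sememe-enhanced input embeddings. The mapping below is a toy
# stand-in for a sememe knowledge graph such as SKG-Poetry, not real data.
import torch
import torch.nn as nn

SEMEMES = ["think of", "vegetable", "part", "tree", "eat", "mean", "embryo"]
SEM2ID = {s: i for i, s in enumerate(SEMEMES)}
WORD_SEMEMES = {"ormosia": ["think of", "tree", "eat"]}  # toy subset

class SememeFusionEmbedding(nn.Module):
    """Adds the mean sememe embedding of each word to its token embedding."""

    def __init__(self, vocab_size, num_sememes, dim):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # One extra row serves as the padding slot for words with fewer sememes.
        self.sem_emb = nn.Embedding(num_sememes + 1, dim, padding_idx=num_sememes)
        self.pad_id = num_sememes

    def forward(self, token_ids, sememe_ids):
        # token_ids: (batch, seq); sememe_ids: (batch, seq, max_sememes), padded.
        tok = self.tok_emb(token_ids)
        sem = self.sem_emb(sememe_ids)                    # (b, s, m, d)
        mask = (sememe_ids != self.pad_id).unsqueeze(-1)  # ignore padding slots
        sem_mean = (sem * mask).sum(2) / mask.sum(2).clamp(min=1)
        return tok + sem_mean                             # fused representation

# Toy usage: a one-word vocabulary with three sememes attached.
vocab = {"ormosia": 0}
emb = SememeFusionEmbedding(vocab_size=1, num_sememes=len(SEMEMES), dim=8)
tok_ids = torch.tensor([[vocab["ormosia"]]])
sem_ids = torch.tensor([[[SEM2ID[s] for s in WORD_SEMEMES["ormosia"]]]])
print(emb(tok_ids, sem_ids).shape)  # torch.Size([1, 1, 8])
```

In the paper's actual pipeline the fused representations would presumably feed a masked-language-model objective over the poetry corpus; this sketch stops at the embedding layer.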

Notes

  1. In linguistics, sememes are defined as the smallest semantic units of a language. Linguists believe that the meanings of all words can be described with a limited set of sememes.

  2. Ormosia belongs to seven sememes: think of, vegetable, part, tree, eat, mean, embryo. (A toy sketch of such word-sememe links follows these notes.)

  3. https://github.com/ymcui/Chinese-Minority-PLM

  4. https://www.gushiwen.cn/

  5. https://sou-yun.cn/

  6. https://www.zdic.net/

  7. https://github.com/ymcui/Chinese-BERT-wwm

  8. https://github.com/shuizhonghaitong/classification_GAT/tree/master/data

  9. https://github.com/THUNLP-AIPoet/CCPM
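
As a toy illustration of how word-sememe links can bridge the classical-poetry and modern-Chinese vocabularies (the role the abstract ascribes to SKG-Poetry), the sketch below stores the graph as plain (word, relation, sememe) triples and looks up words that share a sememe. The triples, English glosses, and helper functions are hypothetical examples, not the released SKG-Poetry data.

```python
# Hedged sketch: a word-sememe graph as triples, bridging two vocabularies
# through shared sememes. All entries are illustrative placeholders.
from collections import defaultdict

TRIPLES = [
    # Classical-poetry word (the example from note 2, subset of its sememes).
    ("ormosia", "has_sememe", "think of"),
    ("ormosia", "has_sememe", "tree"),
    ("ormosia", "has_sememe", "eat"),
    # Hypothetical modern Chinese words.
    ("miss (verb)", "has_sememe", "think of"),
    ("pine", "has_sememe", "tree"),
]

def sememes_of(word):
    """Return the set of sememes attached to a word."""
    return {t for (w, r, t) in TRIPLES if w == word and r == "has_sememe"}

def bridge(classical_word):
    """Find other words sharing at least one sememe with the given word."""
    targets = sememes_of(classical_word)
    shared = defaultdict(set)
    for (w, r, t) in TRIPLES:
        if w != classical_word and r == "has_sememe" and t in targets:
            shared[w].add(t)
    return dict(shared)

print(bridge("ormosia"))  # {'miss (verb)': {'think of'}, 'pine': {'tree'}}
```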

Author information

Corresponding author

Correspondence to Bin Wu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhao, J., Bai, T., Wei, Y., Wu, B. (2022). PoetryBERT: Pre-training with Sememe Knowledge for Classical Chinese Poetry. In: Tan, Y., Shi, Y. (eds) Data Mining and Big Data. DMBD 2022. Communications in Computer and Information Science, vol 1745. Springer, Singapore. https://doi.org/10.1007/978-981-19-8991-9_26

  • DOI: https://doi.org/10.1007/978-981-19-8991-9_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8990-2

  • Online ISBN: 978-981-19-8991-9

  • eBook Packages: Computer Science, Computer Science (R0)
