PoetryBERT: Pre-training with Sememe Knowledge for Classical Chinese Poetry

  • Conference paper
  • Data Mining and Big Data (DMBD 2022)

Abstract

Classical Chinese poetry has a history of thousands of years and is a precious cultural heritage of humankind. Compared with the modern Chinese corpus, it is irrecoverable and specially organized, making it difficult for existing pre-trained language models to learn. Moreover, over thousands of years of development, many words in classical Chinese poetry have changed their meanings or fallen out of use, which further limits the capability of existing pre-trained models to learn the semantics of classical Chinese poetry. To address these challenges, we construct a large-scale sememe knowledge graph of classical Chinese poetry (SKG-Poetry), which connects the vocabularies of classical Chinese poetry and modern Chinese. By extracting sememe knowledge from classical Chinese poetry, our model PoetryBERT not only enlarges the irrecoverable pre-training corpus but also enriches the semantics of the vocabulary of classical Chinese poetry, enabling PoetryBERT to be applied successfully to downstream tasks. Specifically, we evaluate our model on two tasks in the field of classical Chinese poetry: poetry theme classification and poetry-to-modern-Chinese translation. Extensive experiments on these two tasks show the effectiveness of the sememe-knowledge-based pre-training model.
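
To make the intuition concrete, here is a minimal sketch (not the authors' released code) of one way sememe knowledge could be fused into the input layer of a BERT-style encoder: each token embedding is combined with the mean embedding of the sememes attached to that word. The word-to-sememe mapping, the SememeFusionEmbedding class, and the additive fusion rule are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: sememe-enhanced input embeddings. The mapping below is a toy
# stand-in for a sememe knowledge graph such as SKG-Poetry, not real data.
import torch
import torch.nn as nn

SEMEMES = ["think of", "vegetable", "part", "tree", "eat", "mean", "embryo"]
SEM2ID = {s: i for i, s in enumerate(SEMEMES)}
WORD_SEMEMES = {"ormosia": ["think of", "tree", "eat"]}  # toy subset

class SememeFusionEmbedding(nn.Module):
    """Adds the mean sememe embedding of each word to its token embedding."""

    def __init__(self, vocab_size, num_sememes, dim):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # One extra row serves as the padding slot for words with fewer sememes.
        self.sem_emb = nn.Embedding(num_sememes + 1, dim, padding_idx=num_sememes)
        self.pad_id = num_sememes

    def forward(self, token_ids, sememe_ids):
        # token_ids: (batch, seq); sememe_ids: (batch, seq, max_sememes), padded.
        tok = self.tok_emb(token_ids)
        sem = self.sem_emb(sememe_ids)                    # (b, s, m, d)
        mask = (sememe_ids != self.pad_id).unsqueeze(-1)  # ignore padding slots
        sem_mean = (sem * mask).sum(2) / mask.sum(2).clamp(min=1)
        return tok + sem_mean                             # fused representation

# Toy usage: a one-word vocabulary with three sememes attached.
vocab = {"ormosia": 0}
emb = SememeFusionEmbedding(vocab_size=1, num_sememes=len(SEMEMES), dim=8)
tok_ids = torch.tensor([[vocab["ormosia"]]])
sem_ids = torch.tensor([[[SEM2ID[s] for s in WORD_SEMEMES["ormosia"]]]])
print(emb(tok_ids, sem_ids).shape)  # torch.Size([1, 1, 8])
```

In the paper's actual pipeline the fused representations would presumably feed a masked-language-model objective over the poetry corpus; this sketch stops at the embedding layer.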

Notes

  1. In linguistics, sememes are defined as the smallest semantic units of a language. Linguists believe that the meanings of all words can be described with a limited set of sememes.

  2. Ormosia belongs to seven sememes: think of, vegetable, part, tree, eat, mean, embryo. (A toy sketch of such word-sememe links follows these notes.)

  3. https://github.com/ymcui/Chinese-Minority-PLM

  4. https://www.gushiwen.cn/

  5. https://sou-yun.cn/

  6. https://www.zdic.net/

  7. https://github.com/ymcui/Chinese-BERT-wwm

  8. https://github.com/shuizhonghaitong/classification_GAT/tree/master/data

  9. https://github.com/THUNLP-AIPoet/CCPM
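
As a toy illustration of how word-sememe links can bridge the classical-poetry and modern-Chinese vocabularies (the role the abstract ascribes to SKG-Poetry), the sketch below stores the graph as plain (word, relation, sememe) triples and looks up words that share a sememe. The triples, English glosses, and helper functions are hypothetical examples, not the released SKG-Poetry data.

```python
# Hedged sketch: a word-sememe graph as triples, bridging two vocabularies
# through shared sememes. All entries are illustrative placeholders.
from collections import defaultdict

TRIPLES = [
    # Classical-poetry word (the example from note 2, subset of its sememes).
    ("ormosia", "has_sememe", "think of"),
    ("ormosia", "has_sememe", "tree"),
    ("ormosia", "has_sememe", "eat"),
    # Hypothetical modern Chinese words.
    ("miss (verb)", "has_sememe", "think of"),
    ("pine", "has_sememe", "tree"),
]

def sememes_of(word):
    """Return the set of sememes attached to a word."""
    return {t for (w, r, t) in TRIPLES if w == word and r == "has_sememe"}

def bridge(classical_word):
    """Find other words sharing at least one sememe with the given word."""
    targets = sememes_of(classical_word)
    shared = defaultdict(set)
    for (w, r, t) in TRIPLES:
        if w != classical_word and r == "has_sememe" and t in targets:
            shared[w].add(t)
    return dict(shared)

print(bridge("ormosia"))  # {'miss (verb)': {'think of'}, 'pine': {'tree'}}
```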

Author information

Corresponding author

Correspondence to Bin Wu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhao, J., Bai, T., Wei, Y., Wu, B. (2022). PoetryBERT: Pre-training with Sememe Knowledge for Classical Chinese Poetry. In: Tan, Y., Shi, Y. (eds) Data Mining and Big Data. DMBD 2022. Communications in Computer and Information Science, vol 1745. Springer, Singapore. https://doi.org/10.1007/978-981-19-8991-9_26

  • DOI: https://doi.org/10.1007/978-981-19-8991-9_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8990-2

  • Online ISBN: 978-981-19-8991-9

  • eBook Packages: Computer Science, Computer Science (R0)
