Abstract
Synonym discovery is important in a wide variety of concept-related tasks, such as entity/concept mining and industrial knowledge graph (KG) construction. It aims to determine whether two terms refer to the same concept. Existing methods rely on contexts or KGs, which makes them impractical when neither is available. Therefore, this paper proposes ProSyno, a context-free synonym discovery method based on prompt learning that takes Wiktionary, the world's largest freely available dictionary, as its semantic source. Built on a pre-trained language model (PLM), ProSyno uses prompt learning to generalize to other datasets without any fine-tuning, making it well suited to context-free settings and easy to transfer to other fields. Experimental results demonstrate its superiority over state-of-the-art methods.
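To make the abstract's idea concrete, the following is a minimal sketch of context-free, prompt-based synonym scoring with a frozen masked language model over dictionary-style glosses. The cloze template, the "same"/"different" verbalizer words, the synonym_score helper, and the bert-base-uncased backbone are illustrative assumptions, not the actual ProSyno design.

# Minimal sketch: score whether two terms are synonyms from their glosses only,
# using a frozen masked LM and a hand-written cloze prompt (no fine-tuning).
# Template, verbalizer, and model choice are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def synonym_score(term_a: str, gloss_a: str, term_b: str, gloss_b: str) -> float:
    """Return the probability that the PLM fills the relation slot with 'same'
    rather than 'different', given only the two terms and their glosses."""
    prompt = (
        f"{term_a} means {gloss_a}. {term_b} means {gloss_b}. "
        f"The two terms refer to the {tokenizer.mask_token} concept."
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the [MASK] position and read the verbalizer probabilities there.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    probs = logits[0, mask_pos].softmax(dim=-1)
    same_id = tokenizer.convert_tokens_to_ids("same")
    diff_id = tokenizer.convert_tokens_to_ids("different")
    return (probs[same_id] / (probs[same_id] + probs[diff_id])).item()

# Example with made-up Wiktionary-style glosses:
print(synonym_score("car", "a wheeled motor vehicle used for transport",
                    "automobile", "a self-propelled passenger vehicle"))

In practice, learned prompt tokens and a tuned verbalizer would likely replace the hand-written template, but the sketch captures the core idea of judging synonymy from definitions alone, without context sentences or a KG.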
Acknowledgements
This work was supported by the National Key R&D Program of China (2023YFC3304104) and the National Natural Science Foundation of China (Grant No. 62172094).
Ethics declarations
Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.
Additional information
Song Zhang is a PhD candidate at the Institute of Automation, Chinese Academy of Sciences (CAS), China. His research interests include NLP and machine learning.
Lei He is a senior research engineer at the Machine Learning Platform Department of Tencent, China. She received her PhD degree from the Institute of Computing Technology, CAS, China in 2018. Her research interests include NLP and machine learning.
Dong Wang is an algorithm engineer at Tencent, China. He received the MS degree from Tsinghua University, China in 2021. His research interests include NLP, deep learning and KG.
Hongyun Bao is an associate professor at the Institute of Automation, CAS, China. She received her PhD degree from the Institute of Automation, CAS, China in 2013. Her research interests include KG construction and information extraction.
Suncong Zheng is responsible for Tencent's lexical tools and Tencent's large-scale knowledge graph Topbase. He received his PhD degree from the Institute of Automation, CAS, China in 2017 and received an ACL 2017 Outstanding Paper Award. His research interests include information extraction, KB-QA, and recommendation.
Yuqiao Liu is studying for a master's degree at CAS, China. His research interests include recommender systems and data mining.
Baihua Xiao is a professor at the Institute of Automation, CAS, China. He received his BS degree in automatic control from Northwestern Polytechnical University, China in 1995, and his PhD degree in computer science from the Institute of Automation, CAS, China in 2000. His research interests include pattern recognition, computer vision, image processing, and machine learning.
Jiayue Li received his PhD degree in computer science and engineering from The Hong Kong University of Science and Technology, China. He did postdoctoral research at Arizona State University, USA from 2018 to 2019. His research mainly focuses on pattern recognition, medical imaging, and distributed ledger technology.
Dongyuan Lu is a professor at the University of International Business and Economics, China. She received her PhD degree from the Institute of Automation, CAS, China in 2012. Her research interests include data mining and natural language processing.
Nan Zheng is an associate professor at the Institute of Automation, CAS, China. She received her PhD degree from the Institute of Automation, CAS, China in 2012. Her research interests include data mining and machine learning. She was a visiting scholar at the University of California, Berkeley, USA in 2019.
About this article
Cite this article
Zhang, S., He, L., Wang, D. et al. ProSyno: context-free prompt learning for synonym discovery. Front. Comput. Sci. 19, 196317 (2025). https://doi.org/10.1007/s11704-024-3900-z