Abstract
In the era of large language models, capturing fine-grained semantics remains critical, because these models often overlook subtle semantic nuances. Sememes, the smallest units of meaning, are essential for enriching semantic representations. However, existing sememe prediction methods rely solely on intrinsic word features or dictionary definitions and neglect the potential of subword information to bridge the gap between the two. As a result, they perform poorly on out-of-vocabulary (OOV) and low-frequency words. To address this, we propose the Sememe Prediction through Semantic Synthesis (SPSY) framework, which integrates subword-level information with dictionary definitions. This approach enhances sensitivity to subtle semantic variations and significantly improves prediction accuracy. Evaluations on the HowNet and WordNet datasets show that our framework outperforms existing models, achieving a 2.91% gain in mean average precision (MAP) on the Chinese dataset and a 5.54% gain on the English dataset. It also achieves state-of-the-art performance, surpassing previous models by at least 2.88% across all word frequencies and by 4.11% for OOV words. Furthermore, the framework demonstrates its versatility through successful applications in industrial knowledge graph verification and entity recognition.
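For readers unfamiliar with the reported metric, mean average precision over ranked sememe predictions can be sketched as follows. This is a minimal illustration of the standard definition, not the evaluation code used in this work; the function names and the toy words and sememe labels are hypothetical, and each ranked list is assumed to contain no duplicate sememes.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average precision of one ranked sememe list against its gold set.

    Assumes `ranked` contains no duplicate sememes.
    """
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, sememe in enumerate(ranked, start=1):
        if sememe in gold:
            hits += 1
            # Precision at this rank, counted only where a gold sememe appears.
            precision_sum += hits / rank
    return precision_sum / len(gold)


def mean_average_precision(
    predictions: Sequence[Sequence[str]], gold_sets: Sequence[Set[str]]
) -> float:
    """Mean of the per-word average precision scores."""
    scores = [average_precision(r, g) for r, g in zip(predictions, gold_sets)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: hypothetical sememe labels, not taken from HowNet or WordNet.
predictions = [
    ["human", "occupation", "education", "place"],
    ["animal", "food", "beast"],
]
gold_sets = [{"human", "education"}, {"beast"}]

map_score = mean_average_precision(predictions, gold_sets)
print(f"MAP = {map_score:.4f}")
```

In this toy case the first word scores (1/1 + 2/3) / 2 and the second scores 1/3, so a model that ranks gold sememes nearer the top of the list obtains a higher MAP.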
Acknowledgements
This research is supported by the National Key Research and Development Program of China (2020AAA0109300) and the Shanghai Collaborative Innovation Center of Data Intelligence Technology (No. 0232-A1-8900-24-13).
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethics approval
We confirm that this manuscript has not been published elsewhere and is not under consideration by another journal. All authors have approved the manuscript and agree with its submission to The Journal of Supercomputing.
Appendix A List of acronyms
This appendix lists all acronyms used throughout the paper together with their full forms (Table 9), to help readers follow the terminology.
About this article
Cite this article
Wen, T., Hu, J., Zhao, J. et al. SPSY: a semantic synthesis framework for lexical sememe prediction and its applications. J Supercomput 81, 552 (2025). https://doi.org/10.1007/s11227-025-07070-8
DOI: https://doi.org/10.1007/s11227-025-07070-8