Abstract
Measuring the semantic similarity between two texts is a fundamental aspect of text semantic matching. Each word in a text carries a different weight of meaning, and it is essential for a model to capture the most important knowledge effectively. However, current BERT-based text matching methods have limitations in acquiring professional domain knowledge: BERT requires extensive domain-specific training data to perform well in specialized fields such as medicine, where labeled data is difficult to obtain. In addition, text matching models that inject domain knowledge often rely on creating new training tasks to fine-tune the model, which is time-consuming. Although existing works have injected domain knowledge directly into BERT through similarity matrices, they struggle with the small sample sizes typical of professional fields. Contrastive learning trains a representation model by constructing similar and dissimilar instances, so that more general representations can be learned from a small number of samples. In this paper, we propose to integrate the word similarity matrix directly into BERT’s multi-head attention mechanism under a contrastive learning framework, aligning similar words during training. Furthermore, for Chinese medical applications, we propose an entity MASK approach that enhances the pre-trained model’s understanding of medical terms. The proposed method helps BERT acquire domain knowledge and learn better text representations in professional fields. Extensive experimental results show that the algorithm significantly improves the performance of text matching models, especially when training data is limited.
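To make the core idea concrete, the following is a minimal PyTorch-style sketch (not the authors' released code) of how an external word-similarity matrix could be injected into BERT-style multi-head attention. The function name, the additive injection point, and the `alpha` weight are illustrative assumptions; the paper's exact formulation may differ.

```python
# Sketch: biasing scaled dot-product attention with an external
# word-similarity prior (illustrative assumptions, not the paper's exact method).
import math
import torch
import torch.nn.functional as F


def similarity_biased_attention(q, k, v, sim_matrix, alpha=1.0, mask=None):
    """Scaled dot-product attention with an additive similarity prior.

    q, k, v    : (batch, heads, seq_len, head_dim) query/key/value tensors
    sim_matrix : (batch, seq_len, seq_len) external word-similarity scores,
                 e.g. derived from a domain lexicon or knowledge base
    alpha      : assumed scalar hyperparameter weighting the prior
    mask       : optional (batch, 1, 1, seq_len) attention mask
    """
    d_k = q.size(-1)
    # Standard attention logits.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    # Inject the external similarity prior, broadcast across heads.
    scores = scores + alpha * sim_matrix.unsqueeze(1)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)


# Toy usage: 2 sentences, 4 heads, 8 tokens, 16-dim heads.
q = k = v = torch.randn(2, 4, 8, 16)
sim = torch.rand(2, 8, 8)   # stand-in for a real word-similarity matrix
out = similarity_biased_attention(q, k, v, sim, alpha=0.5)
print(out.shape)            # torch.Size([2, 4, 8, 16])
```

In this sketch, token pairs judged similar by the external matrix receive a boost to their attention logits, which is one straightforward way to let a domain lexicon align related words during training.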



Data Availability
The data generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 62176221, 62272398), the Sichuan Science and Technology Program (Nos. 2023YFG0354, MZGC20230073, 2024YFHZ0024), the Key Research and Development Program of Sichuan Province (No. 2022NSFSC0502), and the 2023 Southwest Jiaotong University International Student Education Management Research Project (No. 23LXSGL01).
Author information
Authors and Affiliations
Contributions
Conception and design of study: Jie Hu; Acquisition of data and data curation: Jie Hu and Yinglian Zhu; Analysis and/or interpretation of data: Jie Hu and Lishan Wu; Drafting the manuscript: Jie Hu and Lishan Wu; Critical revision: Jie Hu, Yinglian Zhu, Qilei Luo, Fei Teng and Tianrui Li.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, J., Zhu, Y., Wu, L. et al. Text semantic matching algorithm based on the introduction of external knowledge under contrastive learning. Int. J. Mach. Learn. & Cyber. 16, 741–753 (2025). https://doi.org/10.1007/s13042-024-02285-2