Text semantic matching algorithm based on the introduction of external knowledge under contrastive learning

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Measuring the semantic similarity between two texts is a fundamental aspect of text semantic matching. Each word in a text carries a different weight of meaning, so the model must effectively capture the most important knowledge. However, current BERT-based text matching methods have limited ability to acquire professional domain knowledge: BERT needs extensive domain-specific training data to perform well in specialized fields such as medicine, where labeled data is hard to obtain. In addition, text matching models that inject domain knowledge often rely on creating new training tasks to fine-tune the model, which is time-consuming. Although existing work injects domain knowledge directly into BERT through similarity matrices, it still struggles with the small sample sizes typical of professional fields. Contrastive learning trains a representation model by constructing similar and dissimilar instance pairs, allowing a more general representation to be learned from a small number of samples. In this paper, we propose to integrate the word similarity matrix directly into BERT's multi-head attention mechanism under a contrastive learning framework, so that similar words are aligned during training. Furthermore, for Chinese medical applications, we propose an entity MASK approach that improves the pre-trained model's understanding of medical terms. The proposed method helps BERT acquire domain knowledge and learn better text representations in professional fields. Extensive experimental results show that the algorithm significantly improves the performance of the text matching model, especially when training data is limited.
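The abstract does not spell out the exact formulation, but the core ideas can be sketched concretely. The snippet below is a minimal illustration, assuming the word similarity matrix is added as a bias to the attention logits before the softmax, that contrastive training uses a standard InfoNCE objective, and that entity masking replaces annotated medical-entity spans with the [MASK] token; the function names, the `alpha` weight, and the `temperature` value are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def similarity_biased_attention(q, k, v, sim_matrix, alpha=1.0):
    """Scaled dot-product attention with an additive word-similarity bias.

    q, k, v:    (batch, heads, seq_len, head_dim) projections.
    sim_matrix: (batch, seq_len, seq_len) external word-similarity scores,
                e.g. from a domain lexicon; zero where no prior is known.
    alpha:      weight of the external-knowledge bias (illustrative knob).
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, H, L, L)
    scores = scores + alpha * sim_matrix.unsqueeze(1)         # broadcast bias over heads
    attn = F.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

def info_nce_loss(z1, z2, temperature=0.05):
    """InfoNCE loss over two views of the same batch of sentences.

    z1, z2: (batch, dim) sentence embeddings; matching rows are positive pairs.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                  # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)              # positives lie on the diagonal

def mask_medical_entities(token_ids, entity_spans, mask_id):
    """Replace every token inside an annotated medical-entity span with [MASK]."""
    masked = token_ids.clone()
    for start, end in entity_spans:                     # spans are (start, end) token indices
        masked[start:end] = mask_id
    return masked
```

Under these assumptions, the bias term raises the attention weight between word pairs that the external knowledge source marks as similar, while the contrastive objective and entity masking shape the sentence representations with only a small amount of labeled data.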


Data Availability

The data generated and/or analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 62176221, 62272398), the Sichuan Science and Technology Program (Nos. 2023YFG0354, MZGC20230073, 2024YFHZ0024), the Key Research and Development Program of Sichuan Province (No. 2022NSFSC0502), and the 2023 Southwest Jiaotong University International Student Education Management Research Project (No. 23LXSGL01).

Author information


Contributions

Conception and design of study: Jie Hu; Acquisition of data and data curation: Jie Hu and Yinglian Zhu; Analysis and/or interpretation of data: Jie Hu and Lishan Wu; Drafting the manuscript: Jie Hu and Lishan Wu; Critical revision: Jie Hu, Yinglian Zhu, Qilei Luo, Fei Teng and Tianrui Li.

Corresponding author

Correspondence to Jie Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hu, J., Zhu, Y., Wu, L. et al. Text semantic matching algorithm based on the introduction of external knowledge under contrastive learning. Int. J. Mach. Learn. & Cyber. 16, 741–753 (2025). https://doi.org/10.1007/s13042-024-02285-2
