
Text semantic matching with an enhanced sample building method based on contrastive learning

  • Original Article
  • Published in International Journal of Machine Learning and Cybernetics

Abstract

Text semantic matching aims to determine whether two pieces of text express the same meaning, and it has been widely applied in clinical terminology standardization, recommendation systems, and other scenarios. Recently, many methods have introduced the idea of contrastive learning to construct positive and negative sample pairs for text semantic matching tasks. These methods first construct positive samples by data augmentation and then treat the other samples within the same batch as negative samples. However, mainstream data augmentation methods such as dropout ignore the influence of sentence length and structure, while the word repetition method is relatively complex to implement. On the other hand, a sufficient number of negative samples is also crucial to the quality of model training. In this paper, we propose an enhanced sample building method (ESNCSE) to construct positive and negative samples for text semantic matching tasks. To generate positive sample pairs, we randomly insert punctuation marks into the original text, which adds noise simply and efficiently. To expand the number of negative samples without increasing the computational cost, we apply momentum contrast on top of the sentence embedding method with soft negative samples (SNCSE). Experimental results on text semantic similarity tasks show an average Spearman correlation coefficient of 79.74% for BERT-base and 80.64% for BERT-large.
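The punctuation-insertion step described above is straightforward to reproduce. The following is a minimal sketch of how such a positive-sample builder might look, in the spirit of AEDA [6]; the punctuation set, insertion ratio, and function name are illustrative assumptions rather than the authors' exact implementation:

```python
import random

# Candidate punctuation marks to inject as light noise (assumed set).
PUNCTUATION = [".", ",", ";", ":", "?", "!"]

def insert_punctuation(tokens, ratio=0.15, rng=random):
    """Return a noised copy of `tokens` with punctuation marks inserted
    at random positions; the input list is left unchanged."""
    n_insert = max(1, int(len(tokens) * ratio))
    noised = list(tokens)
    for _ in range(n_insert):
        pos = rng.randint(0, len(noised))  # randint is inclusive, so inserting at the end is allowed
        noised.insert(pos, rng.choice(PUNCTUATION))
    return noised

# Usage: a sentence and its noised copy form a positive pair.
sentence = "a man is playing a guitar".split()
positive = insert_punctuation(sentence)
print(" ".join(sentence), "=>", " ".join(positive))
```

Unlike dropout-based augmentation, this keeps every original token and only dilutes the sentence with semantically empty symbols, so the positive pair differs in surface form but not in meaning or length structure.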
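For the negative-sample side, momentum contrast maintains a queue of sentence embeddings produced by a slowly updated copy of the encoder, so past batches can be reused as extra negatives without additional forward passes. A minimal sketch follows, assuming PyTorch; the queue size, embedding dimension, and momentum coefficient are illustrative assumptions, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

class MomentumQueue:
    """FIFO queue of L2-normalized sentence embeddings reused as negatives."""
    def __init__(self, dim=768, size=2560):
        self.queue = F.normalize(torch.randn(size, dim), dim=1)  # placeholder init
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        # Overwrite the oldest entries with the newest momentum-encoder outputs.
        keys = F.normalize(keys, dim=1)
        idx = torch.arange(self.ptr, self.ptr + keys.size(0)) % self.queue.size(0)
        self.queue[idx] = keys
        self.ptr = int((self.ptr + keys.size(0)) % self.queue.size(0))

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m=0.995):
    # theta_k <- m * theta_k + (1 - m) * theta_q, applied after each training step.
    for p_q, p_k in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```

Because the queued embeddings are produced without gradients, enlarging the pool of negatives in this way adds memory but essentially no extra computation, which matches the motivation stated in the abstract.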


References

  1. Yuejun S, Zhiqiang L, Zhihao Y (2021) Standardization of clinical terminology based on BERT. Chin J Inf 35:47582


  2. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the Association for Computational Linguistics: human language technologies, pp 4171–4186

  3. Yan Y, Li R, Wang S, Zhang F, Xu W (2021) ConSERT: a contrastive framework for self-supervised sentence representation transfer. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers)

  4. Wu X, Gao C, Zang L, Han J, Wang Z, Hu S (2021) ESimCSE: enhanced sample building method for contrastive learning of unsupervised sentence embedding. arXiv:2109.04380 (arXiv preprint)

  5. Gao T, Yao X, Chen D (2021) SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 6894–6910

  6. Karimi A, Rossi L, Prati A (2021) AEDA: an easier data augmentation technique for text classification. In: Proceedings of the 2021 conference on empirical methods in natural language processing, pp 2748–2754

  7. Wang H, Li Y, Huang Z, Dou Y, Kong L, Shao J (2022) SNCSE: contrastive learning for unsupervised sentence embedding with soft negative samples. arXiv:2201.05979 (arXiv preprint)

  8. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692 (arXiv preprint)

  9. Li B, Zhou H, He J, Wang M, Yang Y, Li L (2020) On the sentence embeddings from pre-trained language models. arXiv:2011.05864 (arXiv preprint)

  10. Su J, Cao J, Liu W, Ou Y (2021) Whitening sentence representations for better semantics and faster retrieval. arXiv:2103.15316 (arXiv preprint)

  11. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G (2021) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. arXiv:2107.13586 (arXiv preprint)

  12. van den Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748 (arXiv preprint)

  13. Agirre E, Diab M, Cer D, Gonzalez-Agirre A (2012) SemEval-2012 Task 6: a pilot on semantic textual similarity. In: First joint conference on lexical and computational semantics (*SEM), volume 2: proceedings of the main conference and the shared task: semantic textual similarity, pp 385–393

  14. Agirre E, Cer D, Diab M, Gonzalez-Agirre A, Guo W (2013) *SEM 2013 shared task: semantic textual similarity. In: Second joint conference on lexical and computational semantics (*SEM), volume 1: proceedings of the main conference and the shared task: semantic textual similarity, pp 32–43

  15. Agirre E, Banea C, Cardie C, Cer DM, Diab MT, Gonzalez-Agirre A, Guo W, Mihalcea R, Rigau G, Wiebe J (2014) SemEval-2014 Task 10: multilingual semantic textual similarity. In: SemEval@COLING, pp 81–91

  16. Agirre E, Banea C, Cardie C, Cer D, Diab M, Gonzalez-Agirre A, Guo W, Lopez-Gazpio I, Maritxalar M, Mihalcea R (2015) SemEval-2015 Task 2: semantic textual similarity, English, Spanish and pilot on interpretability. In: Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp 252–263

  17. Agirre E, Banea C, Cer D, Diab M, Gonzalez-Agirre A, Mihalcea R, Rigau G, Wiebe J (2016) SemEval-2016 Task 1: semantic textual similarity, monolingual and cross-lingual evaluation. In: Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016), pp 497–511

  18. Cer D, Diab M, Agirre E, Lopez-Gazpio I, Specia L (2017) SemEval-2017 Task 1: semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv:1708.00055 (arXiv preprint)

  19. Marelli M, Menini S, Baroni M, Bentivogli L, Bernardi R, Zamparelli R (2014) A SICK cure for the evaluation of compositional distributional semantic models. In: Proceedings of the 9th international conference on language resources and evaluation (LREC'14), pp 216–223


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos: 62176221, 62272398, 62276215), and the key research and development program of Sichuan Province, China (Nos: 2023YFG0354, 2022YFH0020).

Author information

Correspondence to Jie Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, L., Hu, J., Teng, F. et al. Text semantic matching with an enhanced sample building method based on contrastive learning. Int. J. Mach. Learn. & Cyber. 14, 3105–3112 (2023). https://doi.org/10.1007/s13042-023-01823-8
