Machine Learning Based Finding of Similar Sentences from French Clinical Notes

  • Conference paper
Web Information Systems and Technologies (WEBIST 2020, WEBIST 2021)

Abstract

Finding similar sentences or paragraphs is central to handling text redundancy. The problem is especially acute in the clinical domain, where redundancy in clinical notes limits their secondary use. For French clinical documents, the task is particularly challenging because of the lack of resources. In this paper, we introduce an approach for computing semantic similarity between French clinical sentences based on supervised machine learning algorithms. The approach is implemented in a system called CONCORDIA, for COmputing semaNtic sentenCes for fRench Clinical Documents sImilArity. After briefly reviewing the semantic textual similarity measures reported in the literature, we describe the approach, which uses Random Forest (RF), Multilayer Perceptron (MLP) and Linear Regression (LR) algorithms to build different supervised models. These models are then used to determine the degree of semantic similarity between clinical sentences. CONCORDIA is evaluated with traditional evaluation metrics, EDRM (accuracy in relative distance to the average solution) and Spearman correlation, on the standard benchmarks provided for the DEFT 2020 challenge. According to the official results of this challenge, our MLP-based model ranked first out of the 15 submitted systems, with an EDRM of 0.8217 and a Spearman correlation coefficient of 0.7691. Post-challenge development of CONCORDIA and the experiments performed after DEFT 2020 showed a significant improvement in the performance of the implemented models. In particular, the new MLP-based model achieves a Spearman correlation coefficient of 0.80, while the LR model, which combines the output of the MLP model with word embedding similarity scores, obtains the highest Spearman correlation coefficient, 0.8030. These experiments show the effectiveness and relevance of the proposed approach for finding similar sentences in French clinical notes.
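As a rough illustration of the pipeline the abstract outlines (similarity scores computed for a sentence pair, then system output compared with reference scores using Spearman correlation), the following minimal sketch may help. It is not the CONCORDIA implementation: the whitespace tokenizer, the choice of Jaccard and Dice as example measures, and all function names are assumptions made for illustration, and the supervised models (RF, MLP, LR) that would consume such scores as features are omitted.

```python
from statistics import mean


def tokens(sentence):
    """Lowercased whitespace tokens as a set (a simplifying assumption)."""
    return set(sentence.lower().split())


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def dice(a, b):
    """Dice similarity: twice the shared tokens over the summed set sizes."""
    ta, tb = tokens(a), tokens(b)
    total = len(ta) + len(tb)
    return 2 * len(ta & tb) / total if total else 1.0


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy sentence pair, invented for illustration (not from the DEFT 2020 data).
s1 = "le patient a de la fièvre"
s2 = "le patient a une toux"
features = [jaccard(s1, s2), dice(s1, s2)]
```

In a supervised setting, feature vectors like `features` would be computed for every annotated sentence pair and passed to a regressor, and `spearman` would then compare the predicted scores with the reference annotations.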

Author information

Corresponding author

Correspondence to Khadim Dramé.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Dramé, K., Diallo, G., Sambe, G. (2023). Machine Learning Based Finding of Similar Sentences from French Clinical Notes. In: Marchiori, M., Domínguez Mayo, F.J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2020, WEBIST 2021. Lecture Notes in Business Information Processing, vol 469. Springer, Cham. https://doi.org/10.1007/978-3-031-24197-0_2

  • DOI: https://doi.org/10.1007/978-3-031-24197-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24196-3

  • Online ISBN: 978-3-031-24197-0

  • eBook Packages: Computer Science, Computer Science (R0)
