Abstract
Word embeddings play a significant role in today’s Natural Language Processing tasks and applications. However, high-quality word embeddings specific to the Italian medical domain remain scarce. This study addresses that gap with a tailored solution that combines Contrastive Learning (CL) methods and Knowledge Graph Embedding (KGE) and introduces a new variant of the loss function. Because medical texts and controlled vocabularies are limited in the Italian language, traditional approaches to word embedding generation may not yield adequate results. To overcome this challenge, our approach leverages the synergistic benefits of CL and KGE techniques. We achieve a significant performance boost over the initial model while using considerably less data. This work establishes a solid foundation for further investigations aimed at improving the accuracy and coverage of word embeddings in low-resource languages and specialized domains.
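The abstract does not spell out the loss variant, so the following is only a minimal sketch of how a contrastive term and a knowledge graph term might be combined in general. It swaps in two standard formulations, an InfoNCE-style contrastive loss and a TransE-style margin loss, and is not the authors' loss function. All function names, the toy vectors, the temperature `tau`, the `margin`, and the mixing `weight` are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss (assumed stand-in for the CL term).

    Low when the anchor is closer to its positive (e.g. a synonym)
    than to the negatives.
    """
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


def transe_distance(head, relation, tail):
    """TransE distance ||h + r - t||_2; small for plausible triples."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def transe_margin_loss(head, relation, tail, corrupt_tail, margin=1.0):
    """Margin ranking loss (assumed stand-in for the KGE term).

    Zero once the true triple is closer than the corrupted one by `margin`.
    """
    pos = transe_distance(head, relation, tail)
    neg = transe_distance(head, relation, corrupt_tail)
    return max(0.0, margin + pos - neg)


def combined_loss(cl_term, kge_term, weight=0.5):
    """Weighted sum of the two terms; `weight` is a hypothetical hyperparameter."""
    return cl_term + weight * kge_term


# Toy 2-d vectors standing in for term embeddings (hypothetical values).
term = [1.0, 0.0]          # e.g. an Italian medical term
synonym = [0.9, 0.1]       # a synonym of the same concept
unrelated = [0.0, 1.0]     # a term from a different concept
relation = [0.0, 1.0]      # a knowledge graph relation vector
true_tail = [1.0, 1.0]     # tail concept that satisfies term + relation
wrong_tail = [-1.0, -1.0]  # corrupted tail used as a negative sample

cl = info_nce(term, synonym, [unrelated])
kge = transe_margin_loss(term, relation, true_tail, wrong_tail)
total = combined_loss(cl, kge)
```

In this toy setting the contrastive term is small because the synonym is nearly parallel to the anchor, and the knowledge graph term is zero because the true triple already beats the corrupted one by more than the margin. A real training loop would compute both terms over mini-batches of encoder outputs and backpropagate through the sum.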
D. A. Bondarenko and R. Ferrod: These authors contributed equally to this work.
Notes
1. UMLS is a collection of controlled vocabularies which comprises a comprehensive thesaurus and ontology of the biomedical sciences; it is available at https://www.nlm.nih.gov/research/umls.
2. Body Part, Organ, or Organ Component (BP), Body Substance (BS), Chemical (C), Medical Device (MD), Finding (F), Sign or Symptom (SS), Health Care Activity (HCA), Diagnostic Procedure (DP), Laboratory Procedure (LP), Therapeutic or Preventive Procedure (TPP), Pathologic Function (PF), Physiologic Function (PhF), and Injury or Poisoning (IP).
3. cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR-large.
4. GanjinZero/coder_all.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bondarenko, D.A., Ferrod, R., Caro, L.D. (2023). Combining Contrastive Learning and Knowledge Graph Embeddings to Develop Medical Word Embeddings for the Italian Language. In: Basili, R., Lembo, D., Limongelli, C., Orlandini, A. (eds) AIxIA 2023 – Advances in Artificial Intelligence. AIxIA 2023. Lecture Notes in Computer Science, vol 14318. Springer, Cham. https://doi.org/10.1007/978-3-031-47546-7_28
DOI: https://doi.org/10.1007/978-3-031-47546-7_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47545-0
Online ISBN: 978-3-031-47546-7
eBook Packages: Computer Science (R0)