Abstract
Concept normalization in free-form texts is a crucial step in every text-mining pipeline. Neural architectures based on Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art results in the biomedical domain. In the context of drug discovery and development, clinical trials are necessary to establish the efficacy and safety of drugs. We investigate the effectiveness of transferring concept normalization from the general biomedical domain to the clinical trials domain in a zero-shot setting with an absence of labeled data. We propose a simple and effective two-stage neural approach based on fine-tuned BERT architectures. In the first stage, we train a metric learning model that optimizes relative similarity of mentions and concepts via triplet loss. The model is trained on available labeled corpora of scientific abstracts to obtain vector embeddings of concept names and entity mentions from texts. In the second stage, we find the closest concept name representation in an embedding space to a given clinical mention. We evaluated several models, including state-of-the-art architectures, on a dataset of abstracts and a real-world dataset of trial records with interventions and conditions mapped to drug and disease terminologies. Extensive experiments validate the effectiveness of our approach in knowledge transfer from the scientific literature to clinical trials.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17. American Medical Informatics Association (2001)
Atal, I., Zeitoun, J.D., Névéol, A., Ravaud, P., Porcher, R., Trinquart, L.: Automatic classification of registered clinical trials towards the global burden of diseases taxonomy of diseases and injuries. BMC Bioinform. 17(1), 392 (2016)
Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32(suppl\_1), D267–D270 (2004)
Boland, M.R., Miotto, R., Gao, J., Weng, C.: Feasibility of feature-based indexing, clustering, and search of clinical trials. Meth. Inf. Med. 52(05), 382–394 (2013)
Brown, A.S., Patel, C.J.: A standard database for drug repositioning. Sci. Data 4(1), 1–7 (2017)
Coletti, M.H., Bleich, H.L.: Medical subject headings used to search the biomedical literature. J. Am. Med. Inform. Assoc. 8(4), 317–323 (2001)
Davis, A.P., et al.: The comparative toxicogenomics database: update 2019. Nucleic Acids Res. 47(D1), D948–D954 (2019)
Davis, A.P., Wiegers, T.C., Rosenstein, M.C., Mattingly, C.J.: Medic: a practical disease vocabulary used at the comparative toxicogenomics database. Database 2012 (2012)
Dermouche, M., Looten, V., Flicoteaux, R., Chevret, S., Velcin, J., Taright, N.: ECSTRA-INSERM@ CLEF eHealth2016-task 2: ICD10 code extraction from death certificates. CLEF (2016)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
Gayvert, K.M., Madhukar, N.S., Elemento, O.: A data-driven approach to predicting successes and failures of clinical trials. Cell Chem. Biol. 23(10), 1294–1301 (2016)
Ghiasvand, O., Kate, R.J.: UWM: Disorder mention extraction from clinical text using CRFs and normalization using learned edit distance patterns. In: SemEval@ COLING, pp. 828–832 (2014)
Gill, S.K., Christopher, A.F., Gupta, V., Bansal, P.: Emerging role of bioinformatics tools and software in evolution of clinical research. Perspect. Clin. Res. 7(3), 115 (2016)
Gillick, D., et al.: Learning dense representations for entity retrieval. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pp. 528–537 (2019)
Hao, T., Rusanov, A., Boland, M.R., Weng, C.: Clustering clinical trials with similar eligibility criteria features. J. Biomed. Inform. 52, 112–120 (2014)
Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7
Huang, C.C., Lu, Z.: Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings Bioinform. 17(1), 132–144 (2015)
Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2333–2338 (2013)
Humeau, S., Shuster, K., Lachaux, M.A., Weston, J.: Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. CoRR abs/1905.01969. External Links: Link Cited by 2, 2–2 (2019)
Ivanenkov, Y., et al.: Identification of novel antibacterials using machine-learning techniques. Front. Pharmacol. 10, 913 (2019)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUS. arXiv preprint arXiv:1702.08734 (2017)
Leaman, R., Lu, Z.: Taggerone: joint named entity recognition and normalization with semi-markov models. Bioinformatics 32(18), 2839–2846 (2016)
Lee, J., et al.: Biobert: pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (2019)
Leveling, J.: Patient selection for clinical trials based on concept-based retrieval and result filtering and ranking. In: TREC (2017)
Li, H., et al.: CNN-based ranking for biomedical entity normalization. BMC Bioinform. 18(11), 79–86 (2017)
Li, J., Lu, Z.: Systematic identification of pharmacogenomics information from clinical trials. J. Biomed. Inform. 45(5), 870–878 (2012)
Li, J., et al.: Biocreative V CDR task corpus: a resource for chemical disease relation extraction. Database 2016 (2016)
Liu, Y., Guo, Y., Bakker, E.M., Lew, M.S.: Learning a recurrent residual fusion network for multimodal matching. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4107–4116 (2017)
Lo, B.: Sharing clinical trial data: maximizing benefits, minimizing risk. Jama 313(8), 793–794 (2015)
McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), 153–157 (1947)
Miftahutdinov, Z., Tutubalina, E.: Deep neural models for medical concept normalization in user-generated texts. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 393–399 (2019)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Mork, J.G., Jimeno-Yepes, A., Aronson, A.R.: The NLM medical text indexer system for indexing biomedical literature. In: BioASQ@ CLEF (2013)
NLM: Umls glossary (2016). http://www.nlm.nih.gov/research/umls/new_users/glossary.html
Phan, M.C., Sun, A., Tay, Y.: Robust representation learning of biomedical names. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3275–3285 (2019)
Pradhan, S., Elhadad, N., Chapman, W.W., Manandhar, S., Savova, G.: Semeval-2014 task 7: Analysis of clinical text. In: SemEval@ COLING, pp. 54–62 (2014)
Reimers, N., Gurevych, I.: Sentence-bert: sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3973–3983 (2019)
Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
Sen, A., et al.: The representativeness of eligible patients in type 2 diabetes trials: a case study using gist 2.0. J. Am. Med. Inform. Assoc. 25(3), 239–247 (2018)
Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: issues, techniques, and solutions. IEEE Trans. Knowl. Data Eng. 27(2), 443–460 (2014)
Sung, M., Jeon, H., Lee, J., Kang, J.: Biomedical entity representations with synonym marginalization. arXiv preprint arXiv:2005.00239 (2020)
Suominen, H., et al.: Overview of the ShARe/CLEF eHealth evaluation lab 2013. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. LNCS, vol. 8138, pp. 212–231. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40802-1_24
Tutubalina, E., Kadurin, A., Miftahutdinov, Z.: Fair evaluation in concept normalization: a large-scale comparative analysis for bert-based models. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 6710–6716 (2020)
Tutubalina, E., Miftahutdinov, Z., Nikolenko, S., Malykh, V.: Medical concept normalization in social media posts with recurrent neural networks. J. Biomed. Inform. 84, 93–102 (2018)
Van Mulligen, E., Afzal, Z., Akhondi, S.A., Vo, D., Kors, J.A.: Erasmus MC at CLEF eHealth 2016: Concept recognition and coding in French texts. CLEF (2016)
Wishart, D.S., et al.: Drugbank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34(suppl\_1), D668–D672 (2006)
Wright, D., Katsis, Y., Mehta, R., Hsu, C.N.: Normco: deep disease normalization for biomedical knowledge base construction. In: Automated Knowledge Base Construction (2019). https://openreview.net/forum?id=BJerQWcp6Q
Wu, P., Hoi, S.C., Xia, H., Zhao, P., Wang, D., Miao, C.: Online multimodal deep similarity learning with application to image retrieval. In: Proceedings of the 21st ACM International Conference on Multimedia, pp. 153–162 (2013)
Zhao, S., Liu, T., Zhao, S., Wang, F.: A neural multi-task learning framework to jointly model medical named entity recognition and normalization. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 817–824 (2019)
Zhavoronkov, A., et al.: Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37(9), 1038–1040 (2019)
Zhu, M., Celikkaya, B., Bhatia, P., Reddy, C.K.: Latte: Latent type modeling for biomedical entity linking. arXiv preprint arXiv:1911.09787 (2019)
Acknowledgements
Research on academic corpora was carried out by Z.M. and supported by RFBR, project no. 19–37-90074.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Miftahutdinov, Z., Kadurin, A., Kudrin, R., Tutubalina, E. (2021). Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer. In: Hiemstra, D., Moens, MF., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science(), vol 12656. Springer, Cham. https://doi.org/10.1007/978-3-030-72113-8_30
Download citation
DOI: https://doi.org/10.1007/978-3-030-72113-8_30
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72112-1
Online ISBN: 978-3-030-72113-8
eBook Packages: Computer ScienceComputer Science (R0)