Abstract
An orphan, or rare, disease is one that afflicts only a small percentage of the population. Worldwide, there are several thousand of these diseases. Taken together, the individuals affected amount to up to 10% of the US population. Research on these diseases is often poorly funded because of the lack of a potential market for treatments, which leaves patients and clinicians with very limited and scattered access to vital information. To help address this issue, we present in this paper a new software tool that automates the extraction of information related to rare diseases from scientific publications. More precisely, our contribution is a new method for automatically extracting the symptoms of these diseases from research papers, using a Named Entity Recognition (NER) algorithm based on the Term Frequency - Inverse Document Frequency (TF-IDF) numerical statistic. The proposed tool has been tested on the PubMed Central (PMC) database.
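As a rough illustration of the TF-IDF statistic mentioned in the abstract, the following Python sketch scores the terms of a small tokenized corpus. It assumes the textbook definition (relative term frequency multiplied by the logarithm of the inverse document frequency); the exact variant used by the tool is not specified in this section, and the function name `tf_idf` and the toy corpus are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumes the textbook form: tf = count / document length,
    idf = log(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, not data from the paper.
corpus = [
    "patients with the disease present fever and ataxia".split(),
    "patients with the disease present fever and rash".split(),
    "patients with the disease present seizures".split(),
]
scores = tf_idf(corpus)

# The highest-scoring term of the first document is the one that is
# frequent there but rare across the rest of the corpus.
top_term = max(scores[0], key=scores[0].get)
print(top_term)
```

Terms that occur in every document (here "patients" or "disease") receive a score of zero, which is why a statistic of this kind can help separate candidate symptom terms from vocabulary shared by all papers.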
Acknowledgements
This project was conducted with financial support from UQAC and the Natural Sciences and Engineering Research Council of Canada (NSERC).
Cite this article
Cousyn, C., Bouchard, K., Gaboury, S. et al. Towards Using Scientific Publications to Automatically Extract Information on Rare Diseases. Mobile Netw Appl 25, 953–960 (2020). https://doi.org/10.1007/s11036-019-01237-3
DOI: https://doi.org/10.1007/s11036-019-01237-3