A literature-driven method to calculate similarities among diseases
Introduction
Diseases are functionally connected to one another. A single gene can cause several diseases, and inhibition of protein translation by a single miRNA can contribute to several diseases. A person with a given disease therefore has a higher probability than the general population of developing a functionally connected disease. Disease connection information can be used to predict the likelihood that a specific disease will develop in a given person, which is a simple example of how disease–disease similarity can support research on disease-related function. Disease–disease similarity can also aid new drug development through drug repositioning, help identify new disease-related genes, and improve the efficiency of network analysis in disease-related function research by enhancing disease networks.
There are three primary approaches to obtaining disease–disease similarity: function-based approaches [1], [2], [3], semantic-based approaches [4], [5], [6], [7], [8], [9], [10], [11], [12], and hybrid approaches that combine the two [13]. Function-based approaches compare functionally related genes, pathways, and biological processes, whereas semantic-based approaches measure the similarity between disease terms in a disease-related ontology. Hybrid approaches use both functional similarity and semantic similarity. Liu et al. [1] calculated disease–disease similarity using both genetic information from GAD (Genetic Association Database) and environmental etiological factors from MeSH (Medical Subject Headings). Suthram et al. [2] calculated disease–disease similarity using mRNA expression data from the GEO (Gene Expression Omnibus) database and protein–protein interactions from HPRD (Human Protein Reference Database). Mathur and Dinakarpandian [3] calculated disease–disease similarity using the semantic similarity of biological processes based on the gene ontology. Li et al. [4] developed a software package that calculates disease–disease similarity from semantic similarities among disease ontology terms; the software applies 10 semantic similarity methods to the disease ontology. Lastly, Cheng et al. [13] proposed a hybrid approach that first calculates an association score from a gene function network and a disease-related gene set, then calculates a semantic score on the disease ontology, and finally obtains disease–disease similarity by adding the two scores.
Many existing methods find disease–disease similarity using genetic information or semantic information on the gene ontology, but there are also other related approaches. Goh et al. [14] constructed a human disease network from gene–disease associations in the OMIM (Online Mendelian Inheritance in Man) database. They connected two diseases if the diseases shared at least one gene. Lee et al. [15] constructed a bipartite human disease association network using shared metabolic pathways from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. Two diseases are linked if the mutated enzymes associated with them catalyze adjacent metabolic reactions. The methods of Goh et al. and Lee et al. are related to the proposed method, but they do not calculate a disease–disease similarity score. Zhang et al. [16] created feature vectors for disease phenotypes from phenotype records, calculated cosine similarities among disease phenotypes using the feature vectors, and then developed a disease phenotype network. van Driel et al. [17] employed a text mining approach to calculate similarities between diseases. MeSH terms served as features, and the number of times a term was found in an OMIM record was used as the feature value. They used the MeSH hierarchy and the inverse document frequency measure to refine the feature values. Lastly, the similarity between two diseases was computed as the cosine of the angle between their corresponding feature vectors. Hamaneh et al. [18] calculated disease–disease similarity by considering information flow on a disease–protein network. The disease–protein network was built from disease–gene associations in CTD (The Comparative Toxicogenomics Database) and protein–protein interactions in the ppiTrim database. Proteins were treated as features of a disease, and the feature value was defined as the expected number of visits by a random walker on the disease–protein network. Disease–disease similarity was then calculated as the cosine of the angle between feature vectors, as in van Driel's method.
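The feature-vector approaches described above share one core computation: weight each feature, then take the cosine of the angle between two disease vectors. The following is a minimal sketch of that idea with inverse document frequency weighting. The record names, term names, and counts are invented for illustration, the helper names `idf_weights` and `cosine` are hypothetical, and the MeSH hierarchy refinement used by van Driel et al. is not modeled.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Inverse document frequency of each term across all disease records."""
    n = len(docs)
    # Each record contributes at most 1 to a term's document frequency.
    df = Counter(term for counts in docs.values() for term in counts)
    return {term: math.log(n / df[term]) for term in df}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy term counts per disease record (invented, not taken from OMIM or MeSH).
records = {
    "disease_a": {"term_x": 3, "term_y": 1},
    "disease_b": {"term_x": 2, "term_z": 4},
    "disease_c": {"term_y": 5},
}

idf = idf_weights(records)
weighted = {
    disease: {term: count * idf[term] for term, count in counts.items()}
    for disease, counts in records.items()
}

sim_ab = cosine(weighted["disease_a"], weighted["disease_b"])  # share term_x
sim_bc = cosine(weighted["disease_b"], weighted["disease_c"])  # share nothing
```

In this toy data, `disease_b` and `disease_c` have no terms in common, so their cosine similarity is zero, while `disease_a` and `disease_b` receive a small positive score from the shared `term_x`.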
Relations between biomedical terms found in the literature (research papers) can also help calculate disease–disease similarity. We propose a new literature-based method, LDDSim (Literature-Driven Disease Similarities), to measure disease similarity. The proposed method extracts disease–gene relations and disease–drug relations from the literature and, from the number of those relations, builds a disease–gene matrix and a disease–drug matrix. The method then calculates disease–disease similarity using the mutual information between two diseases. In addition, we constructed a disease network using the disease similarities.
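To make the pipeline concrete, the sketch below counts extracted relations into a disease-by-feature matrix and computes a mutual information score between two diseases. It is only an illustration under stated assumptions: this section does not give the LDDSim formula, so the estimator here (each feature treated as one observation, each disease as a binary present/absent variable) is an assumption, and the function names and toy relations are hypothetical.

```python
import math
from collections import Counter


def build_matrix(relations):
    """Count how often each (disease, feature) relation was extracted."""
    matrix = {}
    for disease, feature in relations:
        matrix.setdefault(disease, Counter())[feature] += 1
    return matrix


def mutual_information(row_a, row_b, features):
    """Mutual information (bits) between two binarized disease profiles.

    Assumption: a feature is 'present' for a disease if at least one
    relation was extracted. The paper's actual estimator may differ.
    """
    n = len(features)
    joint = Counter((f in row_a, f in row_b) for f in features)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = sum(c for (x, _), c in joint.items() if x == a) / n
        p_b = sum(c for (_, y), c in joint.items() if y == b) / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy disease-gene relations (invented); a disease-drug matrix would be
# built the same way from disease-drug relations.
gene_relations = [
    ("d1", "GENE1"), ("d1", "GENE1"), ("d1", "GENE2"),
    ("d2", "GENE1"), ("d2", "GENE2"),
    ("d3", "GENE1"), ("d3", "GENE3"),
]
genes = ["GENE1", "GENE2", "GENE3", "GENE4"]

disease_gene = build_matrix(gene_relations)
mi_12 = mutual_information(disease_gene["d1"], disease_gene["d2"], genes)
mi_13 = mutual_information(disease_gene["d1"], disease_gene["d3"], genes)
```

Here `d1` and `d2` have identical gene profiles and score higher than `d1` and `d3`, whose profiles are statistically independent. One caveat of this simplified binary estimator is that it would also score perfectly complementary profiles highly, so a count-based or otherwise adjusted estimator may be closer to what a real implementation needs.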
Section snippets
Materials
In this paper, we propose a method that calculates disease–disease similarity, and we develop a disease network from disease pairs with high similarity. We extracted 36,686 disease–gene relations and 25,721 disease–drug relations from 9,803,245 MEDLINE abstracts published between 1980 and 2012, using 27,850 disease names, 61,304 gene symbols, and 9,388 drug names from the PharmGKB database. After constructing disease–gene matrix and disease–drug matrix with these relations, similarities of 3,353,503
Results
The proposed method can calculate similarities for 3,353,503 disease pairs. We summarized the similarities by their mean, median, minimum, maximum, and standard deviation (Table 2).
Because the mean is much larger than the median, the statistics indicate that the similarities are generally very low, with a small number of high-similarity outliers. We therefore assume that the high-similarity outliers are significant. To identify the outliers, we investigated the trend of the similarities (Fig. 5).
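The reasoning from "mean much larger than median" to "a few high outliers" can be checked on a small example. The sketch below uses invented similarity scores, not the values from Table 2, and the outlier rule (mean plus one standard deviation) is an assumption chosen for illustration rather than the procedure behind Fig. 5.

```python
import statistics

# Invented similarity scores: mostly near zero, with two large values.
similarities = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.90, 0.95]

summary = {
    "mean": statistics.mean(similarities),
    "median": statistics.median(similarities),
    "min": min(similarities),
    "max": max(similarities),
    "stdev": statistics.stdev(similarities),  # sample standard deviation
}

# Assumed rule: a score is an outlier if it exceeds mean + 1 stdev.
threshold = summary["mean"] + summary["stdev"]
outliers = [s for s in similarities if s > threshold]
```

In this toy distribution the two large scores pull the mean well above the median, and they are the only values that pass the threshold.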
Discussion
The proposed method extracts disease–gene relations and disease–drug relations from the literature and uses the frequency of occurrence of those relations as the feature values of diseases. Disease–disease similarities are then calculated from the feature values. The proposed method discovered a larger number of answer disease pairs than other comparable methods, and manual checking of the top-ranking disease pairs also found many actual disease pairs. We presume that our method showed good
Conclusions
We calculated disease–disease similarity using literature data. Our method discovered a larger number of answer disease pairs than other comparable methods, and manual checking confirmed that 15 of the top 20 disease pairs have actual relations. Moreover, we constructed a literature-driven disease network with the top 167 disease pairs. We presume that our method showed good results because of using literature data, using all possible gene symbols and drug names for features of a
Conflict of interest
None declared.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2015R1A2A1A05001845). We also appreciate Mr. Junsik Kim's proofreading efforts.
References (69)
- et al. From phenotype to gene: detecting disease-specific gene functional modules via a text-based human disease phenotype network construction. FEBS Lett. (2010)
- et al. Mutual information for the selection of relevant variables in spectrometric nonlinear modelling. Chemom. Intell. Lab. Syst. (2006)
- et al. Lessons from similarities between SLE and HIV infection. J. Infect. (2002)
- et al. Incidence of cancers in people with HIV/AIDS compared with immunosuppressed transplant recipients: a meta-analysis. Lancet (2007)
- et al. Sputum biomarker profiles in cystic fibrosis (CF) and chronic obstructive pulmonary disease (COPD) and association between pulmonary function. Cytokine (2010)
- et al. Apoptosis and Parkinson's disease. Prog. Neuro-Psychopharmacol. Biol. Psychiatry (2003)
- et al. The nephropathy of cystic fibrosis: a human model of chronic nephrotoxicity. Hum. Pathol. (1982)
- et al. Metabolomics approaches for discovering biomarkers of drug-induced hepatotoxicity and nephrotoxicity. Toxicol. Appl. Pharmacol. (2010)
- et al. Diabetic retinopathy in adult patients with cystic fibrosis-related diabetes. Respir. Med. (1998)
- et al. Malignancy in systemic lupus erythematosus: what have we learned? Best Pract. Res. Clin. Rheumatol. (2009)
- The “etiome”: identification and clustering of human disease etiological factors. BMC Bioinform.
- Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLOS Comput. Biol.
- Finding disease similarity based on implicit semantic similarity. J. Biomed. Inform.
- DOSim: an R package for similarity between diseases based on disease ontology. BMC Bioinform.
- Using information content to evaluate semantic similarity in a taxonomy.
- Semantic similarity based on corpus statistics and lexical taxonomy.
- An Information-theoretic definition of similarity.
- Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors.
- A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform.
- Evaluating GO-based semantic similarity measures.
- A new method to measure the semantic similarity of GO terms. Bioinformatics
- Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins.
- SemFunSim: a new method for measuring disease similarity by integrating semantic and gene functional association. PLOS ONE
- The human disease network. Proc. Natl. Acad. Sci.
- The implications of human metabolic network topology for disease comorbidity. Proc. Natl. Acad. Sci.
- A text-mining analysis of the human phenome. Eur. J. Hum. Genet.
- Relating diseases by integrating gene associations and information flow through protein interaction network. PLOS ONE
- Conditional entropy and mutual information. Numerical Recipes 3rd Edition: The Art of Scientific Computing
- Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics
- Semantic similarity and relatedness between clinical terms: an experimental study.
- How does HIV cause AIDS? Science
- Emerging concepts in the immunopathogenesis of AIDS. Annu. Rev. Med.
- HIV infection and SLE: their pathogenic relationship. Clin. Exp. Rheumatol.
- HIV and lupus erythematosus: a diagnostic dilemma. Indian J. Dermatol.
Cited by (10)
Exploring novel disease-disease associations based on multi-view fusion network
2023, Computational and Structural Biotechnology Journal
Computational Methods for Identifying Similar Diseases
2019, Molecular Therapy Nucleic Acids
An Integrative Disease Information Network Approach to Similar Disease Detection
2023, IEEE/ACM Transactions on Computational Biology and Bioinformatics
Biomedical data, computational methods and tools for evaluating disease-disease associations
2022, Briefings in Bioinformatics
Classifying diseases by using biological features to identify potential nosological models
2021, Scientific Reports
MISSION: Multimodal-Information-Aided Similar Disease Detection Based on Disease Information Network
2020, Proceedings - 2020 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2020