A literature-driven method to calculate similarities among diseases

doi:10.1016/j.cmpb.2015.07.001

Computer Methods and Programs in Biomedicine

Volume 122, Issue 2, November 2015, Pages 108-122

Highlights

• We propose a novel method for calculating disease–disease similarity.
• The proposed method can be used to construct disease networks.
• The proposed method discovered the largest number of answer disease pairs and showed the lowest p-value among the comparable methods.
• The proposed method can provide insight into the relationships between diseases.

Abstract

Background

“Our lives are connected by a thousand invisible threads, and along these sympathetic fibers, our actions run as causes and return to us as results.” This famous quote, attributed to Herman Melville, describes the connections among human lives. To paraphrase it, diseases are connected by many functional threads, and along these fibers, diseases run as causes and return as results. The quote captures the motivation for studying disease–disease similarity and disease networks. Measuring similarities between diseases and constructing disease networks can play an important role in research on disease function and in disease treatment. To estimate disease–disease similarities, we propose a novel literature-based method.

Methods and results

The proposed method extracts disease–gene relations and disease–drug relations from the literature and uses the frequencies of occurrence of these relations as features to calculate similarities among diseases. We also constructed a disease network from the top-ranking disease pairs produced by our method. The proposed method discovered a larger number of answer disease pairs than other comparable methods and showed the lowest p-value.

Conclusions

We presume that our method showed good results because it uses literature data, uses all possible gene symbols and drug names as features of a disease, and determines the feature values of diseases from the frequencies of co-occurrence of two entities. The disease–disease similarities produced by the proposed method can be used in computational biology research that relies on similarities among diseases.

Introduction

Diseases are functionally connected to one another. A single gene can cause various diseases, and inhibition of protein translation by a single miRNA can be a contributing factor to various diseases. Therefore, a person with a certain disease has a higher probability of developing a functionally connected disease than a person without it. Disease connection information can thus be used to predict the likelihood that a person will develop a specific disease, which is a simple example of how disease–disease similarity can be utilized in disease-related function research. Disease–disease similarity can also aid the development of new drugs through drug repositioning, help in the search for new disease-related genes, and increase the efficiency of network analysis in disease-related function research by enhancing disease networks.

There are three primary approaches to obtaining disease–disease similarity: function-based approaches [1], [2], [3], semantic-based approaches [4], [5], [6], [7], [8], [9], [10], [11], [12], and hybrid approaches that combine the two [13]. Function-based approaches compare functionally related genes, pathways and biological processes, whereas semantic-based approaches measure the similarity between disease terms in a disease-related ontology. Hybrid approaches utilize both functional similarity and semantic similarity. Liu et al. [1] calculated disease–disease similarity using both genetic information from GAD (Genetic Association Database) and environmental etiological factors from MeSH (Medical Subject Headings). Suthram et al. [2] calculated disease–disease similarity using mRNA expression data from the GEO (Gene Expression Omnibus) database and protein–protein interactions from HPRD (Human Protein Reference Database). Mathur and Dinakarpandian [3] calculated disease–disease similarity using the semantic similarity of biological processes based on the Gene Ontology. Li et al. [4] developed a software package that calculates disease–disease similarity from semantic similarities among Disease Ontology terms; the package applies 10 semantic similarity methods to the Disease Ontology. Lastly, the method of Cheng et al. [13] is a hybrid approach: it first calculates an association score using a gene function network and a disease-related gene set, then calculates a semantic score on the Disease Ontology, and finally obtains the disease–disease similarity by adding the two scores.

Many existing methods find disease–disease similarity using genetic information or semantic information from the Gene Ontology, but there are also other related approaches. Goh et al. [14] constructed a human disease network from gene–disease associations in the OMIM (Online Mendelian Inheritance in Man) database, connecting two diseases if they shared at least one gene. Lee et al. [15] constructed a bipartite human disease association network using shared metabolic pathways from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database; two diseases are linked if the mutated enzymes associated with them catalyze adjacent metabolic reactions. The methods of Goh et al. and Lee et al. are related to the proposed method, but they cannot calculate disease–disease similarity. Zhang et al. [16] created feature vectors for disease phenotypes from phenotype records, calculated cosine similarities among disease phenotypes using these vectors, and then developed a disease phenotype network. van Driel et al. [17] employed a text-mining approach to calculate similarities between diseases. MeSH terms served as features, and the number of times a term was found in an OMIM record was used as the feature value. They used the MeSH hierarchy and an inverse document frequency measure to refine the feature values. Lastly, the similarity between two diseases was computed as the cosine of the angle between their corresponding feature vectors. Hamaneh et al. [18] calculated disease–disease similarity by considering information flow on a disease–protein network. The disease–protein network was built from disease–gene associations in CTD (the Comparative Toxicogenomics Database) and protein–protein interactions from the ppiTrim database. Proteins were treated as features of a disease, and the feature value was defined as the expected number of visits by a random walker on the disease–protein network. Disease–disease similarity was then calculated as the cosine of the angle between feature vectors, as in van Driel's method.
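Several of the methods above (Zhang et al., van Driel et al., Hamaneh et al.) score a disease pair by the cosine of the angle between two feature vectors. The following Python sketch illustrates only that measure. The function name, the toy vectors, and the convention of returning 0 for an all-zero vector are assumptions made for illustration and are not taken from any of the cited works.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention assumed here: a disease with no features is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature values (e.g. term counts) for three diseases.
disease_a = [3, 0, 1, 0]
disease_b = [6, 0, 2, 0]   # same profile as disease_a, twice the counts
disease_c = [0, 4, 0, 5]   # no features in common with disease_a

sim_ab = cosine_similarity(disease_a, disease_b)
sim_ac = cosine_similarity(disease_a, disease_c)
print(sim_ab, sim_ac)
```

Because the cosine depends only on the direction of the vectors, `disease_a` and `disease_b` come out as maximally similar even though their raw counts differ, while `disease_a` and `disease_c`, which share no features, score zero.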

Biomedical term relations from the literature (research papers) can also help calculate disease–disease similarity. We propose a new literature-based method, LDDSim (Literature-Driven Disease Similarities), to measure disease similarity. The proposed method extracts disease–gene relations and disease–drug relations from the literature and, from the number of those relations, builds a disease–gene matrix and a disease–drug matrix. The method then calculates disease–disease similarity using the mutual information between the two diseases. In addition, we constructed a disease network using the disease similarities.
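The pipeline described above (count relations, build a feature matrix, compare two diseases by mutual information) can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the LDDSim implementation: the exact relation-extraction and mutual-information formulation are not given in this excerpt, so the sketch assumes naive substring matching, binarized feature values, and a plug-in mutual information estimate. All function names, abstracts, and term lists are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_counts(abstracts, diseases, features):
    """Count, per disease, the abstracts that also mention each feature term.

    Assumption: naive case-insensitive substring matching stands in for
    real relation extraction.
    """
    counts = {d: Counter() for d in diseases}
    for text in abstracts:
        lowered = text.lower()
        found_d = [d for d in diseases if d.lower() in lowered]
        found_f = [f for f in features if f.lower() in lowered]
        for d, f in product(found_d, found_f):
            counts[d][f] += 1
    return counts


def feature_vector(counter, features):
    """Assumption: binarize counts (1 if the relation was ever observed)."""
    return [1 if counter[f] > 0 else 0 for f in features]


def mutual_information(x, y):
    """Plug-in mutual information (in bits) of two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


# Hypothetical toy corpus and dictionaries.
abstracts = [
    "BRCA1 mutations in breast cancer",
    "BRCA1 and ovarian cancer risk",
    "TP53 in breast cancer",
    "TP53 loss in ovarian cancer",
    "Insulin therapy for diabetes",
    "Metformin use in diabetes",
]
diseases = ["breast cancer", "ovarian cancer", "diabetes"]
features = ["BRCA1", "TP53", "insulin", "metformin", "EGFR", "aspirin"]

counts = cooccurrence_counts(abstracts, diseases, features)
vectors = {d: feature_vector(counts[d], features) for d in diseases}

sim_related = mutual_information(vectors["breast cancer"], vectors["ovarian cancer"])
sim_unrelated = mutual_information(vectors["breast cancer"], vectors["diabetes"])
print(sim_related, sim_unrelated)
```

In this toy corpus the two cancers share the same gene features and score higher than the cancer and diabetes pair. Note that mutual information rewards any statistical dependence, including complementary feature patterns, so the choice of discretization and feature set matters a great deal in practice.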

Section snippets

Materials

In this paper, we proposed a method that calculates disease–disease similarity and developed a disease network using disease pairs with high similarity. We extracted 36,686 disease–gene relations and 25,721 disease–drug relations from 9,803,245 MEDLINE abstracts published between 1980 and 2012, using 27,850 disease names, 61,304 gene symbols and 9,388 drug names from the PharmGKB database. After constructing the disease–gene matrix and the disease–drug matrix from these relations, similarities of 3,353,503

Results

The proposed method can calculate similarities for 3,353,503 disease pairs. We summarized the similarities with their mean, median, minimum, maximum, and standard deviation (Table 2).

Because the mean is much larger than the median, the statistics indicate that the similarities are generally very low, with a small number of high-similarity outliers. We therefore assume that the high-similarity outliers are significant, and we investigated the trend of the similarities to identify them (Fig. 5).
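The reasoning from "mean much larger than median" to "a few high outliers" can be checked with a small Python example. The values below are invented purely for illustration and are not the similarities reported in Table 2.

```python
import statistics

# Hypothetical similarity scores: mostly near zero, with two high outliers.
similarities = [0.01, 0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.90, 0.95]

mean = statistics.mean(similarities)
median = statistics.median(similarities)
print(f"mean={mean:.3f}, median={median:.3f}")

# Dropping the two outliers brings the mean back close to the median.
trimmed_mean = statistics.mean(similarities[:-2])
print(f"mean without outliers={trimmed_mean:.3f}")
```

The median reflects the typical, very low score, while the mean is pulled upward by the two large values, which is the pattern the text describes.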

Discussion

The proposed method extracts disease–gene relations and disease–drug relations from the literature and derives the feature values of diseases from the frequencies of occurrence of these relations. Disease–disease similarities are then calculated from the feature values. The proposed method discovered a larger number of answer disease pairs than other comparable methods, and manual inspection of the top-ranking disease pairs also found many actual disease pairs. We presume that our method showed good

Conclusions

We calculated disease–disease similarity using literature data. Our method discovered a larger number of answer disease pairs than other comparable methods, and we manually verified that 15 of the top 20 disease pairs have actual relations. Moreover, we constructed a literature-driven disease network from the top 167 disease pairs. We presume that our method showed good results because of using literature data, using all possible gene symbols and drug names for features of a

Conflict of interest

None declared.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2015R1A2A1A05001845). We also appreciate Mr. Junsik Kim's proofreading efforts.

References (69)

S. Zhang et al. From phenotype to gene: detecting disease-specific gene functional modules via a text-based human disease phenotype network construction. FEBS Lett. (2010)
F. Rossi et al. Mutual information for the selection of relevant variables in spectrometric nonlinear modelling. Chemom. Intell. Lab. Syst. (2006)
I. Sekigawa et al. Lessons from similarities between SLE and HIV infection. J. Infect. (2002)
A.E. Grulich et al. Incidence of cancers in people with HIV/AIDS compared with immunosuppressed transplant recipients: a meta-analysis. Lancet (2007)
O. Eickmeier et al. Sputum biomarker profiles in cystic fibrosis (CF) and chronic obstructive pulmonary disease (COPD) and association between pulmonary function. Cytokine (2010)
N. Lev et al. Apoptosis and Parkinson's disease. Prog. Neuro-Psychopharmacol. Biol. Psychiatry (2003)
C.R. Abramowsky et al. The nephropathy of cystic fibrosis: a human model of chronic nephrotoxicity. Hum. Pathol. (1982)
R.D. Beger et al. Metabolomics approaches for discovering biomarkers of drug-induced hepatotoxicity and nephrotoxicity. Toxicol. Appl. Pharmacol. (2010)
B. Yung et al. Diabetic retinopathy in adult patients with cystic fibrosis-related diabetes. Respir. Med. (1998)
S. Bernatsky et al. Malignancy in systemic lupus erythematosus: what have we learned? Best Pract. Res. Clin. Rheumatol. (2009)

Y.I. Liu et al. The “etiome”: identification and clustering of human disease etiological factors. BMC Bioinform. (2009)
S. Suthram et al. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLOS Comput. Biol. (2010)
S. Mathur et al. Finding disease similarity based on implicit semantic similarity. J. Biomed. Inform. (2011)
J. Li et al. DOSim: an R package for similarity between diseases based on disease ontology. BMC Bioinform. (2011)
P. Resnik. Using information content to evaluate semantic similarity in a taxonomy.
J.J. Jiang et al. Semantic similarity based on corpus statistics and lexical taxonomy.
D. Lin. An Information-theoretic definition of similarity.
F.M. Couto et al. Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors.
A. Schlicker et al. A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform. (2006)
C. Pesquita et al. Evaluating GO-based semantic similarity measures.
J.Z. Wang et al. A new method to measure the semantic similarity of GO terms. Bioinformatics (2007)
B. Li et al. Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins.
L. Cheng et al. SemFunSim: a new method for measuring disease similarity by integrating semantic and gene functional association. PLOS ONE (2014)
K. Goh et al. The human disease network. Proc. Natl. Acad. Sci. (2007)
D. Lee et al. The implications of human metabolic network topology for disease comorbidity. Proc. Natl. Acad. Sci. (2008)
M.A. van Driel et al. A text-mining analysis of the human phenome. Eur. J. Hum. Genet. (2006)
M.B. Hamaneh et al. Relating diseases by integrating gene associations and information flow through protein interaction network. PLOS ONE (2014)
W.H. Press et al. Conditional entropy and mutual information. Numerical Recipes 3rd Edition: The Art of Scientific Computing (2007)
X. Zhang et al. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics (2012)
S. Pakhomov et al. Semantic similarity and relatedness between clinical terms: an experimental study.
R.A. Weiss. How does HIV cause AIDS? Science (1993)
D.C. Douek et al. Emerging concepts in the immunopathogenesis of AIDS. Annu. Rev. Med. (2009)
I. Sekigawa et al. HIV infection and SLE: their pathogenic relationship. Clin. Exp. Rheumatol. (1998)
F. Kaliyadan. HIV and lupus erythematosus: a diagnostic dilemma. Indian J. Dermatol. (2008)
