A literature-driven method to calculate similarities among diseases

https://doi.org/10.1016/j.cmpb.2015.07.001

Highlights

  • We propose a novel method that calculates disease–disease similarity.

  • The proposed method can be used to construct disease networks.

  • The proposed method discovered the largest number of answer disease pairs and showed the lowest p-value among comparable methods.

  • The proposed method can provide insight into the relationships between diseases.

Abstract

Background

“Our lives are connected by a thousand invisible threads and along these sympathetic fibers, our actions run as causes and return to us as results”. This famous quote, attributed to Herman Melville, describes the connections among human lives. To paraphrase it, diseases are connected by many functional threads, and along these fibers diseases run as causes and return as results. This view motivates research on disease–disease similarity and disease networks. Measuring similarities between diseases and constructing disease networks can play an important role in disease function research and in disease treatment. To estimate disease–disease similarities, we propose a novel literature-based method.

Methods and results

The proposed method extracted disease–gene relations and disease–drug relations from the literature and used the frequencies of occurrence of these relations as features to calculate similarities among diseases. We also constructed a disease network from the top-ranking disease pairs produced by our method. The proposed method discovered a larger number of answer disease pairs than other comparable methods and showed the lowest p-value.

Conclusions

We presume that our method showed good results for three reasons: it uses literature data, it uses all possible gene symbols and drug names as features of a disease, and it determines the feature values of diseases from the frequencies of co-occurrence of two entities. The disease–disease similarities from the proposed method can be used in computational biology research that relies on similarities among diseases.

Introduction

Diseases are functionally connected to one another. One gene can cause various diseases, and the inhibition of protein translation by one miRNA can be a contributing factor to various diseases. Therefore, a person with a certain disease has a higher probability of developing a functionally connected disease than a healthy person. Using this disease connection information, the likelihood that a person develops a specific disease can be predicted, which is a simple example of how disease–disease similarity can be utilized in disease-related function research. Disease–disease similarity can help disease research in several ways: it can support the development of new drugs by aiding drug repositioning, it can help in the search for new disease-related genes, and it can increase the efficiency of network analysis in disease-related function research by enhancing disease networks.

There are three primary approaches to computing disease–disease similarity: function-based approaches [1], [2], [3], semantic-based approaches [4], [5], [6], [7], [8], [9], [10], [11], [12], and hybrid approaches that combine the two [13]. Function-based approaches compare functionally related genes, pathways and biological processes, whereas semantic-based approaches measure the similarity between disease terms in a disease-related ontology. Hybrid approaches utilize both functional similarity and semantic similarity. Liu et al. [1] calculated disease–disease similarity using both genetic information from GAD (Genetic Association Database) and environmental etiological factors from MeSH (Medical Subject Headings). Suthram et al. [2] calculated disease–disease similarity using mRNA expression from the GEO (Gene Expression Omnibus) database and protein–protein interactions from HPRD (Human Protein Reference Database). Mathur and Dinakarpandian [3] calculated disease–disease similarity using the semantic similarity of biological processes based on the gene ontology. Li et al. [4] developed a software package that calculates disease–disease similarity from semantic similarities among disease ontology terms; the package applies 10 semantic similarity methods to the disease ontology. Lastly, the hybrid approach of Cheng et al. [13] first calculates an association score using a gene function network and disease-related gene sets, then calculates a semantic score on the disease ontology, and finally obtains the disease–disease similarity by adding the two scores.

Many existing methods find disease–disease similarity using genetic information or semantic information from the gene ontology, but there are also other related approaches. Goh et al. [14] constructed a human disease network from gene–disease associations in the OMIM (Online Mendelian Inheritance in Man) database. They connected two diseases if the diseases shared at least one gene. Lee et al. [15] constructed a bipartite human disease association network using shared metabolic pathways from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database. Two diseases are linked if the mutated enzymes associated with them catalyze adjacent metabolic reactions. The methods of Goh et al. and Lee et al. are related to the proposed method, but they cannot calculate disease–disease similarity. Zhang et al. [16] created feature vectors for disease phenotypes from phenotype records, calculated cosine similarities among disease phenotypes using these feature vectors, and then developed a disease phenotype network. van Driel et al. [17] employed a text-mining approach to calculate similarities between diseases. MeSH terms served as features, and the number of times a term was found in an OMIM record was used as the feature value. They used the MeSH hierarchy and an inverse document frequency measure to refine the feature values. Lastly, the similarity between two diseases was computed as the cosine of the angle between their corresponding feature vectors. Hamaneh et al. [18] calculated disease–disease similarity by considering information flow on a disease–protein network. The disease–protein network was built from disease–gene associations in CTD (the Comparative Toxicogenomics Database) and protein–protein interactions in the ppiTrim database. Proteins were treated as features of a disease, and each feature value was defined as the expected number of visits by a random walker on the disease–protein network. Disease–disease similarity was then calculated as the cosine of the angle between feature vectors, as in van Driel's method.
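Two of the ideas surveyed above are simple enough to illustrate with a minimal sketch: linking two diseases when they share at least one gene (the rule described for Goh et al.), and scoring two diseases by the cosine of the angle between their feature vectors (as described for van Driel et al. and Hamaneh et al.). The function names, the feature names and the gene labels `G1`, `G2`, `G3` are hypothetical toy values chosen for illustration, not data or code from any of the cited works, and no MeSH-hierarchy or inverse-document-frequency refinement is attempted.

```python
import math


def share_a_gene(genes_a, genes_b):
    """Return True if two diseases have at least one associated gene in common."""
    return bool(set(genes_a) & set(genes_b))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse feature vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical term counts per disease record (toy data, not from OMIM).
disease_a = {"term_x": 3, "term_y": 1}
disease_b = {"term_x": 1, "term_y": 3}
disease_c = {"term_z": 2}

if __name__ == "__main__":
    print(share_a_gene({"G1", "G2"}, {"G2", "G3"}))  # the sets overlap on G2
    print(cosine_similarity(disease_a, disease_b))   # approximately 0.6
    print(cosine_similarity(disease_a, disease_c))   # no shared terms
```

The shared-gene rule yields only a yes/no link, which is why such networks cannot by themselves provide a graded similarity, whereas the cosine gives a continuous score.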

Biomedical term relations extracted from the literature (research papers) can also help calculate disease–disease similarity. We propose a new literature-based method, LDDSim (Literature-Driven Disease Similarities), to measure disease similarity. The proposed method extracts disease–gene relations and disease–drug relations from the literature and, from the counts of those relations, builds a disease–gene matrix and a disease–drug matrix. The method then calculates disease–disease similarity using the mutual information between two diseases. In addition, we constructed a disease network using the disease similarities.
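The pipeline outlined above (count relations, build a disease-by-feature matrix, score disease pairs by mutual information, rank the pairs) can be sketched as follows. This is only an illustration under stated assumptions: the text here does not specify how mutual information is computed from the matrices, so the sketch assumes each disease is reduced to a binary presence/absence indicator per feature and that mutual information is taken between two such indicator sequences. All function names and the labels `D1`..`D4`, `G1`..`G6` are hypothetical, and the sketch should not be read as the LDDSim implementation.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def build_count_matrix(relations):
    """Map each disease to a Counter of feature -> number of extracted relations."""
    matrix = defaultdict(Counter)
    for disease, feature in relations:
        matrix[disease][feature] += 1
    return matrix


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def disease_similarity(matrix, d1, d2, features):
    """ASSUMPTION: binarize counts to presence/absence before computing MI."""
    x = [int(matrix[d1][f] > 0) for f in features]
    y = [int(matrix[d2][f] > 0) for f in features]
    return mutual_information(x, y)


def rank_pairs(matrix, features):
    """Score every disease pair and sort from most to least similar."""
    scored = [
        (disease_similarity(matrix, a, b, features), a, b)
        for a, b in combinations(sorted(matrix), 2)
    ]
    return sorted(scored, reverse=True)


# Hypothetical (disease, gene) relations; a repeated pair means the relation
# was found more than once in the literature.
relations = [
    ("D1", "G1"), ("D1", "G1"), ("D1", "G2"),
    ("D2", "G1"), ("D2", "G2"),
    ("D3", "G3"), ("D3", "G4"),
    ("D4", "G5"), ("D4", "G6"),
]
matrix = build_count_matrix(relations)
features = sorted({feature for _, feature in relations})

if __name__ == "__main__":
    for score, a, b in rank_pairs(matrix, features):
        print(a, b, round(score, 3))
```

In this toy data `D1` and `D2` share both genes and rank first; the highest-ranked pairs would then serve as edges of a disease network. Note that mutual information on binary indicators also rewards mutually exclusive feature sets, so pairs with no shared gene still receive a small non-zero score under this assumption.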

Materials

In this paper, we propose a method that calculates disease–disease similarity and develop a disease network from disease pairs with high similarity. We extracted 36,686 disease–gene relations and 25,721 disease–drug relations from 9,803,245 MEDLINE abstracts published between 1980 and 2012, using 27,850 disease names, 61,304 gene symbols and 9,388 drug names from the PharmGKB database. After constructing disease–gene matrix and disease–drug matrix with these relations, similarities of 3,353,503

Results

The proposed method can calculate similarities for 3,353,503 disease pairs. We summarized these similarities by their mean, median, minimum, maximum, and standard deviation (Table 2).

The statistics indicate that the similarities are generally very low. Because the mean is much larger than the median, the distribution must also contain high-similarity outliers, and we therefore assume that these outliers are significant. We investigated the trend of the similarities to identify the outliers (Fig. 5).
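The reasoning from "mean much larger than median" to "a few high outliers" can be checked on a toy example. The sketch below computes the same five summary statistics on hypothetical scores; the values are invented for illustration and are not the similarities reported in Table 2.

```python
import statistics


def summarize(values):
    """Return the five summary statistics used to describe the similarities."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical scores: most pairs are near zero, a handful are very high.
scores = [0.01] * 95 + [0.9] * 5
summary = summarize(scores)

if __name__ == "__main__":
    for name, value in summary.items():
        print(name, round(value, 4))
```

Here 95% of the scores equal the median, yet the five large values pull the mean to several times the median, which is the pattern the paragraph above describes.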

Discussion

The proposed method extracts disease–gene relations and disease–drug relations from the literature and uses the frequencies of occurrence of these relations as the feature values of diseases. Disease–disease similarities are then calculated from these feature values. The proposed method discovered a larger number of answer disease pairs than other comparable methods, and manual inspection of the top-ranking disease pairs also found many actual disease pairs. We presume that our method showed good

Conclusions

We calculated disease–disease similarity using literature data. Our method discovered a larger number of answer disease pairs than other comparable methods, and manual inspection confirmed that 15 of the top 20 disease pairs have actual relations. Moreover, we constructed a literature-driven disease network from the top 167 disease pairs. We presume that our method showed good results because of using literature data, using all possible gene symbols and drug names for features of a

Conflict of interest

None declared.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2015R1A2A1A05001845). We also appreciate Mr. Junsik Kim's proofreading efforts.

References (69)

  • Y.I. Liu et al.

    The “etiome”: identification and clustering of human disease etiological factors

    BMC Bioinform.

    (2009)
  • S. Suthram et al.

    Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets

    PLOS Comput. Biol.

    (2010)
  • S. Mathur et al.

    Finding disease similarity based on implicit semantic similarity

    J. Biomed. Inform.

    (2011)
  • J. Li et al.

    DOSim: an R package for similarity between diseases based on disease ontology

    BMC Bioinform.

    (2011)
  • P. Resnik

    Using information content to evaluate semantic similarity in a taxonomy

  • J.J. Jiang et al.

    Semantic similarity based on corpus statistics and lexical taxonomy

  • D. Lin

    An Information-theoretic definition of similarity

  • F.M. Couto et al.

    Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors

  • A. Schlicker et al.

    A new measure for functional similarity of gene products based on gene ontology

    BMC Bioinform.

    (2006)
  • C. Pesquita et al.

    Evaluating GO-based semantic similarity measures

  • J.Z. Wang et al.

    A new method to measure the semantic similarity of GO terms

    Bioinformatics

    (2007)
  • B. Li et al.

    Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins

  • L. Cheng et al.

    SemFunSim: a new method for measuring disease similarity by integrating semantic and gene functional association

    PLOS ONE

    (2014)
  • K. Goh et al.

    The human disease network

    Proc. Natl. Acad. Sci.

    (2007)
  • D. Lee et al.

    The implications of human metabolic network topology for disease comorbidity

    Proc. Natl. Acad. Sci.

    (2008)
  • M.A. van Driel et al.

    A text-mining analysis of the human phenome

    Eur. J. Hum. Genet.

    (2006)
  • M.B. Hamaneh et al.

    Relating diseases by integrating gene associations and information flow through protein interaction network

    PLOS ONE

    (2014)
  • W.H. Press et al.

    Conditional entropy and mutual information

    Numerical Recipes 3rd Edition: The Art of Scientific Computing

    (2007)
  • X. Zhang et al.

    Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information

    Bioinformatics

    (2012)
  • S. Pakhomov et al.

    Semantic similarity and relatedness between clinical terms: an experimental study

  • R.A. Weiss

    How does HIV cause AIDS?

    Science

    (1993)
  • D.C. Douek et al.

    Emerging concepts in the immunopathogenesis of AIDS

    Annu. Rev. Med.

    (2009)
  • I. Sekigawa et al.

    HIV infection and SLE: their pathogenic relationship

    Clin. Exp. Rheumatol.

    (1998)
  • F. Kaliyadan

    HIV and lupus erythematosus: a diagnostic dilemma

    Indian J. Dermatol.

    (2008)