DisGeReExT: a knowledge discovery system for exploration of disease–gene associations through large-scale literature-wide analysis study

  • Regular Paper
Knowledge and Information Systems

Abstract

Objective: Advanced experimental methods such as next-generation sequencing (NGS) have produced a large number of candidate genetic biomarkers and disease-associated gene variants, which are reported as research outputs across the scientific literature. Elucidating novel biomarkers and therapeutic candidates from this large body of literature requires sophisticated, knowledge-driven text mining frameworks.

Materials and Methods: This paper presents the DisGeReExT Web server for performing a literature-wide analysis study (LWAS). It extracts direct (explicit) gene–disease associations using joint ensemble learning and indirect (implicit) associations using concept profiling based on the ABC principle, in order to prioritize and rationalize potentially informative discoveries about the genetic role in diseases. In addition, we ranked the informative sentences using a scoring model and calculated disease–disease similarity using functional association among shared genes.

Results: From the complete MEDLINE corpus dated September 2020, comprising 28 million records, DisGeReExT identified a total of 2,237,545 gene–disease associations and 2,851,662 disease–disease similarities.

Discussion: DisGeReExT was able to extract informative sentences related to both diseases and genes at large scale. It also explored the gene–disease associations of two diseases, Alzheimer’s disease and liver carcinoma, and identified the top 10 associated genes and diseases for each.

Conclusion: Overall, we strongly believe that our large-scale automated approach for knowledge discovery of gene-associated diseases from literature could provide new insights into genetic mechanisms and disease etiology, and can play a pivotal role in translational research, drug discovery, and drug repurposing.
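As an illustration only, the two ideas named in the abstract (implicit associations via the ABC principle, and disease–disease similarity over shared genes) can be sketched in a few lines of Python. This is a toy sketch, not the DisGeReExT implementation. The concept names, gene names, and the functions `implicit_candidates` and `jaccard` are hypothetical. The use of the Jaccard index as the similarity measure is an assumption made here for simplicity, since the abstract does not state which measure the system uses.

```python
from collections import defaultdict

# Toy data, invented for illustration only (not taken from the paper).
# A -> B: intermediate concepts that co-occur with a disease.
A_TO_B = {"disease_X": {"inflammation", "oxidative_stress"}}
# B -> C: genes that co-occur with each intermediate concept.
B_TO_C = {
    "inflammation": {"GENE1", "GENE2"},
    "oxidative_stress": {"GENE2", "GENE3"},
}
# Known direct (explicit) disease-gene links, excluded from the implicit candidates.
KNOWN_AC = {"disease_X": {"GENE1"}}


def implicit_candidates(a_to_b, b_to_c, known_ac):
    """ABC principle: propose C terms linked to A only through shared B terms.

    Returns, per A term, a list of (C term, supporting B terms) sorted by the
    number of supporting B terms (descending), then by name.
    """
    result = {}
    for a, b_terms in a_to_b.items():
        support = defaultdict(set)
        for b in b_terms:
            for c in b_to_c.get(b, ()):
                if c not in known_ac.get(a, set()):
                    support[c].add(b)
        result[a] = sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return result


def jaccard(genes_a, genes_b):
    """Jaccard index of two gene sets (assumed similarity measure)."""
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0


if __name__ == "__main__":
    for disease, ranked in implicit_candidates(A_TO_B, B_TO_C, KNOWN_AC).items():
        for gene, via in ranked:
            print(disease, gene, "via", sorted(via))
    print(jaccard({"GENE1", "GENE2"}, {"GENE2", "GENE3"}))
```

In this toy setup, a gene supported by more intermediate concepts ranks higher. A real system would replace the raw count with weighted concept profiles and statistical filtering.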

Data availability

DisGeReExT is freely available at http://14.139.186.184/DisGeReExT/.

Acknowledgements

This work was supported by DRDO—BU Centre for Life Sciences, Coimbatore, Tamil Nadu, India. BB acknowledges the fellowship received from the grant.

Author information

Contributions

All authors have made a substantial, direct, and intellectual contribution to this study.

Corresponding author

Correspondence to Jeyakumar Natarajan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bhasuran, B., Natarajan, J. DisGeReExT: a knowledge discovery system for exploration of disease–gene associations through large-scale literature-wide analysis study. Knowl Inf Syst 65, 3463–3487 (2023). https://doi.org/10.1007/s10115-023-01862-1

  • DOI: https://doi.org/10.1007/s10115-023-01862-1
