DOI: 10.1145/3535508.3545531
research-article

ArcheGEO: towards improving relevance of gene expression omnibus search results

Published: 07 August 2022

ABSTRACT

Transcriptomic data stored in the Gene Expression Omnibus (GEO) serves thousands of queries per day, but a lack of standardized machine-readable metadata causes many searches to return irrelevant hits, which impede convenient access to useful data in the GEO repository. Here, we describe ArcheGEO, a novel end-to-end framework that improves results from the GEO Browser by automatically determining the relevance of these results. Unlike existing tools, ArcheGEO reports on the irrelevant results and provides reasoning for their exclusion. Such reasoning can be leveraged to improve annotations of metadata.


  • Published in

    BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
    August 2022
    549 pages
    ISBN:9781450393867
    DOI:10.1145/3535508

    Copyright © 2022 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 254 of 885 submissions, 29%