ABSTRACT
Transcriptomic data stored in the Gene Expression Omnibus (GEO) serves thousands of queries per day, but a lack of standardized machine-readable metadata causes many searches to return irrelevant hits, which impedes convenient access to useful data in the GEO repository. Here, we describe ArcheGEO, a novel end-to-end framework that improves results from the GEO Browser by automatically determining their relevance. Unlike existing tools, ArcheGEO also reports the irrelevant results and provides the reasoning for their exclusion. Such reasoning can be leveraged to improve metadata annotations.