research-article

ArcheGEO: towards improving relevance of gene expression omnibus search results

Authors:
Huey-Eng Chua, Nanyang Technological University, Singapore
Lisa Tucker-Kellogg, Duke-NUS Medical School, Singapore
Sourav S Bhowmick, Nanyang Technological University, Singapore

BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, August 2022, Article No.: 4, Pages 1–10, https://doi.org/10.1145/3535508.3545531

Published: 07 August 2022

ABSTRACT

Transcriptomic data stored in the Gene Expression Omnibus (GEO) serves thousands of queries per day, but a lack of standardized machine-readable metadata causes many searches to return irrelevant hits, which impedes convenient access to useful data in the GEO repository. Here, we describe ArcheGEO, a novel end-to-end framework that improves results from the GEO Browser by automatically determining their relevance. Unlike existing tools, ArcheGEO reports on the irrelevant results and provides reasoning for their exclusion. Such reasoning can be leveraged to improve annotations of metadata.
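
To make the idea of "reporting irrelevant results with reasoning" concrete, the following is a minimal sketch of a filter that splits search hits into kept and excluded sets and attaches a reason to each verdict. It is not ArcheGEO's method, which the abstract does not describe. The field names (`title`, `summary`, `contact`), the `TRUSTED_FIELDS` rule, the function names, and the placeholder accessions are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical rule: only these metadata fields count as evidence of relevance.
TRUSTED_FIELDS = ("title", "summary")


@dataclass
class Verdict:
    accession: str
    relevant: bool
    reason: str


def judge(record: dict[str, str], term: str) -> Verdict:
    """Classify one search hit and say why (toy rule, not ArcheGEO's method)."""
    needle = term.lower()
    # Fields (other than the identifier) whose text contains the query term.
    hits = [f for f, v in record.items() if f != "accession" and needle in v.lower()]
    trusted = [f for f in hits if f in TRUSTED_FIELDS]
    if trusted:
        return Verdict(record["accession"], True, f"term found in {', '.join(trusted)}")
    if hits:
        return Verdict(record["accession"], False, f"term found only in {', '.join(hits)}")
    return Verdict(record["accession"], False, "term not found in any field")


def partition(records: list[dict[str, str]], term: str):
    """Return (kept, excluded) verdicts so excluded hits stay visible with reasons."""
    verdicts = [judge(r, term) for r in records]
    kept = [v for v in verdicts if v.relevant]
    excluded = [v for v in verdicts if not v.relevant]
    return kept, excluded


if __name__ == "__main__":
    # Made-up records with placeholder accessions; these are not real GEO entries.
    demo = [
        {"accession": "X1", "title": "Lung cancer cohort", "summary": "tumour samples"},
        {"accession": "X2", "title": "Liver study", "summary": "hepatocytes",
         "contact": "Lung Cancer Institute"},
    ]
    kept, excluded = partition(demo, "lung cancer")
    for v in excluded:
        print(v.accession, "excluded:", v.reason)
```

Returning the excluded verdicts alongside the kept ones, rather than silently dropping them, is what lets a curator see which metadata field caused a spurious match and correct the annotation.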

Published in

BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
August 2022
549 pages
ISBN:9781450393867
DOI:10.1145/3535508

Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 254 of 885 submissions, 29%