DOI: 10.1145/2618243.2618244
research-article

Maintaining a microbial genome & metagenome data analysis system in an academic setting

Published: 30 June 2014

Abstract

The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental, and host-associated metagenomes. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are then integrated into the data warehouse using IMG's data integration toolkit. Application-specific user interfaces for microbial genomes and metagenomes provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene-centric, iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with individual operations expected to have rapid response time.
From its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes to 22,578 bacterial, archaeal, eukaryotic, and viral genomes and 4,188 metagenome samples, with about 24.6 billion genes as of May 1st, 2014. IMG's database architecture is continuously revised in order to cope with the rapid increase in the number and size of the genome and metagenome datasets, maintain good query performance, and accommodate new data types. In this paper we present IMG's new database architecture, developed over the past three years in the context of the limited financial, engineering, and data management resources customary to academic database systems. We discuss the alternative commercial and open-source database management systems we considered and experimented with, and describe the hybrid architecture we devised for sustaining IMG's rapid growth.

    Published In

    SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database Management
    June 2014
    417 pages
    ISBN:9781450327220
    DOI:10.1145/2618243
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data warehouse
    2. genome data analysis system

    Qualifiers

    • Research-article

    Conference

    SSDBM '14

    Acceptance Rates

    SSDBM '14 Paper Acceptance Rate 26 of 71 submissions, 37%;
    Overall Acceptance Rate 56 of 146 submissions, 38%

    Cited By

    • (2023) Bioinformatics Analysis Tools for Studying Microbiomes at the DOE Joint Genome Institute. Journal of the Indian Institute of Science 103:3 (857-875). DOI: 10.1007/s41745-023-00365-w. Online publication date: 30-May-2023
    • (2018) IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic Acids Research. DOI: 10.1093/nar/gky901. Online publication date: 5-Oct-2018
    • (2016) IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Research 45:D1 (D507-D516). DOI: 10.1093/nar/gkw929. Online publication date: 13-Oct-2016
    • (2015) BaMBa: towards the integrated management of Brazilian marine environmental data. Database 2015 (bav088). DOI: 10.1093/database/bav088. Online publication date: 10-Oct-2015
    • (2015) Ten Years of Maintaining and Expanding a Microbial Genome and Metagenome Analysis System. Trends in Microbiology 23:11 (730-741). DOI: 10.1016/j.tim.2015.07.012. Online publication date: Nov-2015
