DOI: 10.1145/2618243.2618244

Maintaining a microbial genome & metagenome data analysis system in an academic setting

Published: 30 June 2014

ABSTRACT

The Integrated Microbial Genomes (IMG) system integrates microbial community aggregate genomes (metagenomes) with genomes from all domains of life. IMG provides tools for analyzing and reviewing the structural and functional annotations of metagenomes and genomes in a comparative context. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets provided by scientific users, as well as public bacterial, archaeal, eukaryotic, and viral genomes from the US National Center for Biotechnology Information genomic archive and a rich set of engineered, environmental, and host-associated metagenomes. Genome and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are then integrated into the data warehouse using IMG's data integration toolkit. Application-specific user interfaces for microbial genomes and metagenomes provide access to different subsets of IMG's data and analysis toolkits. Genome and metagenome analysis is a gene-centric, iterative process that involves a sequence (composition) of data exploration and comparative analysis operations, with each individual operation expected to respond rapidly.
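As a rough illustration of what a gene-centric composition of operations could look like, the following minimal sketch chains an exploration step (select genes by function) into a comparative step (find the genomes that contain those genes). Everything here is an assumption made for illustration: the `gene` table, its columns, the sample rows, and the helper names `genes_with_function` and `genomes_of` are hypothetical and do not reflect IMG's actual schema, data, or interfaces.

```python
import sqlite3

# Hypothetical toy schema and data; not IMG's actual data model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, genome TEXT, function TEXT);
    INSERT INTO gene (genome, function) VALUES
        ('genome_A', 'nitrogenase'),
        ('genome_A', 'ATP synthase'),
        ('genome_B', 'ATP synthase'),
        ('genome_C', 'nitrogenase'),
        ('genome_C', 'ATP synthase');
""")


def genes_with_function(function):
    """Exploration step: select the genes annotated with a given function."""
    rows = conn.execute(
        "SELECT gene_id FROM gene WHERE function = ? ORDER BY gene_id",
        (function,),
    )
    return [gene_id for (gene_id,) in rows]


def genomes_of(gene_ids):
    """Comparative step: list the genomes that contain the selected genes."""
    if not gene_ids:
        return []
    placeholders = ",".join("?" * len(gene_ids))
    rows = conn.execute(
        f"SELECT DISTINCT genome FROM gene "
        f"WHERE gene_id IN ({placeholders}) ORDER BY genome",
        gene_ids,
    )
    return [genome for (genome,) in rows]


# Compose the two operations: the output of one feeds the next.
selected = genes_with_function("nitrogenase")
profile = genomes_of(selected)
print(profile)
```

Each step is a small, independent query, so a user can inspect the intermediate gene selection before moving on; this is only one plausible way to read the "sequence (composition) of operations" described above.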

Since its first release in 2005, IMG has grown from an initial content of about 300 genomes with a total of 2 million genes to 22,578 bacterial, archaeal, eukaryotic, and viral genomes and 4,188 metagenome samples, with about 24.6 billion genes as of May 1, 2014. IMG's database architecture is continuously revised to cope with the rapid increase in the number and size of genome and metagenome datasets, maintain good query performance, and accommodate new data types. In this paper we present IMG's new database architecture, developed over the past three years under the limited financial, engineering, and data management resources customary for academic database systems. We discuss the alternative commercial and open source database management systems we considered and experimented with, and describe the hybrid architecture we devised for sustaining IMG's rapid growth.

Published in

SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database Management
June 2014
417 pages
ISBN: 9781450327220
DOI: 10.1145/2618243

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SSDBM '14 paper acceptance rate: 26 of 71 submissions (37%). Overall acceptance rate: 56 of 146 submissions (38%).
