Highly scalable genome assembly on campus grids

Published: 16 November 2009

Abstract

Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable, modular genome assembler that achieves significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit and replaces two independent sequential modules with scalable counterparts: a scalable candidate selector exploits the distributed memory capacity of a campus grid, while a scalable aligner exploits its distributed computing capacity. For large problems, these modules provide robust task and data management while achieving speedup with high efficiency at several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments, using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
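
The speedup and efficiency figures quoted above can be related with a short calculation. The sketch below assumes the conventional definition of parallel efficiency, speedup divided by the number of workers; the abstract does not state which definition the paper uses, so the implied worker count is an illustration and not a reported result. The function names are hypothetical.

```python
def parallel_efficiency(speedup: float, workers: float) -> float:
    """Efficiency under the conventional definition: speedup / workers.

    Assumption: this is the textbook definition, not necessarily the
    exact one used in the paper.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


def implied_workers(speedup: float, efficiency: float) -> float:
    """Invert the definition above: workers = speedup / efficiency."""
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    return speedup / efficiency


if __name__ == "__main__":
    # Figures quoted in the abstract: 927x speedup at 71.3 percent efficiency.
    n = implied_workers(927.0, 0.713)
    print(f"implied worker count: {n:.0f}")
```

Under this assumed definition, the quoted figures correspond to roughly 1,300 workers, which is in line with the abstract's "more than a thousand nodes".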

Published In

MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009
131 pages
ISBN: 9781605587141
DOI: 10.1145/1646468

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '09