research-article

High-performance genomic analysis framework with in-memory computing

Authors:

Ninghui Sun

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 317 - 328

https://doi.org/10.1145/3178487.3178511

Published: 10 February 2018

Abstract

In this paper, we propose GPF, an in-memory computing framework that provides a set of genomic formats, APIs, and a fast genomic engine for large-scale genomic data processing. GPF comprises two main components: (1) scalable genomic data formats and APIs, and (2) an advanced execution engine that supports efficient compression of genomic data and eliminates redundancies within the engine. We further present both system-specific and algorithm-specific implementations that let users build genomic analysis pipelines without prior knowledge of Spark parallel programming. To test the performance of GPF, we built a Whole-Genome-Sequencing (WGS) pipeline on top of it as a test case. Our experimental data indicate that GPF completes WGS analysis of 146.9G bases of the Human Platinum Genome in 24 minutes, with over 50% parallel efficiency on 2048 CPU cores. Overall, GPF provides a fast and general engine for large-scale genomic data processing that supports in-memory computing.
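To make the reported "over 50% parallel efficiency" concrete, the sketch below computes relative parallel efficiency as speedup divided by the increase in core count. This is the conventional definition and is assumed here; the abstract does not state which baseline configuration the authors measured against. The function name and the baseline figures (128 cores, 200 minutes) are hypothetical and chosen only for illustration; the 24 minutes and 2048 cores come from the abstract.

```python
def parallel_efficiency(t_base_min, p_base, t_par_min, p_par):
    """Relative parallel efficiency: speedup divided by the core-count ratio.

    A value of 1.0 means perfect scaling from p_base to p_par cores.
    """
    speedup = t_base_min / t_par_min
    core_ratio = p_par / p_base
    return speedup / core_ratio


# Figures taken from the abstract.
T_PAR_MIN = 24.0   # reported WGS running time in minutes
P_PAR = 2048       # reported number of CPU cores

# Hypothetical baseline, NOT from the paper: 128 cores taking 200 minutes.
T_BASE_MIN = 200.0
P_BASE = 128

efficiency = parallel_efficiency(T_BASE_MIN, P_BASE, T_PAR_MIN, P_PAR)
print(f"relative parallel efficiency: {efficiency:.1%}")

# Core-minutes consumed by the reported run (a property of the stated figures).
print(f"core-minutes at {P_PAR} cores: {T_PAR_MIN * P_PAR:.0f}")
```

Under this definition, an efficiency above 50% means the 2048-core run spends less than twice the total core-time of the baseline run.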

Index Terms

High-performance genomic analysis framework with in-memory computing
1. Applied computing
  1. Life and medical sciences
    1. Computational biology
      1. Computational genomics
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles

Published In

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2018

442 pages

ISBN:9781450349826

DOI:10.1145/3178487

General Chair: Andreas Krall, Vienna University of Technology, Austria
Program Chair: Thomas R. Gross, ETH Zürich, Switzerland

ACM SIGPLAN Notices Volume 53, Issue 1
PPoPP '18
January 2018
426 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3200691
Editor: Matthew Fluet, Rochester Institute of Technology

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

The National Key Research and Development Program of China
National Natural Science Foundation of China

Conference

PPoPP '18

PPoPP '18: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
