DOI: 10.1145/3178487.3178511

High-performance genomic analysis framework with in-memory computing

Published: 10 February 2018

Abstract

In this paper, we propose an in-memory computing framework, called GPF, that provides a set of genomic formats, APIs, and a fast genomic engine for large-scale genomic data processing. GPF comprises two main components: (1) scalable genomic data formats and APIs, and (2) an advanced execution engine that supports efficient compression of genomic data and eliminates redundancies during execution. We further present both system-specific and algorithm-specific implementations that let users build genomic analysis pipelines without any knowledge of Spark parallel programming. To evaluate the performance of GPF, we built a Whole-Genome-Sequencing (WGS) pipeline on top of it as a test case. Our experimental data indicate that GPF completes WGS analysis of 146.9G bases of the Human Platinum Genome in 24 minutes, with over 50% parallel efficiency on 2048 CPU cores. Overall, GPF provides a fast and general engine for large-scale genomic data processing that supports in-memory computing.
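To put the headline numbers in perspective, the following is a small illustrative calculation, not code from GPF. It derives the aggregate and per-core throughput implied by the figures quoted in the abstract (146.9G bases, 24 minutes, 2048 cores), and it defines parallel efficiency in the usual strong-scaling sense. The abstract does not state the baseline run used for the 50% figure, so the baseline values passed to `parallel_efficiency` below are purely hypothetical, and the function names are our own.

```python
def aggregate_throughput(total_bases: float, minutes: float) -> float:
    """Bases processed per second across the whole cluster."""
    return total_bases / (minutes * 60.0)


def parallel_efficiency(t_base: float, cores_base: int,
                        t_scaled: float, cores_scaled: int) -> float:
    """Strong-scaling efficiency of a scaled run relative to a baseline run.

    Computed as (t_base * cores_base) / (t_scaled * cores_scaled);
    1.0 means perfect scaling.
    """
    return (t_base * cores_base) / (t_scaled * cores_scaled)


# Figures quoted in the abstract.
TOTAL_BASES = 146.9e9
RUNTIME_MINUTES = 24.0
CORES = 2048

throughput = aggregate_throughput(TOTAL_BASES, RUNTIME_MINUTES)
per_core = throughput / CORES

print(f"aggregate: {throughput / 1e6:.1f} million bases/s")
print(f"per core:  {per_core:.0f} bases/s")

# Hypothetical baseline (NOT from the paper): a run that takes 100 time
# units on 1 core and 25 time units on 8 cores has 50% efficiency.
example_efficiency = parallel_efficiency(100.0, 1, 25.0, 8)
print(f"example efficiency: {example_efficiency:.0%}")
```

Under these figures the cluster processes roughly 102 million bases per second in aggregate, or about 50 thousand bases per second per core.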

Published In

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2018
442 pages
ISBN:9781450349826
DOI:10.1145/3178487
  • ACM SIGPLAN Notices, Volume 53, Issue 1
    PPoPP '18
    January 2018
    426 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/3200691
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. genomic analysis framework
  2. high-performance computing
  3. in-memory computing

Qualifiers

  • Research-article

Funding Sources

  • The National Key Research and Development Program of China
  • National Natural Science Foundation of China

Conference

PPoPP '18

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 15
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 03 Mar 2025

Cited By

  • (2023) NvWa: Enhancing Sequence Alignment Accelerator Throughput via Hardware Scheduling. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1236-1248. DOI: 10.1109/HPCA56546.2023.10070978. Online publication date: Feb-2023
  • (2023) OPERA-gSAM: Big Data Processing Framework for UMI Sequencing at High Scalability and Efficiency. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), 160-167. DOI: 10.1109/CCGridW59191.2023.00038. Online publication date: May-2023
  • (2021) scSpark XMBD: High-Performance scRNA-seq Data Processing with Spark. 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 1956-1962. DOI: 10.1109/BIBM52615.2021.9669512. Online publication date: 9-Dec-2021
  • (2020) Scaling Genomics Data Processing with Memory-Driven Computing to Accelerate Computational Biology. High Performance Computing, 328-344. DOI: 10.1007/978-3-030-50743-5_17. Online publication date: 22-Jun-2020
  • (2018) Accelerating FM-index Search for Genomic Data Processing. Proceedings of the 47th International Conference on Parallel Processing, 1-12. DOI: 10.1145/3225058.3225134. Online publication date: 13-Aug-2018
  • (2018) Memory Coalescing for Hybrid Memory Cube. Proceedings of the 47th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3225058.3225062. Online publication date: 13-Aug-2018
