Abstract
Genome Informatics (GI) involves accurate computational investigation of strongly correlated subsystems, which demands interdisciplinary approaches to problem solving. With the volume of genomic sequencing data growing at an alarming rate, High Performance Computing (HPC) solutions offer the right platform to address the computational needs. GI requires algorithm-architecture co-design of parallel and accelerated biocomputing, involving reconfigurable hardware such as FPGAs and graphics accelerators (GPUs), to bridge the gap between growing data volumes and compute capabilities. Such platforms offer high degrees of parallelism and scalability while accelerating the multi-stage GI computational pipeline. Even with such computing power, it is the choice of algorithms and implementations across the entire GI pipeline that decides the precision of biocomputing in revealing biologically relevant information. In this paper, we present ReneGENE-GI, an innovatively engineered GI pipeline, and detail the performance analysis of its Comparative Genomics Module (CGM), the compute-intensive stage of the pipeline. This module comes in two flavours, designed to run on GPUs and FPGAs respectively, hosted on HPC platforms. The pipeline uses an efficient reference indexing algorithm based on the dynamic Monotonic Minimal Perfect Hashing Function (MMPH), allowing absolute indexing of the reference genome and thus avoiding heuristics. Alignment time for our FPGA version is about one-tenth of the time taken by our single-GPU implementation, which itself is 2.62x faster than CUSHAW2-GPU (the GPU CUDA implementation of CUSHAW). With the single-GPU implementation demonstrating a speedup of more than 150x over standard heuristic aligners such as BFAST, the FPGA version of our CGM is several orders of magnitude faster than the competitors, offering precision over heuristics.
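To make the idea of absolute indexing concrete, the following is a minimal sketch of a monotone minimal perfect hash over the distinct k-mers of a reference. It is not the dynamic MMPH used in ReneGENE-GI, whose construction is not described in the abstract. The names `build_index` and `mmph_rank`, the toy reference string and the choice of k are all hypothetical. The sketch simply ranks each k-mer in lexicographic order, which is collision-free (perfect), maps n keys onto 0..n-1 (minimal) and preserves key order (monotone). A practical MMPH would avoid storing the keys explicitly.

```python
from bisect import bisect_left


def build_index(reference: str, k: int):
    """Collect every distinct k-mer of the reference and where it occurs.

    Returns (keys, positions): keys is the lexicographically sorted list of
    distinct k-mers; positions maps each k-mer to its start offsets.
    Illustrative only: a real MMPH does not keep the keys in memory.
    """
    positions = {}
    for start in range(len(reference) - k + 1):
        kmer = reference[start:start + k]
        positions.setdefault(kmer, []).append(start)
    keys = sorted(positions)
    return keys, positions


def mmph_rank(keys, kmer: str):
    """Return the rank of kmer among the sorted keys, or None if absent.

    The rank is a monotone minimal perfect hash of the key set: distinct
    keys get distinct values in 0..len(keys)-1, in lexicographic order.
    """
    idx = bisect_left(keys, kmer)
    if idx < len(keys) and keys[idx] == kmer:
        return idx
    return None


# Toy usage (hypothetical reference and k, chosen only for illustration).
keys, positions = build_index("ACGTACGA", 3)
rank = mmph_rank(keys, "CGT")   # rank of an indexed k-mer
hits = positions["ACG"]         # every exact occurrence, no heuristic filtering
miss = mmph_rank(keys, "TTT")   # None: k-mer not present in the reference
```

Because every k-mer of the reference receives a unique slot, a lookup of this kind either returns all exact occurrences or reports absence, which is the sense in which such an index can avoid seed-selection heuristics.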
Cite this article
Natarajan, S., N., K.K., Pal, D. et al. Towards Accelerated Genome Informatics on Parallel HPC Platforms: The ReneGENE-GI Perspective. J Sign Process Syst 92, 1197–1213 (2020). https://doi.org/10.1007/s11265-019-01452-x
DOI: https://doi.org/10.1007/s11265-019-01452-x