Abstract
Single Molecule Real-Time (SMRT) sequencing is a prominent topic in third-generation sequencing technology. Compared with next-generation sequencing, SMRT can detect single molecules and produces much longer reads, which also leads to a huge increase in the amount of data. As single-CPU performance has reached a bottleneck, single-node computing falls far short of SMRT sequencing requirements. An alternative solution is parallel computing, which runs the alignment algorithm on multiple computing nodes and thus greatly decreases the running time. The Regional Hashing-based Alignment Tool (rHAT) is a novel approach developed especially for SMRT sequencing. It offers better sensitivity and improved correctness compared with existing sequence alignment tools. However, the original rHAT source code can only run on a single node, which dramatically limits its performance. In this article, we present PrHAT, a parallel version of rHAT. We test PrHAT on the simulated and real datasets that the original rHAT used. Our results show that PrHAT reduces the computing wall-time from nearly an hour to several minutes. As the number of nodes increases from 2 to 16 when aligning large-scale datasets, PrHAT achieves speedups of 1.94–14.87x. The parallel efficiency decreases from 97% to 93%; moreover, its weak scaling remains almost unchanged. Based on PrHAT, we developed OpenPrHAT. Its performance is similar to that of PrHAT, but it can also run on other computing devices in the platform, such as GPUs. We expect that the implementation of PrHAT will promote the development of SMRT in third-generation sequencing technology.
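The relationship between the reported speedups and parallel efficiencies can be checked with a short calculation. The sketch below assumes the usual definition of parallel efficiency (speedup divided by the number of nodes) and a single-node baseline; the abstract does not state either explicitly, and the function name `parallel_efficiency` is purely illustrative.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return speedup divided by node count (1.0 means ideal linear scaling).

    Assumption: speedup is measured against a single-node run.
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Speedups quoted in the abstract for the smallest and largest node counts.
reported = {2: 1.94, 16: 14.87}

efficiencies = {n: parallel_efficiency(s, n) for n, s in reported.items()}

for n, e in efficiencies.items():
    print(f"{n:>2} nodes: speedup {reported[n]:.2f}x, efficiency {e:.1%}")
```

Under these assumptions the two endpoints work out to about 97% and 93%, which is consistent with the efficiency range quoted in the abstract.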
Acknowledgments
This work was supported by the National Key R&D Program of China (2020YFA0709803, 2018YFB0204301) and NSFC Grant 62102427. The funding bodies did not influence the design of the study, data collection, analysis, or interpretation, or the writing of the manuscript.
Appendix
All source code of PrHAT can be found at:
https://drive.google.com/drive/folders/1OLjYANWXHz6b22sfdf7Mqv6vm1zilB69?usp=sharing.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Xia, Z. et al. (2022). Large-Scale Parallel Alignment Algorithm for SMRT Reads. In: Lai, Y., Wang, T., Jiang, M., Xu, G., Liang, W., Castiglione, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2021. Lecture Notes in Computer Science, vol 13156. Springer, Cham. https://doi.org/10.1007/978-3-030-95388-1_14
DOI: https://doi.org/10.1007/978-3-030-95388-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-95387-4
Online ISBN: 978-3-030-95388-1
eBook Packages: Computer Science, Computer Science (R0)