Abstract:
The amount of genetic data generated by Next Generation Sequencing (NGS) technologies grows faster than Moore's law. This necessitates the development of efficient NGS da...Show MoreMetadata
Abstract:
The amount of genetic data generated by Next Generation Sequencing (NGS) technologies grows faster than Moore's law. This necessitates the development of efficient NGS data processing and analysis algorithms. A filter before the computationally-costly analysis step can significantly reduce the run time of the NGS data analysis. As GPUs are orders of magnitude more powerful than CPUs, this paper proposes a GPU-friendly pre-align filtering algorithm named SeedHit for the fast processing of NGS data. Inspired by BLAST, SeedHit counts seed hits between two sequences to determine their similarity. In SeedHit, a nucleic acid in a gene sequence is presented in binary format. By packaging data and generating a lookup table that fits into the L1 cache, SeedHit is GPU-friendly and high-throughput. Using three 16 s rRNA datasets from Greengenes as input SeedHit can reject 84%–89% dissimilar sequence pairs on average when the similarity is 0.9–0.99. The throughput of SeedHit achieved 1 T/s (Tera base per second) on 3080 Ti. Compared with the other two GPU-based filtering algorithms, GateKeeper and SneakySnake, SeedHit has the highest rejection rate and throughput. By incorporating SeedHit into our in-house clustering algorithm nGIA, the modified nGIA achieved a 1.6–2.1 times speedup compared to the original version.
Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics ( Volume: 21, Issue: 6, Nov.-Dec. 2024)