An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings


Abstract:

This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity at O(kn), where n is the string length. When k>0, which is the hard case, our new proposal significantly improves the any-case O(n^2) time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200 MB for any k>0, while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is bot...
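
The abstract does not restate the problem definition, so the following is only an illustrative sketch under an assumed, commonly used definition: a substring is k-mismatch unique if no other substring of the same length, starting at a different position, is within Hamming distance k of it, and the k-mismatch shortest unique substring for a position p is the shortest such substring covering p. The function names below are hypothetical, and this is a naive brute-force reference (far slower than the O(n log^k n) expected time the paper reports), not the paper's algorithm.

```python
def within_k_mismatches(s, i, j, length, k):
    """Return True if s[i:i+length] and s[j:j+length] differ in at most k positions."""
    mismatches = 0
    for t in range(length):
        if s[i + t] != s[j + t]:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def is_k_mismatch_unique(s, i, length, k):
    """Assumed definition: no other start position yields a substring within k mismatches."""
    last_start = len(s) - length
    return not any(
        j != i and within_k_mismatches(s, i, j, length, k)
        for j in range(last_start + 1)
    )


def k_mismatch_sus(s, p, k):
    """Brute-force search for the leftmost shortest k-mismatch unique substring covering p.

    Returns a half-open interval (start, end), or None if no such substring exists.
    """
    n = len(s)
    for length in range(1, n + 1):
        first = max(0, p - length + 1)   # earliest start that still covers p
        last = min(p, n - length)        # latest start that covers p and fits in s
        for i in range(first, last + 1):
            if is_k_mismatch_unique(s, i, length, k):
                return (i, i + length)
    return None


if __name__ == "__main__":
    print(k_mismatch_sus("abab", 1, 0))
    print(k_mismatch_sus("abab", 1, 1))
```

A brute-force checker like this is mainly useful as a correctness oracle on short inputs when testing a faster implementation; its running time grows far too quickly for DNA samples of the sizes mentioned above.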
Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics ( Volume: 18, Issue: 1, 01 Jan.-Feb. 2021)
Page(s): 138 - 148
Date of Publication: 22 January 2020

PubMed ID: 31985439
