DOI:10.1145/3233547.3233564
research-article

A Practical and Efficient Algorithm for the k-mismatch Shortest Unique Substring Finding Problem

Published: 15 August 2018

Abstract

This paper revisits the k-mismatch shortest unique substring finding problem. It demonstrates that a technique recently presented for solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, yielding a new algorithm with expected time complexity $O(n\log^k n)$ and a practical space complexity of $O(kn)$, where n is the string length. When $k>0$, which is the hard case, our new proposal significantly improves on the any-case $O(n^2)$ time complexity of the prior best method for k-mismatch shortest unique substring finding. An experimental study shows that our new algorithm is practical to implement and is significantly faster than the implementation of the prior best solution when k is small relative to n. For example, our method processes a 200KB sample DNA sequence with $k=1$ in just 0.18 seconds, compared to 174.37 seconds with the prior best solution. Further, significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, which yields a further significant practical performance improvement. For example, when using 8 cores, both parallel implementations achieved processing times less than $1/4$ that of the serial implementation when processing a 10MB sample DNA sequence with $k=2$. In an age where instances with thousands of gigabytes of RAM are readily available through Cloud infrastructure providers, trading additional memory usage for significantly improved processing times is likely to be desirable and needed by many users. For example, the best prior solution may take years to process a 200MB DNA sample for any $k>0$, while this new proposal, using 24 cores, can process a sample of this size with $k=1$ in $206.376$ seconds with a peak memory usage of 46GB, which is both easily available and affordable on the Cloud for many users. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those applying the measure to long sequences in fields such as computational biology.
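
The abstract does not restate the problem, so a small brute-force sketch may help fix ideas. It assumes the usual definition from the literature: a substring is k-mismatch unique if no other substring of the same length, starting at a different position, is within Hamming distance k of it, and the task is to find, for a position p, a shortest such substring covering p. The function names below are hypothetical, and this naive enumeration is neither the algorithm proposed in the paper nor the prior best method; it only illustrates what is being computed.

```python
from typing import Tuple


def hamming_at_most(s: str, i: int, j: int, length: int, k: int) -> bool:
    """True if s[i:i+length] and s[j:j+length] differ in at most k positions."""
    mismatches = 0
    for offset in range(length):
        if s[i + offset] != s[j + offset]:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def is_k_mismatch_unique(s: str, start: int, length: int, k: int) -> bool:
    """True if no other same-length substring is within Hamming distance k."""
    for other in range(len(s) - length + 1):
        if other != start and hamming_at_most(s, start, other, length, k):
            return False
    return True


def naive_k_mismatch_sus(s: str, p: int, k: int) -> Tuple[int, int]:
    """Shortest k-mismatch unique substring covering 0-based position p.

    Returns (start, end) with end exclusive; ties go to the leftmost start.
    """
    n = len(s)
    if not 0 <= p < n:
        raise ValueError("p must be a valid index into s")
    for length in range(1, n + 1):
        first = max(0, p - length + 1)   # leftmost start that still covers p
        last = min(p, n - length)        # rightmost start that fits in s
        for start in range(first, last + 1):
            if is_k_mismatch_unique(s, start, length, k):
                return (start, start + length)
    # The whole string is always unique, so the loop above always returns.
    raise AssertionError("unreachable")


print(naive_k_mismatch_sus("aab", 0, 0))  # (0, 2): "aa" occurs once
print(naive_k_mismatch_sus("aab", 2, 1))  # (0, 3): only the whole string qualifies
```

This enumeration is far slower than either bound quoted in the abstract and is only suitable for very short strings; it is meant as a reference for checking small cases by hand.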

Cited By

  • (2020) Efficient Data Structures for Range Shortest Unique Substring Queries. Algorithms 13:11 (276). DOI:10.3390/a13110276. Online publication date: 30-Oct-2020
  • (2020) A Survey on Shortest Unique Substring Queries. Algorithms 13:9 (224). DOI:10.3390/a13090224. Online publication date: 6-Sep-2020
  • (2019) Range Shortest Unique Substring Queries. String Processing and Information Retrieval (258-266). DOI:10.1007/978-3-030-32686-9_18. Online publication date: 3-Oct-2019

Published In

BCB '18: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2018
727 pages
ISBN:9781450357944
DOI:10.1145/3233547
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. hamming distance
  2. mismatch
  3. shortest unique substring
  4. string

Qualifiers

  • Research-article

Funding Sources

  • U.S. National Science Foundation

Conference

BCB '18

Acceptance Rates

BCB '18 Paper Acceptance Rate 46 of 148 submissions, 31%;
Overall Acceptance Rate 254 of 885 submissions, 29%
