Abstract
We propose a new algorithm for computing the longest prefix of each suffix of a given string of length n over a constant-sized alphabet of size \(\sigma \) that occurs elsewhere in the string with Hamming distance at most k. Specifically, we show that the proposed algorithm requires time \(\mathcal {O}(n (\sigma R)^k \log \log n (\log k+ \log \log n))\) on average, where \(R=\lceil (k+2) (\log _{\sigma } n+1) \rceil \), and space \(\mathcal {O}(n)\). This improves upon the state-of-the-art average-case time complexity for the case when \(k=1\) [23] by a factor of \(\log n / \log ^3 \log n\). In addition, we show how the proposed technique can be adapted and applied in order to compute the longest previous factors under the Hamming distance model within the same complexities. In terms of real-world applications, we show that our technique can be directly applied to the problem of genome mappability.
P. Charalampopoulos—Supported by the Graduate Teaching Scholarship scheme of the Department of Informatics at King’s College London and by the A.G. Leventis Foundation.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discret. Algorithms 2(1), 53–86 (2004)
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P., Radoszewski, J., Sung, W.-K.: Faster algorithms for 1-mappability of a sequence. In: COCOA. LNCS, vol. 10628, pp. 109–121. Springer International Publishing (2017). https://doi.org/10.1007/978-3-319-71147-8_8
Amir, A., Landau, G.M., Lewenstein, M., Sokol, D.: Dynamic text and static pattern matching. ACM Trans. Algorrithms 3(2), 19 (2007)
Antoniou, P., Daykin, J.W., Iliopoulos, C.S., Kourie, D., Mouchard, L., Pissis, S.P.: Mapping uniquely occurring short sequences derived from high throughput technologies to a reference genome. In: ITAB, pp. 1–4. IEEE Computer Society (2009)
Barthet, M., Plumbley, M.D., Kachkaev, A., Dykes, J., Wolff, D., Weyde, T.: Big chord data extraction and mining. In: CIM (2014)
Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000). https://doi.org/10.1007/10719839_9
Bufe, C.: Understandable Guide to Music Theory: The Most Useful Aspects of Theory for Rock, Jazz, and Blues Musicians. See Sharp Press, Tucson (1994)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC 2004, pp. 91–100. ACM (2004)
Crochemore, M., Ilie, L., Iliopoulos, C.S., Kubica, M., Rytter, W., Waleń, T.: Computing the longest previous factor. Eur. J. Comb. 34(1), 15–26 (2013)
Crochemore, M., Ilie, L., Smyth, W.F.: A simple algorithm for computing the Lempel Ziv factorization. In: DCC, pp. 482–488. IEEE Computer Society (2008)
Derrien, T., Estellé, J., Sola, S.M., Knowles, D., Raineri, E., Guigó, R., Ribeca, P.: Fast computation and applications of genome mappability. PLoS ONE 7(1), e30377 (2012)
Fischer, J.: Inducing the LCP-array. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011. LNCS, vol. 6844, pp. 374–385. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22300-6_32
Fischer, J., Köppl, D., Kurpicz, F.: On the benefit of merging suffix array intervals for parallel pattern matching. In: CPM 2016. LIPIcs, vol. 54, pp. 26:1–26:11. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2016)
Fonseca, N.A., Rung, J., Brazma, A., Marioni, J.C.: Tools for mapping high-throughput sequencing data. Bioinformatics 28(24), 3169–3177 (2012)
Grabowski, S.: A note on the longest common substring with \(k\)-mismatches problem. Inf. Process. Lett. 115(6–8), 640–642 (2015)
Kärkkäinen, J., Kempa, D.: Faster external memory LCP array construction. In: ESA. LIPIcs, vol. 57, pp. 61:1–61:16. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2016)
Karlin, S., Ghandour, G., Ost, F., Tavare, S., Korn, L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. U.S.A. 80(18), 5660–5664 (1983)
Khmelev, D.V., Teahan, W.J.: A repetition based measure for verification of text collections and for text categorization. In: ACM SIGIR 2003, pp. 104–110. ACM (2003)
Kolpakov, R., Bana, G., Kucherov, G.: MREPS: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31(13), 3672–3678 (2003)
Liang, K.-H.: Bioinformatics for Biomedical Science and Clinical Applications. Woodhead Publishing Series in Biomedicine. Woodhead Publishing, Cambridge (2013)
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
Médigue, C., Rose, M., Viari, A., Danchin, A.: Detecting and analyzing DNA sequencing errors: toward a higher quality of the bacillus subtilis genome sequence. Genome Res. 9(11), 1116–1127 (1999)
Metzker, M.L.: Sequencing technologies - the next generation. Nat. Rev. Genet. 11(1), 31–46 (2010)
Nong, G., Zhang, S., Chan, W.H.: Linear suffix array construction by almost pure induced-sorting. In: DCC, pp. 193–202. IEEE (2009)
Smit, A.F.A.: Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9(6), 657–663 (1999)
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016)
Thankachan, S.V., Chockalingam, S.P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation. J. Comput. Biol. 23(6), 452–460 (2016)
Weiner, P.: Linear pattern matching algorithms. In: SWAT 1973, pp. 1–11. IEEE Computer Society (1973)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Alamro, H., Ayad, L.A.K., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P. (2018). Longest Common Prefixes with k-Mismatches and Applications. In: Tjoa, A., Bellatreche, L., Biffl, S., van Leeuwen, J., Wiedermann, J. (eds) SOFSEM 2018: Theory and Practice of Computer Science. SOFSEM 2018. Lecture Notes in Computer Science(), vol 10706. Edizioni della Normale, Cham. https://doi.org/10.1007/978-3-319-73117-9_45
Download citation
DOI: https://doi.org/10.1007/978-3-319-73117-9_45
Published:
Publisher Name: Edizioni della Normale, Cham
Print ISBN: 978-3-319-73116-2
Online ISBN: 978-3-319-73117-9
eBook Packages: Computer ScienceComputer Science (R0)