An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings

  • Conference paper
Bioinformatics Research and Applications (ISBRA 2016)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9683)

Abstract

Identifying long pairwise maximal common substrings among a large set of sequences is a frequently used construct in computational biology, with applications in DNA sequence clustering and assembly. Due to errors made by sequencers, algorithms that can accommodate a small number of differences are of particular interest, but obtaining provably efficient solutions for such problems has been elusive. In this paper, we present a provably efficient algorithm with an expected run time guarantee of \(O(N\log ^k N+\mathsf {occ})\), where \(\mathsf {occ}\) is the output size, for the following problem: Given a collection \({\mathcal D}=\{S_1,S_2,\dots , S_n\}\) of n sequences of total length N, a length threshold \(\phi \) and a mismatch threshold \(k \ge 0\), report all k-mismatch maximal common substrings of length at least \(\phi \) over all pairs of sequences in \({\mathcal D}\). In addition, we present a result showing the hardness of this problem.
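To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper and does not attain the stated \(O(N\log ^k N+\mathsf {occ})\) bound; it simply scans every alignment diagonal of every pair of sequences. It assumes the usual reading of the definition: a k-mismatch maximal common substring is a pair of equal-length substrings with at most k mismatches that cannot be extended to the left or to the right without exceeding k mismatches. The function names `diagonal_windows` and `k_mismatch_mcs` and the output tuple layout are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def diagonal_windows(a, b, k):
    """Yield (start, length) of maximal windows of the equal-length
    strings a and b that contain at most k mismatching positions."""
    m = len(a)
    mism = [t for t in range(m) if a[t] != b[t]]
    if len(mism) <= k:
        # The whole diagonal fits within the mismatch budget.
        if m > 0:
            yield (0, m)
        return
    # Sentinels mark the positions just outside the diagonal.
    bounds = [-1] + mism + [m]
    # A maximal window holds exactly k consecutive mismatches and is
    # bounded by the mismatch before them and the mismatch after them.
    for t in range(len(mism) - k + 1):
        start = bounds[t] + 1
        end = bounds[t + k + 1]  # exclusive
        if end > start:
            yield (start, end - start)


def k_mismatch_mcs(seqs, k, phi):
    """Report (i, x, j, y, length) for every maximal pair of substrings
    seqs[i][x:x+length], seqs[j][y:y+length] with at most k mismatches
    and length >= phi, over all pairs i < j (assumed definition)."""
    out = []
    for i, j in combinations(range(len(seqs)), 2):
        s, t = seqs[i], seqs[j]
        # d = y - x identifies one alignment diagonal of s against t.
        for d in range(-(len(s) - 1), len(t)):
            x0 = max(0, -d)
            y0 = x0 + d
            m = min(len(s) - x0, len(t) - y0)
            for off, length in diagonal_windows(s[x0:x0 + m], t[y0:y0 + m], k):
                if length >= phi:
                    out.append((i, x0 + off, j, y0 + off, length))
    return out


if __name__ == "__main__":
    print(k_mismatch_mcs(["ACGT", "ACTT"], k=1, phi=4))
```

For two sequences of lengths a and b this sketch spends time proportional to a·b, so over all pairs it is quadratic in the total length N, which is the cost the paper's algorithm is designed to avoid.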

Acknowledgment

This research is supported in part by the U.S. National Science Foundation under IIS-1416259.

Author information

Corresponding author

Correspondence to Srinivas Aluru.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Thankachan, S.V., Chockalingam, S.P., Aluru, S. (2016). An Efficient Algorithm for Finding All Pairs k-Mismatch Maximal Common Substrings. In: Bourgeois, A., Skums, P., Wan, X., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2016. Lecture Notes in Computer Science, vol 9683. Springer, Cham. https://doi.org/10.1007/978-3-319-38782-6_1

  • DOI: https://doi.org/10.1007/978-3-319-38782-6_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-38781-9

  • Online ISBN: 978-3-319-38782-6

  • eBook Packages: Computer Science (R0)
