Approximate regional sequence matching for genomic databases

  • Regular Paper
  • The VLDB Journal

Abstract

Recent advances in computational biology have raised sequence matching requirements that give rise to new types of sequence database problems. In this work, we introduce an important class of such problems, the approximate regional sequence matching (ARSM) problem. Given a data sequence and a pattern sequence, an ARSM result is an approximate occurrence of a region of the pattern in the data sequence that satisfies two conditions. First, the region must contain a predetermined area of the pattern sequence, termed the core. Second, the allowable deviation between the region of the pattern and its occurrence in the data sequence depends on the length of the region. We propose the PS-ARSM method, which processes the regions of a pattern holistically, taking advantage of their overlaps to identify the ARSM results efficiently. Its performance is evaluated against existing techniques adapted to the ARSM problem.
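
To make the two conditions concrete, the following is a minimal brute-force sketch of the problem statement. It is not the PS-ARSM method: it checks every region independently and ignores region overlaps. Several details are assumptions made for illustration only and are not taken from the abstract: deviation is measured as edit distance, the core is given as a half-open index range into the pattern, the length-dependent threshold is supplied by the caller as a function `max_errors`, and the names `sellers_end_distances` and `naive_arsm` are hypothetical.

```python
from typing import Callable, List, Tuple


def sellers_end_distances(region: str, data: str) -> List[int]:
    """Semi-global edit distance (Sellers-style dynamic programming).

    out[e] is the minimum edit distance between `region` and any
    substring of `data` that ends just before index e (e = 0 is the
    empty prefix of `data`).
    """
    m = len(region)
    col = list(range(m + 1))      # distances against the empty data prefix
    out = [col[m]]
    for ch in data:
        prev_diag = col[0]        # stays 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            above_left = prev_diag
            prev_diag = col[i]    # value from the previous data position
            cost = 0 if region[i - 1] == ch else 1
            col[i] = min(above_left + cost,  # match or substitution
                         prev_diag + 1,      # extra character in data
                         col[i - 1] + 1)     # missing character in data
        out.append(col[m])
    return out


def naive_arsm(data: str, pattern: str, core_start: int, core_end: int,
               max_errors: Callable[[int], int]) -> List[Tuple[int, int, int, int]]:
    """Enumerate every region pattern[start:end] that contains the core
    pattern[core_start:core_end] and report its approximate occurrences.

    Each result is (start, end, data_end, distance), where data_end is
    the end position of the occurrence in `data` (exclusive index).
    """
    results = []
    for start in range(0, core_start + 1):              # condition 1: region covers the core
        for end in range(core_end, len(pattern) + 1):
            region = pattern[start:end]
            k = max_errors(len(region))                  # condition 2: threshold depends on length
            dists = sellers_end_distances(region, data)
            for data_end, d in enumerate(dists):
                if data_end > 0 and d <= k:
                    results.append((start, end, data_end, d))
    return results


# Assumed example: allow one error per four region characters.
hits = naive_arsm("ACGTACGTTAGC", "GTTA", 1, 3, lambda n: n // 4)
```

Because every region is matched from scratch, the cost of this sketch grows with the number of regions times the product of region and data lengths. Regions that share a core overlap heavily, which is the redundancy the abstract says PS-ARSM exploits.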

Author information

Corresponding author

Correspondence to Thanasis Vergoulis.

About this article

Cite this article

Vergoulis, T., Dalamagas, T., Sacharidis, D. et al. Approximate regional sequence matching for genomic databases. The VLDB Journal 21, 779–795 (2012). https://doi.org/10.1007/s00778-012-0270-1

  • DOI: https://doi.org/10.1007/s00778-012-0270-1
