Abstract
Recent advances in computational biology have introduced sequence matching requirements that give rise to new types of sequence database problems. In this work, we introduce an important class of such problems, the approximate regional sequence matching (ARSM) problem. Given a data sequence and a pattern sequence, an ARSM result is an approximate occurrence of a region of the pattern in the data sequence, subject to two conditions. First, the region must contain a predetermined area of the pattern sequence, termed the core. Second, the allowable deviation between the region of the pattern and its occurrence in the data sequence depends on the length of the region. We propose the PS-ARSM method, which processes the regions of a pattern holistically, taking advantage of their overlaps to identify the ARSM results efficiently. Its performance is evaluated against existing techniques adapted to the ARSM problem.
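To make the problem statement concrete, the following is a minimal brute-force sketch of ARSM. It is not the PS-ARSM method; it treats every region independently, which is exactly the redundancy a holistic approach would aim to avoid. Several details are assumptions made only for illustration: deviation is measured as edit distance, the core is given as a half-open index range `pattern[core_start:core_end]`, and the length-dependent threshold is supplied by a hypothetical `max_errors` function. The names `min_edit_distance_ending_at` and `naive_arsm` are likewise invented here.

```python
from typing import Callable, List, Tuple


def min_edit_distance_ending_at(region: str, data: str) -> List[int]:
    """Sellers-style dynamic programming.

    result[j] is the smallest edit distance between `region` and any
    substring of `data` that ends just before index j.
    """
    # Row 0: the empty region matches anywhere in the data at cost 0.
    prev = [0] * (len(data) + 1)
    for i in range(1, len(region) + 1):
        # Column 0: matching region[:i] against nothing costs i deletions.
        cur = [i] + [0] * len(data)
        for j in range(1, len(data) + 1):
            cost = 0 if region[i - 1] == data[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # skip a region character
                         cur[j - 1] + 1)      # skip a data character
        prev = cur
    return prev


def naive_arsm(
    data: str,
    pattern: str,
    core_start: int,
    core_end: int,
    max_errors: Callable[[int], int],  # assumed: region length -> allowed errors
) -> List[Tuple[int, int, int, int]]:
    """Enumerate every region pattern[s:e] that contains the core and report
    (s, e, data_end, distance) for each approximate occurrence in `data`."""
    results = []
    for s in range(0, core_start + 1):               # region starts at or before the core
        for e in range(core_end, len(pattern) + 1):  # region ends at or after the core
            region = pattern[s:e]
            k = max_errors(len(region))              # threshold depends on region length
            distances = min_edit_distance_ending_at(region, data)
            for data_end, d in enumerate(distances):
                if d <= k:
                    results.append((s, e, data_end, d))
    return results


if __name__ == "__main__":
    # Hypothetical threshold: exact matching for short regions, one error otherwise.
    def allow(n: int) -> int:
        return 0 if n < 4 else 1

    for hit in naive_arsm("GGACGTTT", "ACGA", 1, 3, allow):
        print(hit)
```

Because overlapping regions share most of their characters, this baseline recomputes nearly identical dynamic programming tables many times over. That repeated work is the overlap the abstract refers to when it describes processing the regions of a pattern holistically.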
Cite this article
Vergoulis, T., Dalamagas, T., Sacharidis, D. et al. Approximate regional sequence matching for genomic databases. The VLDB Journal 21, 779–795 (2012). https://doi.org/10.1007/s00778-012-0270-1
DOI: https://doi.org/10.1007/s00778-012-0270-1