Approximate regional sequence matching for genomic databases

Vergoulis, Thanasis; Dalamagas, Theodore; Sacharidis, Dimitris; Sellis, Timos

doi:10.1007/s00778-012-0270-1

Regular Paper
Published: 18 March 2012

The VLDB Journal, Volume 21, pages 779–795 (2012)

Abstract

Recent advances in computational biology have raised sequence matching requirements that give rise to new types of sequence database problems. In this work, we introduce an important class of such problems, the approximate regional sequence matching (ARSM) problem. Given a data sequence and a pattern sequence, an ARSM result is an approximate occurrence of a region of the pattern in the data sequence, subject to two conditions. First, the region must contain a predetermined area of the pattern sequence, termed the core. Second, the allowable deviation between the region of the pattern and its occurrence in the data sequence depends on the length of the region. We propose the PS-ARSM method, which processes the regions of a pattern holistically, taking advantage of their overlaps to identify the ARSM results efficiently. Its performance is evaluated against existing techniques adapted to the ARSM problem.
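To make the problem statement concrete, the following is a minimal brute-force sketch of ARSM as described in the abstract. It is not the PS-ARSM method: it checks every region independently and ignores their overlaps. Several details are assumptions made only for illustration: deviation is measured as unit-cost edit distance, the core is given as a half-open index range `[core_start, core_end)` of the pattern, and the length-dependent threshold is supplied by a hypothetical caller-provided function `max_errors`. The names `approx_end_positions` and `naive_arsm` are likewise hypothetical.

```python
from typing import Callable, List, Tuple


def approx_end_positions(data: str, region: str, k: int) -> List[int]:
    """End positions j (exclusive) such that some substring of data ending
    at j is within edit distance k of region. Uses the classic dynamic
    program for approximate string matching, where a match may start at
    any position of data at no cost."""
    prev = [0] * (len(data) + 1)  # empty region matches everywhere with 0 errors
    for i in range(1, len(region) + 1):
        cur = [i] + [0] * len(data)  # first i region symbols vs. empty text
        for j in range(1, len(data) + 1):
            cost = 0 if region[i - 1] == data[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # region symbol left unmatched
                         cur[j - 1] + 1)      # extra symbol in data
        prev = cur
    return [j for j in range(len(data) + 1) if prev[j] <= k]


def naive_arsm(data: str, pattern: str, core_start: int, core_end: int,
               max_errors: Callable[[int], int]) -> List[Tuple[int, int, int]]:
    """Enumerate every region pattern[s:e] that contains the core and report
    (s, e, j) for each end position j of an approximate occurrence in data.
    The error threshold depends on the region length via max_errors
    (an assumed form of the length-dependent deviation)."""
    results = []
    for s in range(0, core_start + 1):             # region starts at or before the core
        for e in range(core_end, len(pattern) + 1):  # and ends at or after it
            region = pattern[s:e]
            k = max_errors(len(region))
            for j in approx_end_positions(data, region, k):
                results.append((s, e, j))
    return results


if __name__ == "__main__":
    # Assumed threshold: one error allowed per four region symbols.
    hits = naive_arsm("ACGTACGT", "GTAC", 1, 3, lambda n: n // 4)
    for s, e, j in hits:
        print(f"region pattern[{s}:{e}] has an occurrence ending at data[{j}]")
```

Because regions that share the core overlap heavily, this baseline repeats most of the dynamic-programming work for each region; the abstract indicates that PS-ARSM exploits those overlaps instead.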

Author information

Authors and Affiliations

NTUA & IMIS, Athena RC, Athens, Greece
Thanasis Vergoulis & Timos Sellis
IMIS, Athena RC, Athens, Greece
Theodore Dalamagas & Dimitris Sacharidis

Corresponding author

Correspondence to Thanasis Vergoulis.

About this article

Cite this article

Vergoulis, T., Dalamagas, T., Sacharidis, D. et al. Approximate regional sequence matching for genomic databases. The VLDB Journal 21, 779–795 (2012). https://doi.org/10.1007/s00778-012-0270-1

Received: 05 June 2011
Revised: 01 December 2011
Accepted: 29 February 2012
Published: 18 March 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s00778-012-0270-1
