Locating Longest Common Subsequences with Limited Penalty

Wang, Bin; Yang, Xiaochun; Li, Jinxu

doi:10.1007/978-3-319-55699-4_12

Bin Wang¹⁸,
Xiaochun Yang¹⁸ &
Jinxu Li¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10178))

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

2496 Accesses

Abstract

Locating longest common subsequences is a typical and important problem. The original version of locating longest common subsequences stretches a longer alignment between a query and a database sequence finds all alignments corresponding to the maximal length of common subsequences. However, the original version produces a lot of results, some of which are meaningless in practical applications and rise to a lot of time overhead. In this paper, we firstly define longest common subsequences with limited penalty to compute the longest common subsequences whose penalty values are not larger than a threshold \(\tau \). This helps us to find answers with good locality. We focus on the efficiency of this problem. We propose a basic approach for finding longest common subsequences with limited penalty. We further analyze features of longest common subsequences with limited penalty, and based on it we propose a filter-refine approach to reduce number of candidates. We also adopt suffix array to efficiently generate common substrings, which helps calculating the problem. Experimental results on three real data sets show the effectiveness and efficiency of our algorithms.

This work is partially supported by the NSF of China for Outstanding Young Scholars under grant No. 61322208, the NSF of China under grant Nos. 61272178 and 61572122.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Notice that, the definition of penalty score of LCS is different from edit distance even when \(\alpha =\beta =1\). The edit distance between two strings represents the minimal number of edit operations transforming from one string to another string, which does not guarantee to find an alignment with longest common subsequences as LCS does.

References

Arnold, M., Ohlebusch, E.: Linear time algorithms for generalizations of the longest common substring problem. Algorithmica 60(4), 806–818 (2011)
Article MathSciNet MATH Google Scholar
Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: Seventh International Symposium on String Processing and Information Retrieval, SPIRE 2000, A Coruña, Spain, pp. 39–48, 27–29 September 2000
Google Scholar
Brodal, G.S., Kaligosi, K., Katriel, I., Kutz, M.: Faster algorithms for computing longest common increasing subsequences. In: Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching, CPM 2006, Barcelona, Spain, pp. 330–341, 5–7 July 2006
Google Scholar
Hirschberg, D.S.: A linear space algorithm for computing maximal common subsequences. Commun. ACM 18(6), 341–343 (1975)
Article MathSciNet MATH Google Scholar
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 943–955. Springer, Heidelberg (2003). doi:10.1007/3-540-45061-0_73
Chapter Google Scholar
Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching, CPM 2001, Jerusalem, Israel, pp. 181–192, 1–4 July 2001
Google Scholar
Kim, D.K., Sim, J.S., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, CPM 2003, Morelia, Michocán, Mexico, pp. 186–199, 25–27 June 2003
Google Scholar
Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Article MathSciNet MATH Google Scholar
Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. J. Discrete Algorithms 3(2–4), 143–156 (2005)
Article MathSciNet MATH Google Scholar
Korf, I., Yandell, M., Bedell, J.A.: BLAST - An Essential Guide to the Basic Local Alignment Search Tool. O’Reilly, Sebastopol (2003)
Google Scholar
Lam, T.W., Sung, W., Tam, S., Wong, C., Yiu, S.: Compressed indexing and local alignment of DNA. Bioinformatics 24(6), 791–797 (2008)
Article Google Scholar
Levenshtein, V.I.: Binary codes capable of correcting spurious insertions and deletions of ones. Probl. Inf. Transm. 1(1), 817 (1965)
MATH Google Scholar
Masek, W.J., Paterson, M.: A faster algorithm computing string edit distances. J. Comput. Syst. Sci. 20(1), 18–31 (1980)
Article MathSciNet MATH Google Scholar
Meek, C., Patel, J.M., Kasetty, S.: OASIS: an online and accurate technique for local-alignment searches on biological sequences. In: VLDB, pp. 910–921 (2003)
Google Scholar
Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1(2), 251–266 (1986)
Article MathSciNet MATH Google Scholar
Nakatsu, N., Kambayashi, Y., Yajima, S.: A longest common subsequence algorithm suitable for similar text strings. Acta Inf. 18, 171–179 (1982)
Article MATH Google Scholar
Overill, R.E.: Book review: “time warps, string edits, and macromolecules: the theory and practice of sequence comparison” by David Sankoff and Joseph Kruskal. J. Log. Comput. 11(2), 356 (2001)
Article Google Scholar
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
Article Google Scholar
Wagner, R.A., Fischer, M.J.: The string-to-string correction problem. J. ACM 21(1), 168–173 (1974)
Article MathSciNet MATH Google Scholar
Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, pp. 1–11, 15–17 October 1973
Google Scholar
Yang, X., Liu, H., Wang, B.: ALAE: accelerating local alignment with affine gap exactly in biosequence databases. PVLDB 5(11), 1507–1518 (2012)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science and Engineering, Northeastern University, Shenyang, 110169, Liaoning, China
Bin Wang, Xiaochun Yang & Jinxu Li

Authors

Bin Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaochun Yang
View author publications
You can also search for this author in PubMed Google Scholar
Jinxu Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bin Wang .

Editor information

Editors and Affiliations

Arizona State University, Tempe - Phoenix, Arizona, USA
Selçuk Candan
of Science and Technology, Hong Kong University of Science and Technology, Hong Kong, China
Lei Chen
Aalborg University , Aalborg, Denmark
Torben Bach Pedersen
University of New South Wales , Sydney, New South Wales, Australia
Lijun Chang
The University of Queensland , Brisbane, Queensland, Australia
Wen Hua

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, B., Yang, X., Li, J. (2017). Locating Longest Common Subsequences with Limited Penalty. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science(), vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_12

Download citation

DOI: https://doi.org/10.1007/978-3-319-55699-4_12
Published: 22 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55698-7
Online ISBN: 978-3-319-55699-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics