Research article
Deposition and extension approach to find longest common subsequence for thousands of long sequences

https://doi.org/10.1016/j.compbiolchem.2010.05.001

Abstract

The problem of finding the longest common subsequence (LCS) for an arbitrary number of sequences is an interesting and challenging problem in computer science. The problem is NP-complete, but because of its importance, many heuristic algorithms have been proposed, such as Long Run, the Expansion Algorithm and THSB. However, the performance of many current heuristic algorithms, in either result quality or processing time, deteriorates quickly as the number of sequences and the sequence length increase. In this paper, we propose a post-process heuristic algorithm for the LCS problem, the Deposition and Extension Algorithm (DEA). The algorithm first generates a common subsequence by “sequence deposition” based on fine tuning of the search range, and then extends this common subsequence. The algorithm is proven to generate common subsequences (CSs) with guaranteed lengths. Experiments on different datasets showed that the results of DEA were better than those of Long Run and the Expansion Algorithm, especially on many long sequences. The algorithm was also more efficient in both time and memory.

Introduction

The problem of finding the longest common subsequences (LCSs) has many applications in different areas of computer science, such as data compression, pattern recognition, file comparison and biological sequence comparison and analysis (Cormen et al., 2001, Gusfield, 1997). The LCS of a set of sequences can be formulated as follows. For two sequences $S = s_1 \ldots s_m$ and $T = t_1 \ldots t_n$, $S$ is a subsequence of $T$ (and $T$ is a supersequence of $S$) if there exist indices $1 \le i_1 < \dots < i_m \le n$ such that $s_j = t_{i_j}$ for all $1 \le j \le m$. Given a finite set of sequences $SS = \{S_1, S_2, \ldots, S_k\}$, a common subsequence (CS) of $SS$ is a sequence $S$ such that each sequence in $SS$ is a supersequence of $S$, and an LCS of $SS$ is a longest such $S$ for this set of sequences.
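The definitions above can be checked mechanically. The following is a minimal Python sketch, not code from the paper; the function names `is_subsequence` and `is_common_subsequence` are illustrative assumptions.

```python
from typing import Iterable, Sequence


def is_subsequence(s: Sequence, t: Sequence) -> bool:
    """Return True if s can be obtained from t by deleting zero or more items."""
    it = iter(t)
    # `x in it` consumes the iterator up to and including the first match,
    # so the matched positions in t are strictly increasing.
    return all(x in it for x in s)


def is_common_subsequence(s: Sequence, sequences: Iterable[Sequence]) -> bool:
    """Return True if every sequence in the set is a supersequence of s."""
    return all(is_subsequence(s, t) for t in sequences)
```

Each check takes time linear in the length of the candidate supersequence, so verifying a proposed common subsequence against K sequences of length N costs O(KN).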

The LCS problem has been examined extensively by many researchers (refer to Paterson and Dancik, 1994 for details). The LCS of two sequences, each of arbitrary length N, can be computed by dynamic programming in $O(N^2)$ time and $O(N^2)$ space, and a number of papers address the two-sequence LCS problem using dynamic programming with reduced time and space (Cormen et al., 2001, Masek and Paterson, 1980). Unfortunately, the LCS problem on an arbitrary number K of sequences is a well-known NP-hard problem that is hard even to approximate in the worst case (Jiang and Li, 1995).
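For reference, the textbook quadratic dynamic program for two sequences can be sketched as follows. This is the standard formulation rather than anything specific to this paper, and the name `lcs_two` is an assumption made for illustration.

```python
def lcs_two(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t (quadratic time and space)."""
    m, n = len(s), len(t)
    # table[i][j] holds the LCS length of the prefixes s[:i] and t[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal answer.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Several different LCSs of the same length may exist; the traceback above returns just one of them.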

Though this problem is NP-hard, it is so important in applications that many heuristic algorithms have been proposed to solve it (Paterson and Dancik, 1994, Jiang and Li, 1995, Bonizzoni et al., 2001). In what follows, we use N for the length of a sequence, K for the number of sequences, and $\Sigma$ for the alphabet. Note that giving all sequences in a set the same length N is only for simplicity and does not affect the final conclusions. For sequences over the alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$, the alphabet content $r_i$ of character $\sigma_i$ is defined as the number of occurrences of $\sigma_i$ in all of the sequences, divided by the total length of all sequences.
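As a small illustration of the alphabet content defined above, the ratio for every character can be computed in one counting pass. This is a sketch under the stated definition; the function name `alphabet_content` is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


def alphabet_content(sequences: Iterable[str]) -> Dict[str, float]:
    """Map each character to its share of all characters in all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    if total == 0:
        # No characters at all: the ratio is undefined, so return nothing.
        return {}
    return {ch: c / total for ch, c in counts.items()}
```

For the two sequences `aacg` and `acgt`, the character `a` occurs 3 times among 8 characters in total, so its alphabet content is 3/8.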

There exists a dynamic programming algorithm to compute the LCS of a set S of K sequences (Sankoff and Kruskal, 1983), but it requires $O(N^K)$ time and space. Hence this algorithm is feasible only for small values of N and K (Hakata and Imai, 1992, Hsu and Du, 1984). An improvement of this algorithm based on the Four Russians’ technique (Arlazarov et al., 1970) is possible, with time complexity $O(N^K / \log N)$, but because of the hidden constants such an algorithm would be no more appealing than the one in (Sankoff and Kruskal, 1983) for arbitrary sequence datasets.
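To make the exponential state space concrete, here is a memoized sketch of the generic K-dimensional recurrence. It is not the implementation of Sankoff and Kruskal (1983), only an illustration of the idea; the name `lcs_length_k` is an assumption, and the recursion is intended for very small inputs only.

```python
from functools import lru_cache
from typing import Sequence, Tuple


def lcs_length_k(sequences: Sequence[str]) -> int:
    """Return the LCS length of all given sequences by exhaustive dynamic programming."""
    if not sequences:
        return 0
    seqs = tuple(sequences)

    @lru_cache(maxsize=None)
    def solve(pos: Tuple[int, ...]) -> int:
        # pos[k] is the length of the prefix of sequence k under consideration.
        if any(p == 0 for p in pos):
            return 0
        last_chars = {seqs[k][p - 1] for k, p in enumerate(pos)}
        if len(last_chars) == 1:
            # All prefixes end in the same character: match it everywhere.
            return 1 + solve(tuple(p - 1 for p in pos))
        # Otherwise at least one last character is unused: try dropping each.
        return max(
            solve(pos[:k] + (pos[k] - 1,) + pos[k + 1:])
            for k in range(len(pos))
        )

    return solve(tuple(len(s) for s in seqs))
```

The memo table can hold up to (N+1)^K index tuples, which is why this approach stops being practical as soon as N or K grows.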

The simple and fast Long Run Algorithm was proposed by Jiang and Li (1995). For K sequences in $S = \{s_1, s_2, \ldots, s_K\}$ over the finite alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$, the Long Run Algorithm finds the maximum $m$ such that there exists a $\sigma_i \in \Sigma$ for which $\sigma_i^m$, which denotes $m$ consecutive copies of the character $\sigma_i$, is a common subsequence of all input sequences. It outputs $\sigma_i^m$ as its LCS result. The time complexity of the Long Run Algorithm is $O(KN)$.
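A run of m copies of one character is a common subsequence exactly when every sequence contains at least m occurrences of that character, so the Long Run result follows from per-sequence character counts. The sketch below relies on that observation; the name `long_run` and the alphabetical tie-breaking are assumptions made here, not details from Jiang and Li (1995).

```python
from collections import Counter
from typing import Sequence


def long_run(sequences: Sequence[str]) -> str:
    """Return the longest single-character run that is common to all sequences."""
    if not sequences:
        return ""
    # One counting pass per sequence.
    counts = [Counter(seq) for seq in sequences]
    best_char, best_len = "", 0
    # Only characters of the first sequence can occur in every sequence.
    # Sorting makes tie-breaking deterministic (an arbitrary choice).
    for ch in sorted(counts[0]):
        m = min(c[ch] for c in counts)
        if m > best_len:
            best_char, best_len = ch, m
    return best_char * best_len
```

For example, `long_run(["aaagcctt", "aagctt", "tatat"])` yields `aa`: both `a` and `t` occur at least twice in every sequence, and the tie is broken alphabetically.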

The Expansion Algorithm was proposed by Bonizzoni et al. (2001). Its strategy is first to reduce all sequences in the set S to streams, where a stream is a sequence without consecutive identical characters. For example, the sequence $s_1$ = aaagcctt is reduced to agct. It then finds all short common streams of the sequences, that is, those whose lengths are not more than 2. Among them, a longest common stream z of all sequences in S is chosen, and all substrings of z are expanded to find a common subsequence of S with maximal length. For example, the sequence aag can be expanded from ag. The Long Run Algorithm is also embedded in the Expansion Algorithm through finding all short common streams and expanding them. The time complexity of the Expansion Algorithm is $O(KN^4 \log N)$.
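The stream reduction step is easy to illustrate on its own. The sketch below covers only that step and a check of whether one sequence is an expansion of a given stream; it does not implement the Expansion Algorithm, and the names `to_stream` and `is_expansion_of` are assumptions.

```python
from itertools import groupby


def to_stream(seq: str) -> str:
    """Collapse each run of identical consecutive characters to one character."""
    return "".join(ch for ch, _run in groupby(seq))


def is_expansion_of(candidate: str, stream: str) -> bool:
    """Return True if candidate reduces to exactly the given stream."""
    return to_stream(candidate) == stream
```

With the example from the text, `to_stream("aaagcctt")` gives `agct`, and `is_expansion_of("aag", "ag")` holds.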

The Best Next heuristic algorithm (Huang et al., 2004) is a typical example of a fast heuristic LCS algorithm for two sequences (Huang et al., 2004, Farser, 1995, Johtela et al., 1996), and it can easily be extended to a heuristic for K sequences. The Best Next algorithm generates a common subsequence from left to right, adding one character to the common subsequence in each step, and terminates when no more characters can be added. The Best Next algorithm is also faster than the Expansion Algorithm. In terms of the quality of heuristic LCS results, the THSB (Time Horizon Specialized Branching) algorithm by Easton and Singireddy (2008), which is based on the branch-and-bound technique, has been shown to outperform the Expansion and Best Next algorithms.
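The left-to-right construction described above can be sketched as a greedy loop. The text does not state how the next character is chosen, so the scoring rule below (prefer the character whose next occurrences leave the longest shortest remaining suffix) is purely an assumption for illustration, as is the name `best_next_greedy`; the published Best Next heuristic may use a different rule.

```python
from typing import List, Optional, Sequence, Tuple


def best_next_greedy(sequences: Sequence[str]) -> str:
    """Build a common subsequence left to right, one character per step."""
    if not sequences:
        return ""
    pos = [0] * len(sequences)            # next unread index in each sequence
    alphabet = sorted(set(sequences[0]))  # only these can be common to all
    result: List[str] = []
    while True:
        best: Optional[Tuple[int, str, List[int]]] = None
        for ch in alphabet:
            # Next occurrence of ch in every sequence at or after pos.
            hits = [seq.find(ch, p) for seq, p in zip(sequences, pos)]
            if min(hits) < 0:
                continue  # ch is missing from at least one remaining suffix
            # Assumed score: shortest suffix left after consuming ch.
            remaining = min(len(seq) - h - 1 for seq, h in zip(sequences, hits))
            if best is None or remaining > best[0]:
                best = (remaining, ch, hits)
        if best is None:
            # No character can be appended any more.
            return "".join(result)
        _, ch, hits = best
        result.append(ch)
        pos = [h + 1 for h in hits]
```

The output is always a common subsequence of the inputs, but, as with any greedy heuristic of this kind, it is not guaranteed to be a longest one.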

However, the performance (result quality and processing time) of many of these algorithms deteriorates quickly on large LCS instances. By large LCS instances, we mean instances where (a) the sequences in S are long (N is 100 or more) and (b) there are many sequences (K is 100 or more). All other instances are considered small LCS instances. Solving the LCS problem heuristically on large LCS instances is an important emerging challenge in computer science. For example, much current work in high-throughput sequencing projects is based on many long sequences (Kemena and Notredame, 2009), and the LCS of these sequences is also of interest.

We have developed an effective heuristic algorithm, the Deposition and Extension Algorithm (DEA), which is suitable for large LCS instances. The key component of this algorithm is the fine tuning of the search range for the deposition process. To assess the performance of DEA and other algorithms, a standard for performance analysis is also proposed. We have proven the guaranteed performance of DEA, analyzed its results empirically, and shown empirically that it is superior to other heuristic algorithms, especially on many long sequences.


Materials and methods

The Deposition and Extension Algorithm (DEA) is written in Perl. The experiments were performed on a PC with a 3 GHz CPU and 1 GB of memory, running Linux.

We have selected Long Run (LR) (Jiang and Li, 1995), the Expansion Algorithm (EA) (Bonizzoni et al., 2001) and THSB (Easton and Singireddy, 2008) for comparison with our algorithm. Both the “Most Front” (MF) and “Min Change” (MC) methods (details in Section 2.2) are used for performance analysis. For the Expansion Algorithm, if not specified,

Analysis on different aspects of the DEA algorithm

We first analyzed the performance of the deposition method and the extension method of DEA on simulated datasets. The template pool was used for the extension method. The results are shown in Table 2. In each cell of a method's results column, the value is given as “average value (standard deviation)”, computed for each setting over 10 sets of sequences with the same K and N.

It could be observed from Table 2 that the MC deposition method could result in longer LCS templates than

Conclusions

The LCS problem is so important that there is a constant need for an efficient and accurate algorithm to solve it. In this paper, we have proposed a post-process approach to the LCS problem, the Deposition and Extension Algorithm (DEA), based on an analysis of the properties of multiple sequences. We have focused specifically on large LCS instances.

The DEA algorithm first generates a good template by the MF or MC deposition method, based on fine tuning of the search range, and then generates a

Acknowledgements

We would like to thank Giancarlo Mauri, Gianluca Della Vedova and Todd Easton for providing their source code. We also thank the anonymous reviewers for their insightful comments on this paper.

References (20)

  • C. Blum et al. Beam search for the longest common subsequence problem. Computers & Operations Research (2009)
  • P. Bonizzoni et al. Experimenting an approximation algorithm for the LCS. Discrete Applied Mathematics (2001)
  • W. Hsu et al. New algorithms for the LCS problem. Journal of Computer and System Sciences (1984)
  • W. Masek et al. A faster algorithm computing string edit distances. Journal of Computer and System Sciences (1980)
  • V. Arlazarov et al. On the economic construction of the transitive closure of a direct graph. Soviet. Math. Dokl. (1970)
  • C. Blum et al. Probabilistic beam search for the longest common subsequence problem. Engineering stochastic local search algorithms. Designing, implementing and analyzing effective heuristics. Lecture Notes in Computer Science (2007)
  • F. Chin et al. Performance analysis of some simple heuristics for computing longest common subsequences. Algorithmica (1994)
  • T.H. Cormen et al. Introduction to Algorithms (2001)
  • T. Easton et al. A large neighborhood search heuristic for the longest common subsequence problem. Journal of Heuristics (2008)
  • C.B. Farser. Subsequences and Supersequences of Strings. University of Glasgow, Computing Science Department... (1995)
There are more references available in the full text version of this article.
