1 Introduction

In computer science, a string s is defined as a finite sequence of characters from a (generally finite) alphabet \(\varSigma \). An important characteristic of a string s is its length, denoted by |s|. A string is generally used as a data type for representing and storing sequence information. Words, and even whole texts, may be stored by means of strings. Strings arise, in particular, in the field of bioinformatics, because most of the genetic instructions involved in the growth, development, functioning and reproduction of living organisms are stored in Deoxyribonucleic acid (DNA) molecules which can be represented by strings over the alphabet \(\varSigma = \{A,C,T,G\}\). A string s is called a palindrome if \(s = s^{\mathrm {rev}}\), where \(s^\mathrm {rev}\) is the reverse string of s; for example, madam is a palindrome.

Note that, given a string s, any string t that can be obtained from s by deleting zero or more characters is called a subsequence of s. Palindromic subsequences are especially interesting in the biological context, as palindromic motifs are found in many genetic instructions, such as DNA sequences. In the context of a research project on genome sequencing it was discovered that many of the bases on the Y-chromosome are arranged as palindromes [17]. Such a palindrome structure allows the Y-chromosome to repair itself by bending over at the middle if one side is damaged. Moreover, it is believed that palindromes are also frequently found in proteins [10], but their role in protein function is less well understood. Biologists believe that identifying palindromic subsequences of DNA sequences may help to understand genomic instability [7, 23]. Palindromic subsequences also seem to be important for the regulation of, for example, gene activity, because they are often found close to promoters, introns and untranslated regions.

An important way of comparing two or more strings is to find long common subsequences. More specifically, given a set of m non-empty strings \(S=\{s_1,\ldots ,s_m\}\), a common subsequence of the strings in S is a subsequence that all strings in S have in common, and a longest common subsequence of the strings in S is a common subsequence of maximal length. The so-called Longest Common Subsequence (LCS) problem [18] is a classical optimization problem that aims at finding such a longest common subsequence of the strings in S. Apart from applications in computational biology [16], this problem appears, for example, in data compression [22] and the production of circuits in field programmable gate arrays [6]. Finally, a common palindromic subsequence of a set of strings S is a common subsequence of all strings in S which, at the same time, is a palindrome. For biologists it is not only of interest to identify palindromic subsequences of an individual DNA string; it is also important to find longest common palindromic subsequences of multiple input strings in order to identify relationships among them.

1.1 Related Work

The LCS problem is known to be NP-hard for an arbitrary number m of input strings [18]. Note, however, that for any fixed m the problem is polynomially solvable by dynamic programming [11]. Standard dynamic programming approaches require \(O(n^m)\) time and space, where n is the length of the longest input string. Even though this complexity can be reduced to \(O(n^{m-1})\), see Bergroth et al. [1], dynamic programming quickly becomes impractical as m grows. Concerning simple approximate methods, the expansion algorithm in [5] and the Best-Next heuristic [9, 14] are probably the best-known techniques. A breakthrough both in terms of computation time and solution quality was achieved with the beam search (BS) approach described by Blum et al. [2]. Beam search is an incomplete tree search algorithm which relies on a solution construction mechanism and bounding information. More specifically, the above BS uses the construction mechanism of the Best-Next heuristic and just a simple upper bound function. Nevertheless, this algorithm was able to outperform all algorithms existing at that time. Later, Mousavi and Tabataba [20] proposed a variant of this BS with a different heuristic function and a different pruning mechanism.

The specific problem tackled in this work—that is, the longest common palindromic subsequence (LCPS) problem—has so far only been studied for the case of \(m=2\) input strings (2-LCPS). Chowdhury et al. [8] propose two different algorithms: a conventional dynamic programming algorithm with time and space complexity \(O(n^4)\), and a sparse dynamic programming algorithm with time complexity \(O(R^2 \log ^2 n\log \log n + n)\) and space complexity \(O(R^2)\), where R is the number of matching position pairs between the two input strings. Furthermore, Hasan et al. [13] solved the 2-LCPS by making use of a so-called palindromic subsequence automaton (PSA). This algorithm has a time complexity of \(O(n+R_1|\varSigma |+R_2|\varSigma |+n+R_1R_2|\varSigma |)\), where \(R_1\) and \(R_2\) denote the numbers of states of the two automata constructed for the two input strings and are bounded by \(O(n^2)\). Finally, Inenaga and Hyyrö [15] present an algorithm that runs in \(O(\sigma R^2 + n)\) time and uses \(O(R^2 + n)\) space, where \(\sigma \) denotes the number of distinct characters occurring in both of the input strings.

By reducing the general LCS problem to the LCPS problem in polynomial time, it can be shown that the LCPS problem with an arbitrary number of input strings is NP-hard. To the best of our knowledge, no algorithm has been published yet for solving this general m-LCPS problem, which is henceforth simply called LCPS problem. An instance of the LCPS problem is denoted by \((S, \varSigma )\), where S is the set of input strings over alphabet \(\varSigma \).

1.2 Organization of the Paper

The rest of the paper is organized as follows. In Sect. 2, a fast greedy heuristic is presented. Moreover, two upper bound functions are proposed. In Sect. 3 we present two search algorithms that operate on the same search tree: an exact A\(^*\) search and a heuristic beam search. We further consider a variant of A\(^*\) that has a simple diving mechanism as well as beam search embedded in order to obtain promising complete solutions already early during the search process. In this way, both A\(^*\) and beam search can be used as heuristics to approach large instances. Experimental results are described in Sect. 4. Conclusions as well as an outlook on future work are finally provided in Sect. 5.

2 A Greedy Heuristic for the LCPS Problem

We first introduce some additional notation. As already mentioned before, let \(n=\max _{s_i\in S} |s_i|\) be the maximum input string length. The j-th letter of a string s is denoted by s[j], with \(j=1,\ldots ,|s|\). We further denote the concatenation of two strings by operator “\(\cdot \)”, i.e., \(s_1 \cdot s_2\) is the string obtained by appending string \(s_2\) to string \(s_1\). Notation \(s[j,j']\), \(j\le j'\), refers to the substring of s starting at the j-th position and ending at position \(j'\). The same notation refers to the empty string \(\varepsilon \) if \(j > j'\). Finally, let \(|s|_a\) be the number of occurrences of letter \(a \in \varSigma \) in string s, and let \(|s|_A = \sum _{a \in A} |s|_a\) be the total number of occurrences of letters from set \(A \subseteq \varSigma \) in string s.

Inspired by the well-known Best-Next heuristic for the LCS problem [9], we present in the following a constructive greedy heuristic for the LCPS problem. Henceforth, a string s is called a valid partial solution concerning input strings \(S=\{s_1,\ldots ,s_m\}\), if \(s\cdot s^\mathrm {rev}\) or \(s\cdot s[1,|s|-1]^\mathrm {rev}\) is a common palindromic subsequence of the strings in S. The greedy heuristic starts with an empty partial solution \(s=\varepsilon \) and extends, at each construction step, the current partial solution by appending exactly one letter (if possible). During the whole process, the algorithm makes use of pointers \(p^{\mathrm {L}}_i \le p^{\mathrm {R}}_i\) that indicate for each input string \(s_i\), \(i=1,\ldots ,m\), the still relevant substring \(s_i[p^{\mathrm {L}}_i,p^{\mathrm {R}}_i]\) from which the letter for extending s can be chosen. The choice of a letter with respect to a greedy criterion is explained below. At the start of the heuristic, i.e., when \(s=\varepsilon \), the pointers are initialized to \(p^{\mathrm {L}}_i := 1\) and \(p^{\mathrm {R}}_i := |s_i|\), referring to the first and last letter, respectively, of each string \(s_i\), \(i=1,\ldots ,m\). In other words, at each iteration the set of relevant substrings denoted by \(S[p^{\mathrm {L}},p^{\mathrm {R}}] =\{s_i[p^{\mathrm {L}}_i,p^{\mathrm {R}}_i] \mid i=1,\ldots ,m\}\) forms an LCPS subproblem, and the current partial solution s is ultimately extended by appending the solution to this subproblem.

One of the questions that remain is how to determine the subset of letters from \(\varSigma \) that can be used to extend a current partial solution s. For this purpose, let \(c_a := \min _{i=1,\ldots ,m} |s_i[p^{\mathrm {L}}_i, p^{\mathrm {R}}_i]|_a\) be the minimum number of occurrences of letter \(a \in \varSigma \) in the relevant substrings with respect to s, and let \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})} := \{a \in \varSigma \mid c_a \ge 1\}\) be the set of letters appearing at least once in each relevant substring. In principle, any letter from \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) might be used to extend s. However, there might be dominated letters in this set. In order to introduce the domination relation between two letters, we use the first and last positions at which each letter \(a \in \varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) appears in each relevant substring \(s_i[p^{\mathrm {L}}_i, p^{\mathrm {R}}_i]\):

$$\begin{aligned} q^\mathrm {L}_{i,a}&:= \min \,\{j = p^{\mathrm {L}}_i,\ldots ,p^{\mathrm {R}}_i \mid s_i[j] = a\} \\ q^\mathrm {R}_{i,a}&:= \max \,\{j = p^{\mathrm {L}}_i,\ldots ,p^{\mathrm {R}}_i \mid s_i[j] = a\} \end{aligned}$$

A letter \(a \in \varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) is called dominated if there exists a letter \(b \in \varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\), \(b \ne a\), such that \(q^\mathrm {L}_{i,b}< q^\mathrm {L}_{i,a} \wedge q^\mathrm {R}_{i,b} > q^\mathrm {R}_{i,a}\) for \(i=1,\ldots ,m\). Clearly, it is better to delay the consideration of dominated letters and select a non-dominated letter for the extension of s. Furthermore, letters \(a\in \varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) with \(c_a = 1\), called singletons, should only be considered when no other letters remain in \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\), since at most one such letter can be chosen as the single middle letter of the final solution. Accordingly, let the set of all non-dominated non-singleton letters from \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) with respect to s be denoted by \(\varSigma ^\mathrm {nd}_{(p^{\mathrm {L}},p^{\mathrm {R}})}\). Given a partial solution s, the selection of the letter to be appended to s and the adaptation of the pointers work as follows (a code sketch is given after the enumeration):

1.

    If \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) is empty, the algorithm terminates with \(s \cdot s^\mathrm {rev}\) as resulting common palindromic subsequence, since no further extension is possible.

2.

    Otherwise, if \(\varSigma ^\mathrm {nd}_{(p^{\mathrm {L}},p^{\mathrm {R}})}\) is empty, only singletons remain in \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\). The algorithm terminates with the common palindromic subsequence \(s\cdot a \cdot s^\mathrm {rev}\), where a is the first singleton from \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) in alphabetic order.

3.

    Otherwise, select a letter \(a\in \varSigma ^\mathrm {nd}_{(p^{\mathrm {L}},p^{\mathrm {R}})}\) that minimizes the greedy function \(g(a,p^{\mathrm {L}},p^{\mathrm {R}})\), which will be discussed in Sect. 2.1. Ties are broken randomly. Extend the current partial solution s and adapt the pointers as follows:

    $$\begin{aligned}&s := s\cdot a&\end{aligned}$$
    (1)
    $$\begin{aligned}&p^{\mathrm {L}}_i := q^\mathrm {L}_{i,a} + 1&i=1,\ldots ,m \end{aligned}$$
    (2)
    $$\begin{aligned}&p^{\mathrm {R}}_i := q^\mathrm {R}_{i,a} - 1&i=1,\ldots ,m \end{aligned}$$
    (3)
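To make the construction concrete, the following minimal Python sketch implements one full run of the greedy heuristic. It is only an illustration, not the authors' implementation: indexing is 0-based rather than 1-based, the helper names (occurrences, dominated, greedy_lcps) are ours, ties in step 3 are broken deterministically instead of randomly, and the greedy function g is passed as a parameter (a direct transcription of (4) follows in Sect. 2.1).

```python
def occurrences(S, pL, pR):
    """For each letter occurring in all relevant substrings
    s_i[pL_i..pR_i] (inclusive), return (qL, qR, c_a): the per-string
    first/last occurrence positions and the minimum occurrence count."""
    m = len(S)
    qL, qR, c = {}, {}, {}
    for i, s in enumerate(S):
        seen = {}
        for j in range(pL[i], pR[i] + 1):
            first, _, cnt = seen.get(s[j], (j, j, 0))
            seen[s[j]] = (first, j, cnt + 1)
        for a, (first, last, cnt) in seen.items():
            qL.setdefault(a, [None] * m)[i] = first
            qR.setdefault(a, [None] * m)[i] = last
            c[a] = min(c.get(a, cnt), cnt)
    return {a: (qL[a], qR[a], c[a]) for a in c
            if all(q is not None for q in qL[a])}

def dominated(a, b, occ, m):
    """True iff letter a is dominated by letter b."""
    (qLa, qRa, _), (qLb, qRb, _) = occ[a], occ[b]
    return all(qLb[i] < qLa[i] and qRb[i] > qRa[i] for i in range(m))

def greedy_lcps(S, g):
    """One full greedy run; returns a common palindromic subsequence."""
    m = len(S)
    pL, pR, s = [0] * m, [len(t) - 1 for t in S], ""
    while True:
        occ = occurrences(S, pL, pR)
        if not occ:                                  # case 1
            return s + s[::-1]
        nd = [a for a, (_, _, ca) in occ.items() if ca >= 2 and
              not any(dominated(a, b, occ, m) for b in occ if b != a)]
        if not nd:                                   # case 2: only singletons
            return s + min(occ) + s[::-1]
        a = min(nd, key=lambda x: g(x, occ[x][0], occ[x][1], pL, pR))
        s += a                                       # case 3, eqs. (1)-(3)
        for i in range(m):
            pL[i] = occ[a][0][i] + 1
            pR[i] = occ[a][1][i] - 1
```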

2.1 Greedy Function

The greedy function that is used to evaluate any possible extension \(a\in \varSigma ^\mathrm {nd}_{(p^{\mathrm {L}},p^{\mathrm {R}})}\) for a given partial solution extends the one used in [3] in a straightforward manner. It calculates the sum of those fractions of the relevant substrings \(s_i[p^{\mathrm {L}}_i,p^{\mathrm {R}}_i]\) that will be discarded from further consideration when appending a as the next letter to the partial solution:

$$\begin{aligned} g(a,p^{\mathrm {L}},p^{\mathrm {R}}) := \sum _{i=1}^m \frac{q^\mathrm {L}_{i,a}-p^{\mathrm {L}}_i+p^{\mathrm {R}}_i-q^\mathrm {R}_{i,a}}{p^{\mathrm {R}}_i-p^{\mathrm {L}}_i+1}. \end{aligned}$$
(4)

The major advantage of this function is its simplicity, as it can be calculated in O(m) time. Obviously, this function also has some weaknesses: (1) it does not take into account how many other letters might be excluded from further consideration as a result of choosing a specific letter, even if that letter has a good (low) greedy function value; (2) it does not take into account that at most one of all singletons can finally be selected. Instead of improving the above greedy function along these lines, and thereby increasing its time complexity, we consider it more promising—especially with the type of algorithm in mind that will be presented in the next section—to develop upper bound functions for estimating the length of an LCPS. As we will see, these bounds can also be used as alternative greedy functions to evaluate possible extensions of partial solutions.
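A direct transcription of (4), suitable as the g parameter of the greedy sketch above; qLa and qRa hold the per-string first and last occurrence positions of letter a (0-based indexing does not change the fractions):

```python
def g(a, qLa, qRa, pL, pR):
    """Greedy function (4): total fraction of the relevant substrings
    that would be discarded when appending letter a."""
    return sum((qLa[i] - pL[i] + pR[i] - qRa[i]) / (pR[i] - pL[i] + 1)
               for i in range(len(pL)))
```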

2.2 Upper Bounds for the Length of an LCPS

A first upper bound on the length of any common palindromic subsequence obtainable for a set of strings S can be calculated by

$$\begin{aligned} \mathrm {UB}_1(S) = \left( 2\sum _{a\in \varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}}\left\lfloor \frac{c_a}{2}\right\rfloor \right) + \mathbbm {1}_{\exists a\in \varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\mid c_a \bmod 2 = 1}. \end{aligned}$$
(5)

The last term accounts for the fact that at most one letter can be added as the single middle letter at the end of a solution construction, with \(\mathbbm {1}\) denoting the indicator function that yields one iff the condition in the subscript is fulfilled, i.e., iff there exists a letter in \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\) with an odd value of \(c_a\). Note that \(\mathrm {UB}_1(S)\) can be calculated in O(mn) time, considering the required re-calculation of the counters \(c_a\).
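As an illustration, \(\mathrm {UB}_1\) can be evaluated as follows; this is a sketch with our own helper min_counts for the counters \(c_a\), again using 0-based pointers:

```python
def min_counts(S, pL, pR):
    """c_a for every letter occurring in all relevant substrings."""
    counts = []
    for i, s in enumerate(S):
        cnt = {}
        for j in range(pL[i], pR[i] + 1):
            cnt[s[j]] = cnt.get(s[j], 0) + 1
        counts.append(cnt)
    common = set.intersection(*(set(c) for c in counts))
    return {a: min(c[a] for c in counts) for a in common}

def ub1(c):
    """Upper bound (5) computed from the counters c_a."""
    return (2 * sum(ca // 2 for ca in c.values())
            + (1 if any(ca % 2 == 1 for ca in c.values()) else 0))

# Matching the examples discussed below:
# ub1(min_counts(["abba", "abab"], [0, 0], [3, 3])) == 4
# ub1(min_counts(["aba", "bab"], [0, 0], [2, 2])) == 1
```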

A second upper bound can be derived as follows. First, each relevant substring \(s_i[p^{\mathrm {L}}_i,p^{\mathrm {R}}_i]\) is reduced by deleting all letters that are not in \(\varSigma _{(p^{\mathrm {L}},p^{\mathrm {R}})}\). The resulting strings are denoted by \(s_i[p^{\mathrm {L}}_i,p^{\mathrm {R}}_i]^{\mathrm {red}}\), \(i=1,\ldots ,m\). Then, a longest palindromic subsequence, denoted by \(\mathrm {LPS}(s_i[p^{\mathrm {L}}_i,p^{\mathrm {R}}_i]^\mathrm {red})\), is calculated for each of these strings individually. As a longest common palindromic subsequence of all strings \(S[p^{\mathrm {L}},p^{\mathrm {R}}]^\mathrm {red}\) cannot be longer than any individual longest palindromic subsequence, we obtain the upper bound

$$\begin{aligned} \mathrm {UB}_2(S) = \min _{i=1,\ldots ,m}|\mathrm {LPS}(s_i[p^{\mathrm {L}}_i,p^{\mathrm {R}}_i]^\mathrm {red})|. \end{aligned}$$
(6)

A longest palindromic subsequence of a single string can be calculated by solving the LCS (longest common subsequence) problem for the problem instance that has as input strings the string itself and its reversal. This can be done by dynamic programming in \(O(n^2)\) time. Masek and Paterson [19] presented a slightly faster algorithm that runs in \(O(n^2/\log n)\) time.
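A sketch of this computation: the length of a longest palindromic subsequence obtained as the LCS of a string with its reverse via the standard \(O(n^2)\) dynamic program, and \(\mathrm {UB}_2\) as in (6) applied to the (already reduced) strings:

```python
def lps_length(s):
    """Length of a longest palindromic subsequence of s, computed as
    the LCS of s with its reverse by dynamic programming."""
    t, n = s[::-1], len(s)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if s[i - 1] == t[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[n][n]

def ub2(reduced_strings):
    """Upper bound (6): minimum LPS length over the reduced strings."""
    return min(lps_length(s) for s in reduced_strings)

# e.g. lps_length("abab") == 3, hence ub2(["abba", "abab"]) == 3
```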

Note that neither of the two upper bounds dominates the other. For example, for \(S =\{\texttt {abba}, \texttt {abab}\}\) we get \(\mathrm {UB}_1(S) = 4\) and \(\mathrm {UB}_2(S) = 3\), whereas for \(S= \{\texttt {aba},\texttt {bab}\}\) we get \(\mathrm {UB}_1(S) = 1\) and \(\mathrm {UB}_2(S) = 3\). Therefore, it might be beneficial to consider the minimum of both bounds, \(\mathrm {UB}_3(S) = \min (\mathrm {UB}_1(S), \mathrm {UB}_2(S))\).

Finally, observe that our upper bound functions can also directly be applied to evaluate any partial solution s with its still relevant substrings \(S[p^{\mathrm {L}},p^{\mathrm {R}}]\): While \(\mathrm {UB}_x(S[p^{\mathrm {L}},p^{\mathrm {R}}])\), for \(x\in \{1,2,3\}\), is an upper bound for the length by which s may still be extended, \(|s|+\mathrm {UB}_x(S[p^{\mathrm {L}},p^{\mathrm {R}}])\) is an upper bound for the overall length of still achievable solutions. In the greedy heuristic, our upper bound functions can therefore be used instead of \(g(a,p^{\mathrm {L}},p^{\mathrm {R}})\) by temporarily updating \(p^{\mathrm {L}}\) and \(p^{\mathrm {R}}\) as if letter a were accepted and then calculating \(\mathrm {UB}_x(S[p^{\mathrm {L}},p^{\mathrm {R}}])\).

3 Search Algorithms for the LCPS

In the following we first describe the state graph on which both A\(^*\) and Beam Search will operate. This state graph is a directed acyclic multi-graph \(G=(V,A)\) in which each node (state) \(v=(p^{\mathrm {L},v},p^{\mathrm {R},v})\in V\) corresponds to a unique LCPS subproblem, i.e., a set of still relevant substrings indicated by the respective pointer vectors \(p^{\mathrm {L},v}\) and \(p^{\mathrm {R},v}\). Note that one such node will in general represent multiple different partial solutions in an efficient way. As an example consider \(S=(\texttt {abccdccba}, \texttt {baccdccab})\), and partial solutions \(s = \texttt {ac}\) and \(s' = \texttt {bc}\). It holds that \(p^{\mathrm {L}}=(4,4)\) and \(p^{\mathrm {R}}=(6,6)\) in both cases, and thus, both partial solutions will be represented by a common node. Here, s and \(s'\) have the same length, but this need not be the case in general. Our state graph has the root node \(r=((1,\ldots , 1), (|s_1|,\ldots |s_m|))\) corresponding to the original set of input strings S and representing the empty partial solution. Each node \(v\in V\) stores as additional information the length \(l^v\) of a so far best—i.e., longest—partial solution represented by v. Furthermore, each node \(v\in V\) has an outgoing arc \((v,v',a)\in A\) for each valid extension of the represented partial solutions by a non-dominated non-singleton letter \(a\in \varSigma ^\mathrm {nd}_v = \varSigma ^\mathrm {nd}_{(p^{\mathrm {L},v},p^{\mathrm {R},v})}\).

We emphasize that it is not necessary to store actual partial solutions s in the nodes. As pointed out already in Sect. 2, this is neither necessary for the greedy function evaluation, nor for the upper bound calculation. For any node in the graph a corresponding solution string can finally be derived efficiently in a backward manner by iteratively identifying predecessor nodes whose \(l^v\)-values decrease by one.
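A possible realization of this node bookkeeping in Python (a sketch: for simplicity we keep an explicit predecessor link and arc letter per node for the backward reconstruction, whereas storing only the \(l^v\)-values, as described above, would suffice):

```python
from dataclasses import dataclass

@dataclass
class Node:
    pL: tuple            # left pointers p^{L,v}, one per input string
    pR: tuple            # right pointers p^{R,v}
    l: int = 0           # length l^v of the best partial solution so far
    pred: "Node" = None  # predecessor on a longest path (reconstruction)
    letter: str = ""     # letter on the arc from pred to this node

nodes = {}               # hash map N: (pL, pR) -> Node

def get_or_create(pL, pR, l, pred, letter):
    """Look up state (pL, pR) in N; create it, or update l^v when a
    longer partial solution has been found for an existing node."""
    v = nodes.get((pL, pR))
    if v is None:
        v = nodes[(pL, pR)] = Node(pL, pR, l, pred, letter)
    elif l > v.l:
        v.l, v.pred, v.letter = l, pred, letter
    return v
```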

3.1 A\(^*\) Search for the LCPS Problem

A\(^*\) is a widely used algorithm belonging to the class of informed search methods for finding shortest or longest paths [12]. It maintains two sets of nodes: N stores all so far reached nodes, while Q, the set of open nodes, is the subset of nodes in N that have not yet been expanded, i.e., whose outgoing arcs and respective neighbors have not yet been considered. We realize node set N by means of a hash map in order to be able to efficiently find an already existing node for a state \((p^{\mathrm {L}},p^{\mathrm {R}})\), or to determine that no respective node exists yet. Moreover, Q is a priority queue in which nodes are sorted according to decreasing priority values \(\pi (v) = l^v+\mathrm {UB}_x(S[p^{\mathrm {L},v},p^{\mathrm {R},v}])\), where x specifies the used upper bound.

The pseudo-code of our A\(^*\) search is shown in Algorithm 1. It starts with the root node as the unique node in N and Q. At each step, the first node v of Q—that is, the node with highest priority—is chosen and removed from Q. If this node is non-extensible, it is first checked whether a singleton letter can still be appended as middle letter, and the algorithm then stops. Since our priority function is admissible, cf. [12], we can be sure that an optimal solution has been reached. Otherwise, node v is expanded by considering all possible extensions from \(\varSigma ^\mathrm {nd}_v\). For each obtained new state it is checked whether a respective node already exists in N. If so, the existing node's length value is updated in case the new path to this node represents a new longest partial solution. Otherwise, a corresponding new node is created and added to N and Q.

Finally, we remark that both upper bound functions presented in Sect. 2.2, i.e., \(\mathrm {UB}_1\) and \(\mathrm {UB}_2\), as well as their combination \(\mathrm {UB}_3\), are monotonic (also called consistent) because the upper bound value of an extension of a node is always at most as high as the upper bound value of the originating node. Due to this property we can be sure that no re-expansions of already expanded nodes will be necessary, see again [12].

Algorithm 1. Pseudo-code of the A\(^*\) search for the LCPS problem
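Since the pseudo-code figure is not reproduced here, the following hedged Python sketch outlines the main loop, building on the helpers from the previous sketches (occurrences, dominated, get_or_create, nodes). Two conventions are ours, not the paper's: the priority uses \(2\,l^v\), since every letter of the partial solution s occurs twice in the final palindrome, and stale queue entries are simply tolerated (a node is re-evaluated with its current \(l^v\) when popped). Tie breaking (Sect. 3.4) is omitted.

```python
import heapq, itertools

def reconstruct(v):
    """Follow predecessor links back to the root."""
    s = ""
    while v is not None and v.letter:
        s = v.letter + s
        v = v.pred
    return s

def a_star(S, ub):
    """ub(S, pL, pR) -> upper bound on the remaining extension length."""
    m = len(S)
    tie = itertools.count()          # avoids comparing Node objects
    root = get_or_create(tuple([0] * m),
                         tuple(len(s) - 1 for s in S), 0, None, "")
    Q = [(-(2 * root.l + ub(S, root.pL, root.pR)), next(tie), root)]
    while Q:
        _, _, v = heapq.heappop(Q)
        occ = occurrences(S, list(v.pL), list(v.pR))
        nd = [a for a, (_, _, ca) in occ.items() if ca >= 2 and
              not any(dominated(a, b, occ, m) for b in occ if b != a)]
        if not nd:                   # non-extensible: optimum reached
            s = reconstruct(v)
            mid = min((a for a, (_, _, ca) in occ.items() if ca == 1),
                      default="")
            return s + mid + s[::-1]
        for a in nd:                 # expand v
            qLa, qRa, _ = occ[a]
            w = get_or_create(tuple(q + 1 for q in qLa),
                              tuple(q - 1 for q in qRa), v.l + 1, v, a)
            heapq.heappush(Q, (-(2 * w.l + ub(S, w.pL, w.pR)),
                               next(tie), w))
```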

Diving in A\(^*\). One of the main advantages of A\(^*\) is that the search performs in an asymptotically optimal way with respect to the applied upper bound function, requiring the least possible number of node expansions in order to find a proven optimal solution. On the downside, good approximate solutions are typically only obtained, if at all, very late in the search. To improve this situation and turn our A\(^*\) into an anytime algorithm, which can be terminated almost arbitrarily and still yields a reasonable solution, we augment it by switching at regular intervals to a temporary greedy depth-first search. We call this extension diving. More specifically, diving is initiated at the very beginning and after every \(\delta \) regular A\(^*\) iterations, where \(\delta \) is an external parameter. Starting from the first node taken from Q, i.e., the node with highest priority, we always expand as next node a newly generated immediate successor with highest priority value. This depth-first search is performed until no further newly generated successor exists (i.e., we do not further follow any previously created nodes). If a new best solution is obtained in this way, it is stored in \(s_\mathrm {bsf}\), which is returned in case of an early termination due to an imposed time limit. Note that, when expanding a node during diving, the same steps regarding the update of the nodes in N and Q are performed as in A\(^*\). An important difference is, however, that nodes expanded during diving may require a re-expansion at a later time when a longer partial solution is found for the respective state.
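One dive could look as follows; this is a sketch under the same assumptions and with the same helpers as the A\(^*\) sketch above, where Q is the priority queue, tie the tie counter, and s_bsf the best-so-far solution:

```python
def dive(S, ub, Q, tie, s_bsf):
    """One greedy depth-first dive from the current top node of Q."""
    m = len(S)
    v = Q[0][2]                          # highest-priority open node
    while v is not None:
        occ = occurrences(S, list(v.pL), list(v.pR))
        nd = [a for a, (_, _, ca) in occ.items() if ca >= 2 and
              not any(dominated(a, b, occ, m) for b in occ if b != a)]
        if not nd:                       # complete solution reached
            s = reconstruct(v)
            mid = min((a for a, (_, _, ca) in occ.items() if ca == 1),
                      default="")
            return max(s + mid + s[::-1], s_bsf, key=len)
        best = None
        for a in nd:                     # same N/Q updates as in A*
            qLa, qRa, _ = occ[a]
            pL = tuple(q + 1 for q in qLa)
            pR = tuple(q - 1 for q in qRa)
            is_new = (pL, pR) not in nodes
            w = get_or_create(pL, pR, v.l + 1, v, a)
            pri = 2 * w.l + ub(S, pL, pR)
            heapq.heappush(Q, (-pri, next(tie), w))
            if is_new and (best is None or pri > best[0]):
                best = (pri, w)
        v = best[1] if best else None    # follow only newly created nodes
    return s_bsf                         # dive ended without a solution
```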

3.2 Beam Search for the LCPS Problem

With Beam Search (BS), we further consider an alternative, purely heuristic way of searching the state graph defined at the beginning of this section. BS [21] is a breadth-first search algorithm that limits the number of nodes examined at each level by an upper bound \(\beta > 0\), called the beam width. Before presenting our specific BS for the LCPS problem, we define a dominance relation for nodes of the state graph considered at the same level of BS: Given nodes \(u,v\in V\), we say u dominates v iff \(u\ne v\) and \(p^{\mathrm {L},u}_i < p^{\mathrm {L},v}_i \wedge p^{\mathrm {R},u}_i \ge p^{\mathrm {R},v}_i\) for all \(i=1,\ldots ,m\).

The pseudo-code of our BS is provided in Algorithm 2. The beam B—that is, the set of nodes considered at each step of the algorithm—is initialized with the root node r. Then, at each step, the nodes of the current beam are extended in all possible ways, dominated nodes are filtered in function \(\mathrm {RemoveDominatedEntries}(V_\mathrm {{ext}})\), and the best \(\beta \) nodes with respect to their priority values are selected in function \(\mathrm {Reduce}(V_\mathrm {{ext}}, \beta )\) to obtain the beam for the next iteration.

Algorithm 2. Pseudo-code of the beam search for the LCPS problem
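Again, since the pseudo-code figure is not reproduced, here is a hedged stand-alone sketch of the BS loop using the Node type and helpers from the sketches above. For clarity the dominance filter is the naive quadratic variant, duplicate states within a level are not merged, and the priority again uses the \(2\,l^v\) convention:

```python
def dominates(u, v, m):
    """Node dominance as defined above (nodes on the same BS level)."""
    return u is not v and all(u.pL[i] < v.pL[i] and u.pR[i] >= v.pR[i]
                              for i in range(m))

def beam_search(S, ub, beta):
    m = len(S)
    beam = [Node(tuple([0] * m), tuple(len(s) - 1 for s in S))]
    best = ""
    while beam:
        ext = []
        for v in beam:
            occ = occurrences(S, list(v.pL), list(v.pR))
            nd = [a for a, (_, _, ca) in occ.items() if ca >= 2 and
                  not any(dominated(a, b, occ, m) for b in occ if b != a)]
            if not nd:               # complete solution at this node
                s = reconstruct(v)
                mid = min((a for a, (_, _, ca) in occ.items() if ca == 1),
                          default="")
                best = max(best, s + mid + s[::-1], key=len)
            for a in nd:
                qLa, qRa, _ = occ[a]
                ext.append(Node(tuple(q + 1 for q in qLa),
                                tuple(q - 1 for q in qRa), v.l + 1, v, a))
        ext = [u for u in ext        # RemoveDominatedEntries
               if not any(dominates(w, u, m) for w in ext)]
        ext.sort(key=lambda u: -(2 * u.l + ub(S, u.pL, u.pR)))
        beam = ext[:beta]            # Reduce: keep the beta best nodes
    return best
```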

3.3 Embedding Beam Search in A\(^*\)

Instead of the simple diving described for A\(^*\) in Sect. 3.1, we may also apply the above BS embedded within A\(^*\) at regular intervals, always starting with the first entry of Q as the initial node of beam B. As in simple diving, BS skips any already encountered nodes (i.e., nodes that are already in N are not added to \(V_\mathrm {{ext}}\)) in order to avoid ineffective re-considerations of parts of the state graph. Therefore, it might happen—just as in the case of simple diving—that the embedded BS ends without delivering any complete solution. Moreover, as in simple diving, for all considered extensions of nodes, the same steps regarding the update of the nodes in N and Q are performed as in A\(^*\), cf. Algorithm 1. Finally, note that with beam width \(\beta = 1\) the embedded BS corresponds to simple diving.

3.4 Tie Breaking

While executing preliminary experiments with A\(^*\), we realized that many ties occur when ordering the nodes in the priority queue Q with respect to their priority values \(\pi (v)\). To guide the search in a better way, we decided to use the length of a represented longest partial solution as a secondary criterion in such cases. This improved the performance significantly, but a significant number of ties remained. In order to also break these, it turned out to be beneficial to additionally consider the p-norm, which is defined for a node v as

$$\begin{aligned} ||v||_p =\left( \sum _{i=1}^m \left| p^{\mathrm {R},v}_i -p^{\mathrm {L},v}_i \right| ^p\right) ^{1/p}. \end{aligned}$$
(7)

Given two nodes \(u \not = v\) with the same priority value and the same maximum length of the represented partial solutions, the node with the lower p-norm is preferred. The motivation for using this norm is that the smallest still relevant substrings potentially have a higher impact on the final length of complete solutions than the larger ones. However, considering only the shortest of the still relevant substrings—that is, applying the \(\min \) norm—could be highly misleading. Therefore, a p value from (0, 1) appears meaningful. Following further preliminary experiments, we chose \(p=0.5\) for all experiments discussed in the next section.
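A transcription of (7) together with the resulting three-level sort key (a sketch; pi stands for the priority function \(\pi \), and Python's heapq pops the smallest tuple, hence the negations):

```python
def p_norm(v, p=0.5):
    """Eq. (7) evaluated on the pointer differences of node v."""
    return sum(abs(v.pR[i] - v.pL[i]) ** p
               for i in range(len(v.pL))) ** (1.0 / p)

def queue_key(v, pi):
    """Sort key for Q: higher priority first, then longer partial
    solution, then smaller p-norm."""
    return (-pi(v), -v.l, p_norm(v))
```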

4 Experimental Results

The proposed algorithms were implemented in C++ using GCC 4.7.3 and all experiments were performed as single threads on Intel Xeon E5649 CPUs with 2.53 GHz and a memory limit of 15 GB.

The benchmark instances used in this work were initially introduced in [4] in the context of the LCS problem and are provided at https://www.ac.tuwien.ac.at/wp/wp-content/uploads/LCPS_instances.zip. For each combination of the number of input strings \(m\in \{10,50,100,150,200\}\), the input string length \(n \in \{100,500,1000\}\), and the alphabet size \(|\varSigma | \in \{4,12,20\}\), the set contains 10 randomly generated instances, yielding a total of 450 problem instances.

4.1 Comparison of Upper Bound Functions

In order to study the differences and mutual benefits of the two upper bound functions from Sect. 2.2, BS with \(\beta = 10\) was applied both using only \(\mathrm {UB}_1\) and using \(\mathrm {UB}_3\), that is, the minimum of \(\mathrm {UB}_1\) and \(\mathrm {UB}_2\). The outcome is presented in Table 1. Each row shows average results over the 10 problem instances for each combination of m and n. The results of BS with \(\mathrm {UB}_1\) are presented in terms of the obtained average solution quality (\(\overline{|s|}\)) and the average required computation time (\(\overline{t}[s]\)) in the third and fourth table columns. The corresponding results of BS with \(\mathrm {UB}_3\) are listed in the fifth and sixth columns. The best result per table row is printed bold. The “–” symbol indicates that no complete solution of length greater than zero was derived within a CPU time limit of 900 s because the bound calculation alone already consumed too much time.

Table 1. Comparison of BS with \(\mathrm {UB}_1\) to BS with \(\mathrm {UB}_3\) on the 150 problem instances with \(|\varSigma | =\) 4

The following can be observed. First, when it is not too costly to calculate \(\mathrm {UB}_2\), as is always the case for the instances with \(n=100\) and mostly also for \(n=500\), BS using \(\mathrm {UB}_3\) outperforms BS using only \(\mathrm {UB}_1\). However, the high time complexity of calculating \(\mathrm {UB}_2\)—that is, \(O(mn^2)\)—is a major obstacle for larger problem instances. Because of this limitation, we perform all further experiments for BS, A\(^*\), and the hybrids using only \(\mathrm {UB}_1\).

Nevertheless, the additional four columns in Table 1 clearly indicate that the usage of \(\mathrm {UB}_2\) can be promising also for larger instances. These columns show the percentages of nodes for which \(\mathrm {UB}_2\) dominates \(\mathrm {UB}_1\) (\(> (\%)\)), the percentages of nodes for which \(\mathrm {UB}_1\) dominates \(\mathrm {UB}_2\) (\(< (\%)\)), the percentages of nodes for which both bounds coincide (\(= (\%)\)), and the average absolute differences between \(\mathrm {UB}_1\) and \(\mathrm {UB}_2\) (− (avg)). The results show that \(\mathrm {UB}_2\) dominates \(\mathrm {UB}_1\) especially for long input strings. A promising idea therefore seems to be to either limit the time for calculating \(\mathrm {UB}_2\) or to calculate this bound only for a suitably chosen subset of all nodes. However, these studies are left for future work.

4.2 Main Results

We now compare the performance of our four solution approaches: (1) the greedy algorithm from Sect. 2, henceforth referred to as Greedy; (2) BS; (3) A\(^*\) with simple diving, henceforth referred to as A\(^*\)+Dive; and (4) A\(^*\) with embedded BS, henceforth referred to as A\(^*\)+BS.

Fig. 1. Average final solution lengths and runtimes of BS with different beam widths \(\beta \)

Table 2. Results for \(|\varSigma |=4\)

For deciding how to choose the beam width \(\beta \) in the stand-alone BS as well as in A\(^*\)+BS, we applied BS to each of the 450 problem instances. Average final solution lengths and runtimes are shown in Fig. 1. As expected, with increasing beam width \(\beta \) also the solution quality increases. However, this comes at the cost of an approximately linear increase of the runtime. Since the solution quality for \(\beta =400\) is only slightly better than that with \(\beta =200\), but the required times are about twice as large, we chose \(\beta =200\) for the standalone BS. For the embedded BS, we decided to use \(\beta =10\) due to the still relatively good results and small average runtime of only 0.77 s per instance.

The two variants of A\(^*\) further require a setting for \(\delta \), the number of regular A\(^*\) iterations between diving/BS. We considered 5, 10, 50, 100, 500, and 1000 iterations and conducted preliminary experiments in a similar way as for \(\beta \). Results (not shown) indicated that for \(\delta = 10\), A\(^*\) performs on average slightly but significantly better than with the other values. Therefore, we adopt this setting in our further tests for A\(^*\)+Dive and A\(^*\)+BS.

Results from the comparison of the four solution approaches are presented separately for instances of different alphabet sizes in Tables 2, 3 and 4. Again, shown values are averages over the 10 instances of the same type, and best results from each row are printed bold. Optimal solution values (as determined by A\(^*\)+Dive and/or A\(^*\)+BS) are marked with an asterisk. For each algorithm, the table shows final average solution lengths, average runtimes, and additionally, for the algorithms A\(^*\)+Dive and A\(^*\)+BS the column \(\mathrm {\overline{t}_{best}}[s]\) which shows the average computation times at which the best found solutions were obtained. A limit of 900 s was imposed per run. The following observations can be made:

Table 3. Results for \(|\varSigma |=12\)
Table 4. Results for \(|\varSigma |=20\)
  • By far the fastest algorithm is Greedy. However, Greedy also produces the weakest results in the comparison. Runtimes of the A\(^*\) variants are generally higher, but of course these partly include proofs of optimality.

  • Both A\(^*\)+Dive and A\(^*\)+BS are able to find optimal solutions for all instances with input string length \(n=100\). This corresponds to 15 out of 45 cases (table rows).

  • In most of those cases in which the A\(^*\) variants cannot find optimal solutions, A\(^*\)+BS outperforms A\(^*\)+Dive. This shows the benefit of using BS as embedded heuristic as opposed to simple diving.

  • In those cases where the A\(^*\) versions are able to find optimal solutions and prove their optimality, BS is most of the time also able to find solutions of equal quality. However, this seems to become more difficult for BS when the alphabet size decreases. In particular, BS failed to find all optimal solutions in three out of five cases with \(|\varSigma |=4\).

  • From a purely heuristic point of view, BS outperforms A\(^*\)+BS more and more as the length of the input strings increases. More specifically, while the results obtained by BS and A\(^*\)+BS are comparable for instances with \(n=500\), BS generally outperforms A\(^*\)+BS for instances with \(n=1000\).

5 Conclusions and Future Work

We proposed different algorithms for solving the LCPS problem with an arbitrary number of input strings, both heuristically and exactly. A general state graph was defined that can be searched by different strategies. With BS we provided a purely heuristic search that scales well also to large instances. With A\(^*\) we provided an efficient method for solving instances with up to 200 strings of length up to 100 to proven optimality. Since for instances with even longer strings A\(^*\) search cannot find a complete solution in reasonable time, it is upgraded to an anytime algorithm by embedding either the simple diving or the more advanced BS. For the instances where our hybrid algorithms do not find optimal solutions, the optimality gaps between the final (heuristic) solutions and the corresponding upper bounds produced by A\(^*\) are rather large. The reason for this is that \(\mathrm {UB}_1\) partly provides only rather weak bounds. Using \(\mathrm {UB}_1\) in combination with \(\mathrm {UB}_2\), i.e., \(\mathrm {UB}_3\), would clearly be beneficial from the quality point of view, but the larger time complexity of \(\mathrm {UB}_2\) makes this approach prohibitive for larger instances. In future work, strengthening the upper bounds seems most promising. We believe that this can be achieved by applying \(\mathrm {UB}_2\) only to subproblems up to a certain size or by finding an approximation of \(\mathrm {UB}_2\) that can be calculated faster. Testing the algorithms on real-world instances, e.g., protein, DNA, or virus sequences, would also be interesting, since such instances may have special structures on which the algorithms might perform differently or which might be further exploited.