Efficient CGM-based parallel algorithms for the longest common subsequence problem with multiple substring-exclusion constraints

doi:10.1016/j.parco.2019.102598

Parallel Computing

Volume 91, March 2020, 102598


Highlights

• Details of Wang et al.’s sequential dynamic programming algorithm for solving the M-STR-EC-LCS problem.
• Description of a multi-level Directed Acyclic Graph (task graph) that yields a bottom-up approach following Wang et al.’s recursive formula for solving the M-STR-EC-LCS problem.
• Description of two BSP/CGM parallel algorithms based on our task graph.
• Experimental study.
• Comparison between theoretical and computational results.

Abstract

A variant of the Longest Common Subsequence (LCS) problem is the LCS problem with multiple substring-exclusion constraints (M-STR-EC-LCS), which is important in many fields, especially bioinformatics. The problem is to compute an LCS of two strings X and Y, of lengths n and m respectively, that excludes every string of a set of d constraints $P = \{P_{1}, P_{2}, \dots, P_{d}\}$ of total length r as a substring. Recently, Wang et al. proposed a sequential solution based on dynamic programming that requires $O (n m r)$ execution time and space. To the best of our knowledge, there is no parallel solution for this problem. This paper describes new efficient parallel algorithms on the Coarse Grained Multicomputer (CGM) model to solve it. First, we propose a multi-level Directed Acyclic Graph (DAG) that determines the correct evaluation order of sub-problems, in order to avoid redundancy due to overlap. Second, we propose two CGM parallel algorithms based on our DAG. The first algorithm is based on a regular partitioning of the DAG and requires $O (\frac{n m r}{p})$ execution time with $O (p)$ communication rounds, where p is the number of processors used. Its main drawback is high processor idle time: because of the dependencies between the nodes of the DAG, many processors are idle over the course of the computation. The second algorithm uses an irregular partitioning of the DAG that minimizes this idle time by allowing the processors to stay active as long as possible. It requires $O (\frac{n m r}{p})$ execution time with $O (k p)$ communication rounds, where k is a constant integer that configures the irregular partitioning. Both algorithms require $O (\frac{r | Σ |}{p})$ preprocessing time, where |Σ| is the size of the alphabet. The experimental results show good agreement with the theoretical predictions.

Introduction

Finding the longest common subsequence (LCS) of two sequences is a well-known way of measuring the similarity of two strings, and it is crucial in various applications. It is therefore a well-studied problem in computer science, widely applied in fields such as text and music information retrieval, file comparison, pattern matching, spelling correction and computational biology [1], [2], [3], [4]. Formally, given two strings X and Y, the LCS problem is to find another string Z that is a subsequence of both X and Y and has maximal length. To solve this problem, Wagner and Fischer [5] first proposed a quadratic time and space solution based on dynamic programming, which finds the LCS by computing the edit distance between the sequences. To improve on this solution, many advanced solutions were proposed in the past decade [6], [7], [8], [9].
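To make the quadratic baseline concrete, the following is a minimal sketch of the textbook LCS-length recurrence. It is an illustration only, not a transcription of the algorithm in [5]; the function name `lcs_length` and the table layout are assumptions made for this example.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y (textbook O(n*m) DP)."""
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Matching symbols extend the best solution of the two shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last symbol of x or of y, whichever is better.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

Each cell depends only on its left, upper and upper-left neighbours, which is the dependency pattern that block-based parallelizations of LCS-like problems exploit.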

Many variants of the LCS problem have been proposed for particular applications, for example the mosaic LCS problem, the merged LCS problem, the cyclic chain correction problem and the block editing problem. The most recent variant, and the one that has received the most attention, is the Constrained-LCS (CLCS) problem, first addressed by Tsai [10]. More recently, Chen and Chao [11] proposed a more general form of the CLCS problem (GC-LCS), which is formally defined as follows: given two input sequences $X = x_{1} x_{2} \dots x_{n}$ and $Y = y_{1} y_{2} \dots y_{m}$ of lengths n and m respectively, and a constraint string $P = p_{1} p_{2} \dots p_{r}$ of length r, with r ≤ min (n, m), the GC-LCS problem is a set of four problems that find the LCS of X and Y which includes/excludes P as a subsequence/substring. This set of problems can be generalized further by considering several constraints: instead of a single constraint P of length r, they consider a set of d constraints $\{P_{1}, P_{2}, \dots, P_{d}\}$ of total length r. This generalization is shown in Table 1 [11].

In this work, we tackle the problem of parallelizing the sequential algorithm of Wang et al. [12] for the M-STR-EC-LCS problem on the BSP/CGM model (Bulk Synchronous Parallel/Coarse Grained Multicomputer) [13], [14], [15], [16], [17]. This model seems to be the best suited to the design of algorithms that are not too dependent on a particular architecture. A BSP/CGM machine is a set of p processors, each having its own local memory of size s (with $O (s) ≫ O (1)$ ), connected to each other through a router able to deliver messages in a point-to-point manner. Each BSP/CGM parallel algorithm alternates local computation rounds and global communication rounds. Each communication round consists of routing a single h-relation with $h = O (s)$ . Each CGM computation or communication round corresponds to a BSP super-step with communication cost g × s [17], where g is the cost of communicating one word in the BSP model. To produce an efficient BSP/CGM parallel algorithm, designers must aim to maximize the speed-up and minimize the number of communication rounds (ideally, this number is independent of the problem size, and constant in the best case).

Many CGM-based parallel algorithms have been proposed for problems solved by dynamic programming, including the Minimum Cost Parenthesizing problem and the Optimal Binary Search Tree problem [18], [19]. For the LCS problem, Garcia et al. [20] proposed a CGM-based solution which requires $O (\frac{n^{2}}{p})$ computation time and p communication rounds (n is the length of the two sequences and p is the number of processors). For the all-substrings LCS problem, Alves et al. [21] proposed a solution which runs in $O (\frac{n m}{p})$ time steps and $O (\log p)$ communication rounds on p processors (n and m are the lengths of the original strings).

Myoupo et al. [22] proposed the first CGM-based solutions for the string-excluding constrained LCS problem, which require respectively $O (\frac{n m r}{\sqrt{p}})$ time steps with $O (\sqrt{2 p})$ communication rounds, and $O (\frac{n m r}{p})$ time steps with $O (p)$ communication rounds (n and m are the lengths of the two sequences, and r is the length of the constraint string). However, to the best of our knowledge, there is no parallel solution for the M-STR-EC-LCS problem.

Recently, Chen and Chao [11] proposed an exponential-time solution for the M-STR-IC-LCS and M-STR-EC-LCS problems and therefore stated, without proof, that these problems are NP-hard. Subsequently, Wang et al. [12] proposed a simple dynamic programming algorithm (defined by a recursive formula) for the M-STR-EC-LCS problem that requires $O (n m r)$ execution time and space, where n and m are the lengths of the two input strings and r is the total length of the d constraints. This solution uses a state machine, also called a finite automaton, defined by Aho and Corasick [23], which speeds up the exact matching of multiple patterns.
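The combination of a multi-pattern automaton with a three-dimensional dynamic programming table can be sketched as follows. This is a hedged illustration written for this summary, not Wang et al.’s exact recursive formula: the names `build_automaton` and `mstr_ec_lcs_length`, the forward-style update and the use of -1 for unreachable states are all assumptions. The sketch also assumes every constraint is a non-empty string.

```python
from collections import deque

def build_automaton(patterns, alphabet):
    """Aho-Corasick style automaton: returns (delta, bad).

    delta[s][ch] is the next state; bad[s] is True when the string read so far
    ends with one of the patterns. Patterns are assumed non-empty.
    """
    goto, bad = [{}], [False]
    for pat in patterns:                      # build the trie of all patterns
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                bad.append(False)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        bad[s] = True
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for ch in alphabet:                       # transitions out of the root
        t = goto[0].get(ch, 0)
        delta[0][ch] = t
        if t:
            queue.append(t)
    while queue:                              # breadth-first completion
        s = queue.popleft()
        bad[s] = bad[s] or bad[fail[s]]       # inherit matches through failure links
        for ch in alphabet:
            t = goto[s].get(ch)
            if t is None:
                delta[s][ch] = delta[fail[s]][ch]
            else:
                fail[t] = delta[fail[s]][ch]
                delta[s][ch] = t
                queue.append(t)
    return delta, bad

def mstr_ec_lcs_length(x, y, patterns):
    """Length of a longest common subsequence of x and y that contains no pattern as a substring."""
    alphabet = set(x) | set(y) | set("".join(patterns))
    delta, bad = build_automaton(patterns, alphabet)
    states, n, m = len(delta), len(x), len(y)
    NEG = -1                                  # marks "state k not reachable here"
    # f[i][j][k]: best length over x[:i], y[:j] leaving the automaton in state k.
    f = [[[NEG] * states for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            f[i][j][0] = 0                    # the empty subsequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cell, diag = f[i][j], f[i - 1][j - 1]
            for k in range(states):
                cell[k] = max(cell[k], f[i - 1][j][k], f[i][j - 1][k])
            if x[i - 1] == y[j - 1]:
                for k_bar in range(states):   # extend each reachable state by the match
                    if diag[k_bar] != NEG:
                        k = delta[k_bar][x[i - 1]]
                        if not bad[k]:
                            cell[k] = max(cell[k], diag[k_bar] + 1)
    return max(f[n][m])
```

For instance, with X = "aab", Y = "ab" and the single constraint "ab", the unconstrained LCS "ab" is forbidden, so the sketch returns 1. The table has one entry per triple (i, j, k), which matches the $O (n m r)$ time and space bound quoted above when the alphabet is treated as constant.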

Our contribution is two BSP/CGM parallel algorithms, based on Wang et al.’s sequential algorithm [12], for solving the M-STR-EC-LCS problem. First, we propose a multi-level Directed Acyclic Graph (DAG), called the task graph, that determines the correct evaluation order of sub-problems, in order to avoid redundancy due to overlap and to produce a bottom-up approach that follows Wang et al.’s recursive formula and solves the entire problem. Second, based on our task graph, we propose two BSP/CGM parallel algorithms. They can be summarized as follows:

• In the first algorithm, we subdivide the task graph into sub-graphs (or blocks) of the same size and distribute them evenly among the processors. It requires $O (\frac{n m r}{p})$ execution time with $O (p)$ communication rounds and $O (\frac{r | Σ |}{p})$ preprocessing time, where |Σ| is the size of the alphabet and p is the number of processors used. This strategy promotes load balancing between the processors, but its main drawback is high processor idle time: because of the dependencies between the nodes of the DAG, several processors cannot be active at the same time;
• The second algorithm uses an irregular partitioning, based on the fragmentation of the blocks of the task graph, to reduce the idle time of the processors. It requires $O (\frac{n m r}{p})$ execution time with $O (k p)$ communication rounds, where k is a constant integer defining the number of fragmentations to perform.
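The idle-time drawback of a regular partitioning can be illustrated with a toy wavefront schedule. The sketch below is hypothetical and is not the partitioning defined in the paper: it assumes a p × p grid of equal blocks in which processor q owns block column q, and a block can start only once its upper, left and upper-left neighbours are finished, so block (row, col) runs in round row + col. The names `wavefront_schedule` and `idle_fraction` are introduced here for illustration.

```python
def wavefront_schedule(p: int):
    """For each round, list the (processor, block_row) pairs that are active.

    Assumed layout: a p x p grid of equal blocks, processor q owns column q,
    and block (row, col) becomes ready in round row + col.
    """
    rounds = []
    for step in range(2 * p - 1):
        active = [(col, step - col) for col in range(p) if 0 <= step - col < p]
        rounds.append(active)
    return rounds

def idle_fraction(p: int) -> float:
    """Share of processor-rounds in which a processor has no block to compute."""
    rounds = wavefront_schedule(p)
    busy = sum(len(active) for active in rounds)   # equals p * p blocks in total
    return 1 - busy / (p * len(rounds))
```

Under these assumptions the schedule takes 2p - 1 rounds, which is O(p), but only one processor works in the first and last rounds, and the idle fraction approaches one half as p grows. Splitting blocks into smaller fragments, at the price of more rounds, is one way to let processors start earlier and finish later.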

Finally, we implement our algorithms and find that the experimental results show good agreement with the theoretical predictions.

This paper is organized as follows: in Section 2, we present the principle of using a finite automaton to detect an occurrence of a substring in a sequence and describe its use in Wang et al.’s sequential algorithm. We describe our multi-level DAG model in Section 3. Our BSP/CGM algorithms are presented in Section 4 and Section 5. Section 6 presents experimental results. We end the paper with some concluding remarks and future research directions.

Section snippets

Basic definitions

Definition 1

Let Σ be any finite set of σ symbols, called an alphabet. A sequence (or string) is an ordered list of symbols over Σ. Σ* is the set of all strings over Σ, including the empty string ε. $Σ^{+}$ denotes $Σ^{*} - \{ε\}$ . For any string X, |X| denotes the length of X. Note that $| ε | = 0$ . Let X be a sequence over Σ. A subsequence of X is obtained by deleting zero or more symbols (not necessarily contiguous) from X. A substring of X is a subsequence of consecutive symbols of X.
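The difference between the two notions in Definition 1 can be checked with a short sketch. The helper names `is_subsequence` and `is_substring` are assumptions chosen for this example.

```python
def is_subsequence(z: str, x: str) -> bool:
    """True if z can be obtained from x by deleting zero or more symbols."""
    remaining = iter(x)
    # Each symbol of z must be found in x after the previous match.
    return all(symbol in remaining for symbol in z)

def is_substring(z: str, x: str) -> bool:
    """True if z occurs in x as a run of consecutive symbols."""
    return z in x
```

For example, "ace" is a subsequence of "abcde" but not a substring of it, whereas "bcd" is both.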

Definition 2

For a given sequence $X = x_{1} x_{2} \dots x_{n}$

New graph model: the multi-level DAG

In this section, we subdivide the M-STR-EC-LCS problem into sub-problems and study their dependency relationships in order to organize them into a multi-level Directed Acyclic Graph (DAG). From this DAG, we derive a correct evaluation order which avoids redundancy due to overlap.

From recurrence (1), it follows that the computation of a value of f(i, j, k) depends on the values of $f (i - 1, j, k),$ $f (i, j - 1, k),$ $f (i - 1, j - 1, k)$ and $f (i - 1, j - 1, \bar{k})$ with $k \neq \bar{k}$ . Hence, we can derive the following

First CGM algorithm for the M-STR-EC-LCS problem

This section describes our first CGM algorithm. Like the sequential algorithm, our algorithm is divided into two parts. In the first part, we present a CGM solution for the preliminary computations; in the second part, we propose a parallel strategy to compute the length of the M-STR-EC-LCS.

Second CGM algorithm for the M-STR-EC-LCS problem

This section describes our second CGM algorithm, which is based on an irregular partitioning of the DAG. The parallelization of the preliminary computations is identical to that of the first CGM algorithm in Section 4.1.

Experimental results

Here we present the results of the implementation of our BSP/CGM algorithms for solving the M-STR-EC-LCS problem.

Conclusion and future research directions

In this paper, we presented two BSP/CGM parallel algorithms, based on Wang et al.’s sequential algorithm [12], to solve the M-STR-EC-LCS problem on p processors. First, we proposed a multi-level Directed Acyclic Graph (DAG) that determines the correct evaluation order of sub-problems, in order to avoid redundancy due to overlap and to produce a bottom-up approach that follows Wang et al.’s recursive formula and solves the entire problem. Second, we proposed two BSP/CGM parallel algorithms

Declaration of Competing Interest

The authors declare that they have no conflict of interest.

Acknowledgments

The authors wish to express their gratitude to the Lab-MIS computer laboratory of the University of Picardie Jules Verne, which made it possible to carry out the experiments of this work.

The authors also thank the anonymous reviewers whose valuable comments and suggestions have significantly improved the presentation and the readability of this work.

References (28)

H.-Y. Ann et al., Efficient algorithms for the block edit problems, Inf. Comput. (2010)
F.Y.L. Chin et al., A simple algorithm for the constrained sequence problems, Inf. Process. Lett. (2004)
V. Freschi et al., Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism, Inf. Process. Lett. (2004)
K.S. Huang et al., Dynamic programming algorithms for the mosaic longest common subsequence problem, Inf. Process. Lett. (2007)
Y.-T. Tsai, The constrained longest common subsequence problem, Inf. Process. Lett. (2003)
X. Wang et al., A polynomial time algorithm for a generalized longest common subsequence problem, Green, Pervasive, and Cloud Computing (2016)
A. Apostolico et al., The longest common subsequence problem revisited, Algorithmica (1987)
M. Crochemore et al., Algorithms on Strings (2007)
A. Apostolico, The myriad virtues of subword trees, Combinatorial algorithms on words (1985)
C.Y. Tang et al., Constrained multiple sequence alignment tool development and its application to RNase family alignment, J. Bioinform. Comput. Biol. (2003)
Y. Jingfeng et al., An adaptive strategy applied to memetic algorithms, IAENG Int. J. Comput. Sci. (2015)
Y.-C. Chen et al., On the generalized constrained longest common subsequence problems, J. Combinator. Optim. (2011)
L. Valiant, A bridging model for parallel computation, Commun. ACM (1990)
F. Dehne et al., Scalable parallel computational geometry for coarse grained multicomputers, Int. J. Comput. Geom. (1994)

