Elsevier

Parallel Computing

Volume 91, March 2020, 102598

Efficient CGM-based parallel algorithms for the longest common subsequence problem with multiple substring-exclusion constraints

https://doi.org/10.1016/j.parco.2019.102598

Highlights

  • Details of Wang et al.’s sequential dynamic programming algorithm for solving the M-STR-EC-LCS problem.

  • Description of a multi-level Directed Acyclic Graph (task graph) that yields a bottom-up approach following Wang et al.’s recursive formula for solving the M-STR-EC-LCS problem.

  • Description of two BSP/CGM parallel algorithms based on our task graph.

  • Experimental study.

  • Comparison between theoretical and computational results.

Abstract

A variant of the Longest Common Subsequence (LCS) problem is the LCS problem with multiple substring-exclusion constraints (M-STR-EC-LCS), which is of great importance in many fields, especially bioinformatics. The problem is to compute the LCS of two strings X and Y, of lengths n and m respectively, that contains none of the d constraint strings in P = {P_1, P_2, …, P_d}, of total length r, as a substring. Recently, Wang et al. proposed a sequential solution based on dynamic programming that requires O(nmr) execution time and space. To the best of our knowledge, there are no parallel solutions for this problem. This paper describes new efficient parallel algorithms on the Coarse Grained Multicomputer (CGM) model to solve it. First, we propose a multi-level Directed Acyclic Graph (DAG) that determines the correct evaluation order of sub-problems in order to avoid redundancy due to overlap. Second, we propose two CGM parallel algorithms based on our DAG. The first algorithm is based on a regular partitioning of the DAG and requires O(nmr/p) execution time with O(p) communication rounds, where p is the number of processors used. Its main drawback is the high idle time of the processors: because of the dependencies between the nodes of the DAG, many processors are idle over the course of the computation. The second algorithm uses an irregular partitioning of the DAG that minimizes this idle time by allowing the processors to stay active as long as possible. It requires O(nmr/p) execution time with O(kp) communication rounds, where k is a constant integer that parameterizes the irregular partitioning. Both algorithms require O(r|Σ|/p) preprocessing time, where |Σ| is the size of the alphabet. The experimental results show good agreement with the theoretical predictions.

Introduction

Finding the longest common subsequence (LCS) of two sequences is a well-known way of measuring the similarity of two strings, and it is crucial in various applications. It is therefore a well-studied problem in computer science, widely applied in fields such as text and music information retrieval, file comparison, pattern matching, spelling correction and computational biology [1], [2], [3], [4]. Formally, given two strings X and Y, the LCS problem consists of finding a string Z of maximal length that is a subsequence of both X and Y. To solve this problem, Wagner and Fischer [5] first proposed a quadratic time and space solution based on dynamic programming, which finds the LCS by computing the edit distance between the sequences. To improve on this solution, many advanced solutions were proposed in the past decade [6], [7], [8], [9].
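As an illustration of the quadratic dynamic programming approach mentioned above, the following is a minimal sketch of the textbook LCS-length recurrence. It is not claimed to be the exact formulation of [5], which is stated in terms of edit distance; the function name `lcs_length` is hypothetical.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y.

    Textbook O(nm) time and space recurrence (illustrative sketch only).
    """
    n, m = len(x), len(y)
    # table[i][j] = LCS length of the prefixes x[:i] and y[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Matching symbols extend the best solution of the two shorter prefixes
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last symbol of x or of y
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

For instance, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4 (one such subsequence is "BCBA").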

Many variants of the LCS problem have been proposed for particular applications, for example the mosaic LCS problem, the merged LCS problem, the cyclic chain correction problem and the block editing problem. The variant that has recently received the most attention is the Constrained-LCS (CLCS) problem, first addressed by Tsai [10]. More recently, Chen and Chao [11] proposed a more general form of the CLCS problem (GC-LCS), formally defined as follows: given two input sequences X = x_1 x_2 … x_n and Y = y_1 y_2 … y_m, of lengths n and m respectively, and a constraint string P = p_1 p_2 … p_r of length r, with r ≤ min(n, m), the GC-LCS problem is a set of four problems that find the LCS of X and Y which includes/excludes P as a subsequence/substring. This set of problems can be generalized further by considering several constraints: instead of using a single constraint P of length r, they generalize it to a set of d constraints {P_1, P_2, …, P_d} of total length r. This generalization is shown in Table 1 [11].

In this work, we tackle the problem of parallelizing the sequential algorithm of Wang et al. [12] for the M-STR-EC-LCS problem on the BSP/CGM model (Bulk Synchronous Parallel/Coarse Grained Multicomputer) [13], [14], [15], [16], [17]. This model appears well suited to the design of algorithms that do not depend too heavily on a particular architecture. A BSP/CGM machine is a set of p processors, each having its own local memory of size s (with O(s) ≫ O(1)), connected to each other through a router able to deliver messages in a point-to-point manner. Each BSP/CGM parallel algorithm is an alternation of local computations and global communication rounds. Each communication round consists of routing a single h-relation with h = O(s). Each CGM computation or communication round corresponds to a BSP super-step with communication cost g × s [17], where g is the cost of communicating one word in the BSP model. To produce an efficient BSP/CGM parallel algorithm, designers must strive to maximize the speed-up and minimize the number of communication rounds (ideally, this number should be independent of the problem size, and constant in the best case).

Many CGM-based parallel algorithms for problems modelled by dynamic programming, including the Minimum Cost Parenthesizing problem and the Optimal Binary Search Tree problem, have been proposed [18], [19]. For the LCS problem, Garcia et al. [20] proposed a CGM-based solution that requires O(n²/p) computation time and p communication rounds (n is the length of the two sequences and p is the number of processors). For the all-substrings LCS problem, Alves et al. [21] proposed a solution that runs in O(nm/p) time steps and O(log p) communication rounds on p processors (n and m are the lengths of the original strings).

Myoupo et al. [22] proposed the first CGM-based solutions for the string-excluding constrained LCS problem, which require respectively O(nmr/p) time steps with O(2p) communication rounds, and O(nmr/p) time steps with O(p) communication rounds (n and m are the lengths of the two sequences, and r is the length of the constraint string). However, to the best of our knowledge, there is no parallel solution for the M-STR-EC-LCS problem.

Recently, Chen and Chao [11] proposed an exponential-time solution for the M-STR-IC-LCS and M-STR-EC-LCS problems and stated, without proof, that both are NP-hard. Wang et al. [12] then proposed a simple dynamic programming algorithm (defined by a recursive formula) for the M-STR-EC-LCS problem that requires O(nmr) execution time and space, where n and m are the lengths of the two input strings and r is the total length of the d constraints. This solution uses a state machine, also called a finite automaton, defined by Aho and Corasick [23], which speeds up the exact matching of multiple patterns.
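To make the combination of dynamic programming and a multi-pattern automaton concrete, the following is a minimal sequential sketch. It is written in the spirit of the O(nmr) approach described above but is not claimed to reproduce the exact recurrence of [12]; the names `build_automaton` and `m_str_ec_lcs_length`, and the choice to derive the alphabet from the inputs, are assumptions made for illustration. Constraint strings are assumed to be non-empty.

```python
from collections import deque

def build_automaton(patterns, alphabet):
    """Aho-Corasick automaton: full transition table plus a 'forbidden' flag per state."""
    goto = [{}]          # trie edges, state 0 is the root
    forbidden = [False]  # True if some pattern is a suffix of the string read so far
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                forbidden.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        forbidden[state] = True
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for ch in alphabet:
        nxt = goto[0].get(ch, 0)
        delta[0][ch] = nxt
        if nxt:
            queue.append(nxt)
    while queue:  # breadth-first, so fail[state] is always completed first
        state = queue.popleft()
        forbidden[state] = forbidden[state] or forbidden[fail[state]]
        for ch in alphabet:
            if ch in goto[state]:
                nxt = goto[state][ch]
                fail[nxt] = delta[fail[state]][ch]
                delta[state][ch] = nxt
                queue.append(nxt)
            else:
                delta[state][ch] = delta[fail[state]][ch]
    return delta, forbidden

def m_str_ec_lcs_length(x, y, patterns):
    """Length of an LCS of x and y containing no pattern as a substring (sketch)."""
    alphabet = sorted(set(x) | set(y) | set("".join(patterns)))
    delta, forbidden = build_automaton(patterns, alphabet)
    states, n, m = len(delta), len(x), len(y)
    NEG = -1  # marks "no valid common subsequence ends in this state"
    # f[i][j][k]: best length for x[:i], y[:j] with the automaton left in state k
    f = [[[NEG] * states for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            f[i][j][0] = 0  # the empty subsequence is always valid
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cur = f[i][j]
            for k in range(states):
                cur[k] = max(cur[k], f[i - 1][j][k], f[i][j - 1][k])
            if x[i - 1] == y[j - 1]:
                prev = f[i - 1][j - 1]
                for q in range(states):
                    if prev[q] == NEG:
                        continue
                    k = delta[q][x[i - 1]]
                    if not forbidden[k]:  # appending the symbol completes no pattern
                        cur[k] = max(cur[k], prev[q] + 1)
    return max(f[n][m])
```

With `x = y = "abc"` and the single constraint `"b"`, the sketch returns 2, corresponding to the subsequence "ac". The table has one entry per triple (i, j, k), which is where the O(nmr) space bound comes from, since the automaton has at most r + 1 states.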

Our contribution is two BSP/CGM parallel algorithms, based on Wang et al.’s sequential algorithm [12], that solve the M-STR-EC-LCS problem. First, we propose a multi-level Directed Acyclic Graph (DAG), called the task graph, that determines the correct evaluation order of sub-problems, in order to avoid redundancy due to overlap and to obtain a bottom-up approach that follows Wang et al.’s recursive formula and solves the entire problem. Second, based on our task graph, we propose two BSP/CGM parallel algorithms. They can be summarized as follows:

  • In the first algorithm, we subdivide the task graph into sub-graphs (or blocks) of the same size and distribute them fairly among the processors. It requires O(nmr/p) execution time with O(p) communication rounds and O(r|Σ|/p) preprocessing time, where |Σ| is the size of the alphabet and p is the number of processors used. This strategy promotes load balancing between the processors, but its main drawback is the high idle time of the processors. Indeed, because of the dependencies between the nodes of the DAG, several processors cannot be active at the same time;

  • The second algorithm uses an irregular partitioning, based on the fragmentation of the blocks of the task graph, to reduce the idle time of the processors. It requires O(nmr/p) execution time with O(kp) communication rounds, where k is a constant integer defining the number of fragmentations to perform.
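The idle-time issue of a regular partitioning can be illustrated with a small counting sketch. The layout used here is an assumption made for illustration and is not taken from the paper: the (i, j) plane is cut into a p × p grid of equal blocks, block (a, b) depends on blocks (a−1, b), (a, b−1) and (a−1, b−1), blocks on the same anti-diagonal are evaluated in one round, and each processor evaluates at most one block per round. The function names are hypothetical.

```python
def wavefront_rounds(p: int):
    """Group the blocks of a p x p grid by anti-diagonal a + b.

    Assumption (illustrative only): block (a, b) needs blocks (a-1, b),
    (a, b-1) and (a-1, b-1), so blocks sharing the same a + b are
    independent and can be evaluated in the same round.
    """
    return [[(a, d - a) for a in range(p) if 0 <= d - a < p]
            for d in range(2 * p - 1)]

def idle_processor_rounds(p: int) -> int:
    """Count (processor, round) slots with no block to evaluate, for p processors."""
    return sum(p - len(blocks) for blocks in wavefront_rounds(p))
```

Under these assumptions the number of rounds is 2p − 1, which is O(p), while only the middle round keeps all p processors busy. For p = 4 there are 7 rounds and 12 of the 28 processor slots are idle, which is the kind of waste that a finer, irregular fragmentation of the blocks aims to reduce.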

Finally, we implement our algorithms; the experimental results show good agreement with the theoretical predictions.

This paper is organized as follows: in Section 2, we present the principle of using a finite automaton to detect an occurrence of a substring in a sequence and describe its use in Wang et al.’s sequential algorithm. We describe our multi-level DAG model in Section 3. Our BSP/CGM algorithms are presented in Section 4 and Section 5. Section 6 presents experimental results. We end this paper with some concluding remarks and future research directions.

Basic definitions

Definition 1

Let Σ be any finite set of σ symbols, called an alphabet. A sequence is an ordered list of symbols over Σ. Σ* is the set of all strings over Σ, including the empty string ε, and Σ+ denotes Σ* \ {ε}. For any string X, |X| denotes the length of X; note that |ε| = 0. Let X be a sequence over Σ. A subsequence of X is obtained by deleting zero or more symbols (not necessarily contiguous) from X. A substring of X is a subsequence made of successive symbols of X.
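The difference between the two notions can be checked with a short sketch; the helper names `is_subsequence` and `is_substring` are hypothetical and serve only to illustrate the definition.

```python
def is_subsequence(z: str, x: str) -> bool:
    """True if z can be obtained from x by deleting zero or more symbols."""
    remaining = iter(x)
    # Each membership test consumes x up to and including the matched symbol,
    # so the symbols of z must appear in x in the same order.
    return all(ch in remaining for ch in z)

def is_substring(z: str, x: str) -> bool:
    """True if z occurs in x as a run of successive symbols."""
    return z in x
```

For example, "ace" is a subsequence of "abcde" but not a substring of it, whereas "bcd" is both.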

Definition 2

For a given sequence X = x_1 x_2 … x_n

New graph model: the multi-level DAG

In this section, we subdivide the M-STR-EC-LCS problem into sub-problems and study their dependency relationships in order to organize them into a multi-level Directed Acyclic Graph (DAG). From this DAG, we derive a correct evaluation order which avoids redundancy due to overlap.

From recurrence (1), it follows that the computation of a value f(i, j, k) depends on the values f(i−1, j, k), f(i, j−1, k), f(i−1, j−1, k) and f(i−1, j−1, k̄) with k ≠ k̄. Hence, we can derive the following

First CGM algorithm for the M-STR-EC-LCS problem

This section describes our first CGM algorithm. Like the sequential algorithm, it is divided into two parts. In the first part, we present a CGM solution for the preliminary computations; in the second part, we propose a parallel strategy to compute the length of the M-STR-EC-LCS.

Second CGM algorithm for the M-STR-EC-LCS problem

This section describes our second CGM algorithm, which is based on an irregular partitioning of the DAG. The parallelization of the preliminary computations is identical to that of the first CGM algorithm in Section 4.1.

Experimental results

Here we present the results of the implementation of our BSP/CGM algorithms for solving the M-STR-EC-LCS problem.

Conclusion and future research directions

In this paper, we presented two BSP/CGM parallel algorithms, based on Wang et al.’s sequential algorithm [12], that solve the M-STR-EC-LCS problem on p processors. First, we proposed a multi-level Directed Acyclic Graph (DAG) that determines the correct evaluation order of sub-problems, in order to avoid redundancy due to overlap and to obtain a bottom-up approach that follows Wang et al.’s recursive formula and solves the entire problem. Second, we proposed two BSP/CGM parallel algorithms

Declaration of Competing Interest

The authors declare that they have no conflict of interest.

Acknowledgments

The authors wish to express their gratitude to the computer Lab-MIS of the University of Picardie Jules Verne, which made it possible to carry out the experiments of this work.

The authors also thank the anonymous reviewers whose valuable comments and suggestions have significantly improved the presentation and the readability of this work.

References (28)

  • Y. Jingfeng et al.

    An adaptive strategy applied to memetic algorithms

    IAENG Int. J. Comput. Sci.

    (2015)
  • Y.-C. Chen et al.

    On the generalized constrained longest common subsequence problems

    J. Combinator. Optim.

    (2011)
  • L. Valiant

    A bridging model for parallel computation

    Commun. ACM

    (1990)
  • F. Dehne et al.

    Scalable parallel computational geometry for coarse grained multicomputers

    Int. J. Comput. Geom.

    (1994)