A hybrid genetic algorithm for the repetition free longest common subsequence problem
Introduction
One of the most important problems in the field of string algorithms is the computation of the longest common subsequence (LCS) of two sequences over a given alphabet. Its importance is mostly due to its wide range of applications, which cover many research fields. For example, several variants of the LCS problem are used in bioinformatics to analyze DNA or RNA sequences. Other interesting applications of variants of the LCS problem can be found in [2]. More specifically, the LCS problem between two given strings asks for a longest sequence that is a subsequence of both input strings (see Section 2 for a formal definition of the problem). Although the LCS problem can be solved in polynomial time (see [14]), for example with a dynamic programming approach (see [7]), its generalization to a set of sequences, which asks for the longest sequence that is a subsequence of all the input sequences, is NP-hard (see [12]).
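The dynamic programming approach mentioned above can be sketched as follows. This is a minimal textbook-style illustration in Python, not code from the paper; the function name `lcs` and the table layout are assumptions made for this example.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (hypothetical helper)."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("AGGTAB", "GXTXAYB"))
```

The table has (n + 1) × (m + 1) entries and each is filled in constant time, which is where the polynomial running time for two sequences comes from.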
Different variants of the longest common subsequence problem have been proposed [4], [5], [6], [17] to compare biological sequences, where the computed longest common subsequence of two given strings is required to satisfy some constraints. An interesting variant of the LCS problem is the so-called repetition-free longest common subsequence which, given two sequences, asks for the longest sequence that is a subsequence of both and contains at most one occurrence of each symbol [1]. This variant is used to model genome rearrangement with respect to the duplication of some genes. The goal of this kind of study is to infer the (supposed) original sequence of genes, in which every gene has only one occurrence, starting from a set of sequences, each one (possibly) containing multiple copies of some genes [16]. This particular problem has been extensively investigated, and the complexity results can be found in [8]. Finally, other recent studies have focused on the so-called doubly-constrained longest common subsequence (DC-LCS), which generalizes the mathematical formulation of the repetition-free LCS problem (and of other constrained variants of the LCS problem). In particular, some complexity results on this problem can be found in [5], while the work proposed in [3] discusses the parameterized complexity of the problem.
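To make the repetition-free constraint concrete, the sketch below enumerates candidate subsequences exhaustively. It is an illustration of the problem definition only, takes exponential time, and is not the method proposed in the paper; the names `is_subsequence` and `rflcs_brute_force` are hypothetical.

```python
from itertools import combinations


def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by deleting zero or more characters."""
    it = iter(s)
    return all(ch in it for ch in t)


def rflcs_brute_force(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b with no repeated symbol.

    Exhaustive search, suitable only for very short inputs.
    """
    # A repetition-free solution cannot be longer than the number of shared symbols.
    upper = min(len(a), len(b), len(set(a) & set(b)))
    for k in range(upper, 0, -1):
        for idx in combinations(range(len(a)), k):
            cand = "".join(a[i] for i in idx)
            # Keep the candidate only if all its symbols are distinct
            # and it is also a subsequence of the second input.
            if len(set(cand)) == k and is_subsequence(cand, b):
                return cand
    return ""


if __name__ == "__main__":
    print(rflcs_brute_force("abab", "baba"))
```

For the inputs "abab" and "baba" an unconstrained LCS has length 3, while any repetition-free solution has length at most 2 because only two distinct symbols are available.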
In this paper we define a hybrid genetic algorithm (GA) to address the repetition-free LCS problem. The idea of the proposed technique is to combine standard genetic algorithms with estimation of distribution algorithms. In particular, a genetic algorithm is used to explore the search space, while an estimation of distribution algorithm provides a simple, efficient and theoretically sound method to guarantee that the constraints of the problem are satisfied.
The paper is organized as follows: Section 2 introduces the basic definitions that will be used in the rest of the paper; Section 3 introduces genetic algorithms (GAs), describing their computational model, while Section 4 focuses on estimation of distribution algorithms (EDAs), presenting their properties and their main differences from GAs. Section 5 presents the hybrid algorithm developed to solve the repetition-free LCS problem, while Section 6 describes the experimental settings and discusses the results, comparing them with those obtained with other approximation algorithms.
Section snippets
Basic definitions
In this section we give some basic concepts and notations that will be used in the rest of the paper. We consider strings over a finite alphabet, and refer both to the symbol occurring at a given position of a string and to the substring that starts at one position and ends at another. Given two sequences over a finite alphabet, the first is a subsequence of the second if it can be obtained from the second by removing some (possibly zero) characters. When is a
Genetic algorithm
Genetic Algorithms (GAs) [9], [10] are a class of computational models that mimic the process of natural evolution. GAs are often viewed as function optimizers although the range of problems to which they have been applied is quite broad. Although different variants of GAs exist, most of the methods called “GAs” have at least the following elements in common: populations of chromosomes, selection according to fitness, crossover to produce new offspring, and random mutation of new offspring.
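The four common elements listed above (a population of chromosomes, fitness-based selection, crossover, and mutation) can be sketched as a small generic loop. This is a minimal illustration, not the paper's algorithm: the function name `run_ga`, the binary tournament selection, the one-point crossover, and all parameter values are assumptions chosen for brevity.

```python
import random


def run_ga(fitness, length, pop_size=30, generations=50,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Maximize `fitness` over binary chromosomes of the given length (sketch)."""
    rng = random.Random(seed)
    # Population of chromosomes: random bit lists.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]

    def tournament():
        # Selection according to fitness: the better of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tournament(), tournament()
            # One-point crossover produces a new offspring.
            if length > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, length)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Random mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness)
    return best


if __name__ == "__main__":
    # Toy fitness: number of ones in the chromosome.
    print(run_ga(sum, 20))
```

The toy fitness `sum` is used here only to exercise the loop; a real application would supply a problem-specific fitness function.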
Estimation of distribution algorithm
As observed in [15], from an abstract point of view, the selected set of promising solutions can be viewed as a sample drawn from a probability distribution. Although the true probability distribution is unknown, there are algorithms that are able to estimate that probability distribution by using the selected set of solutions itself and use this estimate to generate new solutions. These algorithms are called estimation of distribution algorithms (EDAs) [11]. In EDAs better solutions are
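The estimate-then-sample idea described above can be illustrated with a univariate marginal model, in which each bit position gets an independent probability estimated from the selected solutions. This is a sketch under that assumption; the source does not state which probabilistic model the paper uses, and the name `run_umda`, the truncation selection, and the probability clamping are choices made for this example.

```python
import random


def run_umda(fitness, length, pop_size=40, n_selected=20,
             generations=30, seed=0):
    """Maximize `fitness` with a univariate-marginal EDA over bit strings (sketch)."""
    rng = random.Random(seed)
    probs = [0.5] * length  # initial model: every bit is 1 with probability 0.5
    best = None
    for _ in range(generations):
        # Generate new solutions by sampling the current probability model.
        pop = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        # Select the most promising solutions (truncation selection).
        pop.sort(key=fitness, reverse=True)
        selected = pop[:n_selected]
        if best is None or fitness(pop[0]) > fitness(best):
            best = pop[0]
        # Estimate the distribution from the selected set itself.
        probs = [sum(ind[i] for ind in selected) / n_selected
                 for i in range(length)]
        # Keep probabilities away from 0 and 1 so sampling stays diverse.
        probs = [min(max(p, 0.05), 0.95) for p in probs]
    return best


if __name__ == "__main__":
    # Toy fitness: number of ones in the bit string.
    print(run_umda(sum, 20))
```

Unlike the crossover and mutation operators of a GA, new solutions here come only from sampling the estimated model.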
Method
In this section, the proposed algorithm used to address the repetition-free LCS problem is presented. The input of the problem consists of two sequences over a given alphabet. We assume, without loss of generality, that the two sequences have the same length. A candidate solution is encoded as a binary string that has the same length as the two input sequences. In more detail, a ‘1’ in a particular position indicates that the
Experimental study
In this section the experimental phase is described. To assess the performance of the proposed hybrid GA, three existing algorithms designed to address the repetition-free LCS problem have been considered. The three algorithms are presented in [1], and Section 6.2 gives a brief description of them. Using the same notation defined in [1], from now on the three algorithms used as benchmarks will be referred to as A1, A2 and A3. The proposed hybrid GA is tested against the
References (17)
- et al., Repetition-free longest common subsequence, Discrete Applied Mathematics (2010)
- et al., On the parameterized complexity of the repetition free longest common subsequence problem, Information Processing Letters (2012)
- et al., Variants of constrained longest common subsequence, Information Processing Letters (2010)
- et al., A simple algorithm for the constrained sequence problems, Information Processing Letters (2004)
- The constrained longest common subsequence problem, Information Processing Letters (2003)
- et al., A survey of longest common subsequence algorithms
- et al., Exemplar longest common subsequence, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2007)
- et al., Introduction to Algorithms (2001)