Elsevier

Operations Research Letters

Volume 41, Issue 6, November 2013, Pages 644-649
A hybrid genetic algorithm for the repetition free longest common subsequence problem

https://doi.org/10.1016/j.orl.2013.09.002

Abstract

Computing the longest common subsequence of two sequences is one of the most studied algorithmic problems. In this work we focus on a particular variant of the problem, called repetition free longest common subsequence (RF-LCS), which has been proved to be NP-hard. We propose a hybrid genetic algorithm, which combines standard genetic algorithms and estimation of distribution algorithms, to solve this problem. An experimental comparison with some well-known approximation algorithms shows the suitability of the proposed technique.

Introduction

One of the most important problems in the field of algorithms on strings is the computation of the longest common subsequence (LCS) of two sequences over a given alphabet. Its importance is mostly due to its wide range of applications, covering many research fields. For example, several variants of the LCS problem are used in bioinformatics to analyze DNA or RNA sequences. Other interesting applications of variants of the LCS problem can be found in [2]. More specifically, the LCS problem between two given strings S1 and S2 asks for a longest sequence S that is a subsequence of both input strings S1 and S2 (see Section 2 for a formal definition of the problem). Although the LCS problem can be solved in polynomial time (see [14]), for example using a dynamic programming approach (see [7]), its generalization to a set of sequences, which asks for the longest sequence that is a subsequence of all the input sequences, is NP-hard (see [12]).
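The dynamic programming approach mentioned above can be sketched as follows. This is the standard textbook recurrence for two strings, given here only as an illustration; the function name `lcs` and the tie-breaking rule in the traceback are assumptions of this sketch and are not taken from [7] or [14].

```python
def lcs(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2 (illustrative sketch)."""
    n, m = len(s1), len(s2)
    # table[i][j] holds the LCS length of the prefixes s1[:i] and s2[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal solution.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

The table has (|S1| + 1) × (|S2| + 1) entries and each is filled in constant time, which is where the polynomial bound comes from.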

Different variants of the longest common subsequence problem have been proposed [4], [5], [6], [17] to compare biological sequences, where, given two strings S1 and S2, the computed longest common subsequence is required to satisfy some constraints. An interesting variant of the LCS problem is the so-called repetition-free longest common subsequence (RF-LCS) which, given two sequences S1 and S2, asks for the longest sequence S that is a subsequence of both S1 and S2 and that contains at most one occurrence of each symbol [1]. This variant is used to model genome rearrangement in the presence of gene duplications. The goal is to infer the (supposed) original sequence of genes, in which every gene has only one occurrence, starting from a set of sequences, each one (possibly) containing multiple copies of some genes [16]. This particular problem has been extensively investigated, and the complexity results can be found in [8]. Finally, other recent studies have focused on the so-called doubly-constrained longest common subsequence (DC-LCS), which generalizes the mathematical formulation of the RF-LCS (and of other constrained variants of the LCS problem). In particular, some complexity results on this problem can be found in [5], while the work proposed in [3] discusses the parameterized complexity of the RF-LCS problem.
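To make the RF-LCS definition concrete, the following is a minimal exhaustive-search sketch for very small inputs. It is not the method of this paper nor any of the algorithms in [1]; it takes exponential time and is meant only to show what a valid solution looks like. The names `is_subsequence` and `rf_lcs_bruteforce` are hypothetical.

```python
def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by deleting zero or more characters."""
    it = iter(s)
    return all(ch in it for ch in t)  # each lookup consumes the iterator


def rf_lcs_bruteforce(s1: str, s2: str) -> str:
    """Return one repetition-free LCS of s1 and s2 by exhaustive search."""
    best = ""

    def extend(prefix: str, start: int) -> None:
        nonlocal best
        if len(prefix) > len(best):
            best = prefix
        for i in range(start, len(s1)):
            ch = s1[i]
            if ch in prefix:
                continue  # repetition-free: each symbol at most once
            candidate = prefix + ch
            # Pruning is safe: every prefix of a common subsequence
            # is itself a common subsequence.
            if is_subsequence(candidate, s2):
                extend(candidate, i + 1)

    extend("", 0)
    return best
```

For S1 = "abab" and S2 = "baba", an unconstrained LCS has length 3 (for example "aba"), whereas any repetition-free common subsequence has length at most 2, since only two distinct symbols are available.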

In this paper we define a hybrid genetic algorithm (GA) to address the RF-LCS problem. The idea of the proposed technique is to combine standard genetic algorithms with estimation of distribution algorithms. In particular, a genetic algorithm is used to explore the search space, while an estimation of distribution algorithm provides a simple, efficient and theoretically sound method to guarantee that the constraints of the RF-LCS problem are satisfied.

The paper is organized as follows: Section 2 introduces the basic definitions that will be used in the rest of the paper; Section 3 introduces genetic algorithms (GAs), describing their computational model, while Section 4 focuses on estimation of distribution algorithms (EDAs), presenting their properties and the main differences with GAs. Section 5 presents the hybrid algorithm developed to solve the RF-LCS problem, while Section 6 describes the experimental settings and discusses the obtained results, comparing them with the ones obtained with other approximation algorithms.

Basic definitions

In this section we give some basic concepts and notation that will be used in the rest of the paper. Let S be a string over an alphabet Σ of size |Σ|. We denote by S[i] the symbol occurring at position i in S, and by S[i, j] the substring of S starting at position i and ending at position j. Given two sequences S and S′ over a finite alphabet Σ, S′ is a subsequence of S if S′ can be obtained from S by removing some (possibly zero) characters. When S is a

Genetic algorithm

Genetic Algorithms (GAs)  [9], [10] are a class of computational models that mimic the process of natural evolution. GAs are often viewed as function optimizers although the range of problems to which they have been applied is quite broad. Although different variants of GAs exist, most of the methods called “GAs” have at least the following elements in common: populations of chromosomes, selection according to fitness, crossover to produce new offspring, and random mutation of new offspring.
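The four common elements listed above (a population of chromosomes, fitness-based selection, crossover and mutation) can be sketched on binary chromosomes as follows. This is a generic illustration, not the hybrid algorithm of this paper: binary tournament selection, one-point crossover, the parameter values and the name `genetic_algorithm` are all assumptions of the sketch.

```python
import random


def genetic_algorithm(fitness, length, pop_size=30, generations=50,
                      mutation_rate=0.02, seed=0):
    """Maximize `fitness` over bit lists of the given length (length >= 2)."""
    rng = random.Random(seed)
    # Population of chromosomes: random bit lists.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def select():
        # Selection according to fitness: binary tournament (assumed here).
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = select(), select()
            # One-point crossover produces a new offspring.
            cut = rng.randint(1, length - 1)
            child = p1[:cut] + p2[cut:]
            # Random mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        # Remember the best chromosome seen so far.
        best = max(population + [best], key=fitness)
    return best
```

A call such as `genetic_algorithm(sum, 20)` searches for the bit string with the most ones, a common toy objective for checking that a GA implementation behaves sensibly.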

Estimation of distribution algorithm

As observed in  [15], from an abstract point of view, the selected set of promising solutions can be viewed as a sample drawn from a probability distribution. Although the true probability distribution is unknown, there are algorithms that are able to estimate that probability distribution by using the selected set of solutions itself and use this estimate to generate new solutions. These algorithms are called estimation of distribution algorithms (EDAs)  [11]. In EDAs better solutions are
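The estimate-then-sample loop described above can be sketched with the simplest kind of EDA, in which each bit is modeled by an independent marginal probability (a univariate model in the style of UMDA). The choice of this model, the clamping bounds, the parameter values and the name `umda` are assumptions of the sketch; the probabilistic model actually used in this paper is not shown in this excerpt.

```python
import random


def umda(fitness, length, pop_size=50, n_selected=25, generations=30, seed=0):
    """Univariate EDA sketch: returns (best bit list found, final marginals)."""
    rng = random.Random(seed)
    probs = [0.5] * length  # initial model: every bit equally likely 0 or 1
    best = None
    for _ in range(generations):
        # Generate new solutions by sampling from the current estimate.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        # Keep the most promising solutions.
        population.sort(key=fitness, reverse=True)
        selected = population[:n_selected]
        if best is None or fitness(selected[0]) > fitness(best):
            best = selected[0]
        # Estimate the distribution from the selected set itself.
        probs = [sum(ind[i] for ind in selected) / n_selected
                 for i in range(length)]
        # Clamp the marginals so that no bit value becomes impossible.
        probs = [min(max(p, 0.05), 0.95) for p in probs]
    return best, probs
```

Unlike the GA sketch, no crossover or mutation operator appears here: variation comes entirely from sampling the estimated distribution.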

Method

In this section, the proposed algorithm used to address the RF-LCS problem is presented. Let S1 and S2 be the two sequences over an alphabet Σ that represent the input of the problem. We assume, without loss of generality, that the two sequences have the same length l. A possible solution for the RF-LCS problem is encoded by means of a binary string K that has the same length as the two sequences S1 and S2. In more detail, ‘1’ in a particular position i (with i ≤ l) indicates that the

Experimental study

In this section the experimental phase is described. In order to assess the performance of the proposed hybrid GA, three existing algorithms designed to address the RF-LCS problem have been considered. The three algorithms are presented in [1], and Section 6.2 reports a brief description of them. Using the same notation defined in [1], from now on the three algorithms used as benchmarks will be referred to as A1, A2 and A3. The proposed hybrid GA is tested against the

Cited by (16)

  • Solving longest common subsequence problems via a transformation to the maximum clique problem

    2021, Computers and Operations Research
    Citation Excerpt :

    One of the common measures when comparing two (or more) strings is the length of their longest common subsequence (Iliopoulos and Sohel Rahman, 2009; Castelli et al., 2013).

  • Hybrid techniques based on solving reduced problem instances for a longest common subsequence problem

    2018, Applied Soft Computing
    Citation Excerpt :

    In particular, arc-annotated sequences have shown to be useful for the structural comparison of RNA sequences. One of the usual measures when comparing two (or more) sequences is the length of their longest common subsequence (LCS); see, for example, [4,5]. In this context, given a sequence x over a finite alphabet Σ, sequence t is called a subsequence of x, if t can be produced from x by deleting characters.

  • Repetition-free longest common subsequence of random sequences

    2016, Discrete Applied Mathematics
    Citation Excerpt :

    Some other extensions of these two measures were considered under the name of constrained LCS and doubly-constrained LCS [7]. All of these variants were shown to be hard to compute [1,5–7], so some heuristics and approximation algorithms for them were proposed and experimentally tested [1,6,14,10]. It is worth mentioning that the above discussion refers to sequences randomly, uniformly, and independently chosen over an alphabet.

  • On the role of metaheuristic optimization in bioinformatics

    2023, International Transactions in Operational Research

  • Longest Order Conserved Repetition-free Subsequences

    2023, Proceedings - 2023 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2023