Comparing multiobjective swarm intelligence metaheuristics for DNA motif discovery

doi:10.1016/j.engappai.2012.06.014

Engineering Applications of Artificial Intelligence

Volume 26, Issue 1, January 2013, Pages 314-326

https://doi.org/10.1016/j.engappai.2012.06.014

Abstract

In recent years, a large number of biological problems have been successfully addressed through computational techniques, among which we highlight metaheuristics. Most of these problems are directly related to genomics, that is, the study of the genomes of microorganisms, plants, and animals. In this work, we solve a DNA sequence analysis problem called the Motif Discovery Problem (MDP) by using two novel algorithms based on swarm intelligence: Artificial Bee Colony (ABC) and the Gravitational Search Algorithm (GSA). To guide the pattern search toward solutions with greater biological relevance, we have redefined the problem formulation and incorporated several biological constraints that every solution must satisfy. One of the most important characteristics of the problem definition is the use of multiobjective optimization (MOO), maximizing three conflicting objectives: motif length, support, and similarity. We have therefore adapted our algorithms to the multiobjective context. This paper presents an exhaustive comparison of both multiobjective proposals on instances of different kinds: real instances, generic instances, and instances generated according to a Markov chain. To analyze their behavior we have used several indicators and statistics, comparing their results with those obtained by standard multiobjective algorithms and by 14 well-known biological methods.

Introduction

Nowadays, we can easily find many optimization problems that require a huge amount of computational time. Some of them cannot even be solved optimally on existing computers in a reasonable time; many such problems are NP-hard. No polynomial-time algorithm is currently known for any NP-hard problem, and it is unknown whether one exists. Therefore, to solve an NP-hard problem of arbitrary size it is common to use techniques such as metaheuristics (Glover and Kochenberger, 2003). In computer science, a metaheuristic can be defined as an optimization method that applies an iterative process to improve the quality of candidate solutions according to a given fitness function. Metaheuristics can also be adapted easily to many different problems. These techniques do not guarantee finding the optimal solution; instead, they find quasi-optimal solutions in a reasonable time. Within the vast world of metaheuristics lies the concept of swarm intelligence, which has gained considerable momentum in recent years. Swarm intelligence is the discipline that deals with systems composed of decentralized, self-organized individuals, normally inspired by natural phenomena. In particular, it takes advantage of the collective behavior of individuals that interact with each other and with their environment. These algorithms can be divided into two groups: those based on animal behavior and those based on physical or other natural phenomena.
In recent years, many algorithms based on these collective behaviors have been successfully applied to problems in several fields. For this reason, we have decided to analyze the behavior of swarm intelligence algorithms in this work, comparing two novel algorithms: the Artificial Bee Colony (ABC) algorithm (Karaboga, 2005), an optimization algorithm based on the intelligent foraging behavior of honey bee swarms; and the Gravitational Search Algorithm (GSA) (Rashedi et al., 2009), a new optimization algorithm based on the law of gravity and mass interactions. In this way, we compare one algorithm from each group: one based on animal behavior (ABC) and another based on physical and natural phenomena (GSA).

The main objective of this work is to analyze which kind of swarm intelligence algorithm solves the Motif Discovery Problem (MDP) better. The MDP is an NP-hard optimization problem as defined in the literature (Maier, 1978; Rivière et al., 2008), applied to the specific task of discovering novel Transcription Factor Binding Sites (TFBSs) in DNA sequences (D'haeseleer, 2006). Predicting common patterns, or motifs, is one of the most important sequence analysis problems, and it has not yet been solved efficiently. In addition, in this work we have extended the formulation of this problem with several constraints that adapt the solution search to real biology. In biology, finding and decoding the true meaning of these DNA patterns can help us explain the complexity and development of living organisms. The MDP maximizes three conflicting objectives: motif length, support, and similarity. We therefore have to use multiobjective techniques to solve it and adapt our algorithms to the multiobjective context. Accordingly, the swarm intelligence based algorithms compared in this paper are the Multiobjective ABC (MOABC) algorithm (González-Álvarez et al., 2011a) and the Multiobjective GSA (MO-GSA) (González-Álvarez et al., 2011b), based on the single-objective ABC and GSA algorithms, respectively. To present the results, we have used typical multiobjective indicators such as the hypervolume (Zitzler and Thiele, 1999) and the coverage relation (Zitzler et al., 2000), which facilitates future comparisons. We also want to emphasize that, to ensure that the solutions found by our proposals are biologically relevant, we have carried out several analyses using biological indicators such as the Sensitivity, the Positive Predictive Value, the Performance Coefficient, and the Correlation Coefficient.
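The two multiobjective indicators mentioned above can be illustrated with a small sketch. It assumes that every objective is maximized (as with motif length, support, and similarity) and that the hypervolume is measured against a user-chosen reference point. All function names are hypothetical, and the grid-based hypervolume is a simple exact method suitable only for small fronts; it is not claimed to be the procedure used in the paper.

```python
from itertools import product


def weakly_dominates(a, b):
    """True if a is at least as good as b in every (maximized) objective."""
    return all(x >= y for x, y in zip(a, b))


def dominates(a, b):
    """True if a weakly dominates b and is strictly better in some objective."""
    return weakly_dominates(a, b) and any(x > y for x, y in zip(a, b))


def nondominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def coverage(front_a, front_b):
    """Coverage relation C(A, B): fraction of B weakly dominated by some point of A."""
    covered = sum(
        1 for b in front_b if any(weakly_dominates(a, b) for a in front_a)
    )
    return covered / len(front_b)


def hypervolume(front, reference):
    """Volume dominated by the front and bounded below by the reference point.

    The objective space is cut into grid cells at the coordinates of the
    points; a cell counts if some point weakly dominates its upper corner.
    """
    dims = len(reference)
    cuts = [
        sorted({reference[d]} | {p[d] for p in front if p[d] > reference[d]})
        for d in range(dims)
    ]
    total = 0.0
    for cell in product(*(range(1, len(c)) for c in cuts)):
        upper = tuple(cuts[d][i] for d, i in enumerate(cell))
        if any(weakly_dominates(p, upper) for p in front):
            volume = 1.0
            for d, i in enumerate(cell):
                volume *= cuts[d][i] - cuts[d][i - 1]
            total += volume
    return total


if __name__ == "__main__":
    # Hypothetical objective vectors: (motif length, support, similarity).
    front_a = nondominated([(10, 5, 0.8), (12, 4, 0.7), (9, 4, 0.6)])
    front_b = nondominated([(11, 4, 0.7), (8, 5, 0.9)])
    print("C(A, B) =", coverage(front_a, front_b))
    print("C(B, A) =", coverage(front_b, front_a))
    print("HV(A)   =", hypervolume(front_a, (0, 0, 0)))
```

A larger hypervolume indicates a front that dominates more of the objective space, while the coverage relation compares two fronts directly; note that C(A, B) and C(B, A) are not complementary, so both directions are usually reported.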
Overall, our main objectives in this work are to compare novel swarm intelligence based techniques for solving a well-known bioinformatics problem that has not yet been solved efficiently, to incorporate new rules and constraints that adapt the MDP to the real biological world, and to obtain good and relevant results.
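The biological indicators named above (Sensitivity, Positive Predictive Value, Performance Coefficient, and Correlation Coefficient) are commonly computed from nucleotide-level confusion counts. The sketch below assumes those standard definitions; the excerpt does not state the exact formulas used in the paper, and the function names and the position-set representation are hypothetical.

```python
import math


def confusion_counts(known, predicted, total_positions):
    """Nucleotide-level counts from sets of known and predicted site positions."""
    tp = len(known & predicted)   # predicted positions inside a known site
    fp = len(predicted - known)   # predicted positions outside any known site
    fn = len(known - predicted)   # known-site positions that were missed
    tn = total_positions - tp - fp - fn
    return tp, fp, fn, tn


def biological_indicators(tp, fp, fn, tn):
    """Sensitivity, PPV, performance coefficient, and correlation coefficient."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    performance = tp / (tp + fn + fp) if tp + fn + fp else 0.0
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    correlation = (tp * tn - fn * fp) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "ppv": ppv,
        "performance_coefficient": performance,
        "correlation_coefficient": correlation,
    }


if __name__ == "__main__":
    # Hypothetical example: a known site at positions 10..19 and a
    # prediction at positions 12..21 in a 100-nucleotide sequence.
    counts = confusion_counts(set(range(10, 20)), set(range(12, 22)), 100)
    for name, value in biological_indicators(*counts).items():
        print(f"{name}: {value:.4f}")
```

In this example the prediction overlaps the known site in 8 of 10 positions, so Sensitivity and Positive Predictive Value are both 0.8, while the Performance Coefficient is lower because it penalizes missed and spurious positions together.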

In the remainder of the paper, Section 2 briefly reviews a number of existing works dedicated to motif discovery. Section 3 then describes the MDP in detail. Section 4 presents the metaheuristics compared and explains how they operate. Section 5 covers the experimental methodology, the instances used, and the best configuration of each algorithm. In Section 6, we analyze the behavior of our proposals, comparing them with each other and with two standard multiobjective evolutionary algorithms. In Section 7, we also compare the two proposed swarm-based algorithms with other previously proposed metaheuristics. Section 8 compares the motifs discovered by our algorithms with those predicted by 14 other well-known biological methods. Finally, we outline the conclusions of this paper.

Section snippets

Related work

In this section we present some of the research literature related to the MDP. First, we describe some of the latest research that applies evolutionary computation to discover motifs in DNA sequences. Next, we organize and analyze the biological methods most commonly used to solve this problem.

There are many proposals based on evolutionary techniques for finding DNA motifs; one example is FMGA (Liu et al., 2004), a genetic algorithm based on the SAGA operators (Notredame

Motif discovery problem

Gene expression is the process by which a gene is transcribed to form an RNA sequence, which is then used to produce the corresponding protein sequence. This process starts when a macromolecule called a Transcription Factor (TF) binds to a short subsequence in the promoter region of the gene, called a TFBS (Zare-Mirakabad et al., 2009). Finding TFBSs in DNA sequences (the problem known as the MDP) is important for uncovering the underlying regulatory relationship and understanding the

Description of the algorithms

In this section we detail the representation of the individuals, and we include a brief description of the algorithms compared in this work.

Methodology

In this section we explain the methodology followed to configure each algorithm, detail the data sets used in our experiments, and show the results obtained by our algorithms. In the following sections we compare these results with those obtained by several standard multiobjective algorithms and by 14 other well-known biological methods. We also include a statistical analysis of the results to ensure their statistical

Algorithm comparisons

As mentioned above, we have structured the comparisons into three groups: the first uses “Generic” instances, the second uses “Real” instances, and the last uses “Markov” instances. We now analyze the behavior of the algorithms on the “Generic” instances. The first analysis was made by using the hypervolume indicator. The hypervolume indicator (While et al., 2006) is a measure of the quality of a set $P = \{p^{(1)}, p^{(2)}, \dots, p^{(n)}\}$ of $n$ nondominated objective vectors produced in

Comparison with previous works

In this section we compare the results obtained by our two swarm-based algorithms with those achieved by our previously proposed metaheuristics. The first techniques that we applied to solve the MDP as a multiobjective optimization problem were Differential Evolution with Pareto Tournaments (DEPT, González-Álvarez et al., 2010a) and Multiobjective Variable Neighbourhood Search (MO-VNS, González-Álvarez et al., 2010b). At present, after better understanding the biological aspects of the

Comparison with other biological methods

In this section we analyze the motifs obtained by our two proposals, MOABC and MO-GSA algorithms. To that end we have compared the best motifs (nondominated solutions) discovered by both heuristics with the best solutions predicted by 14 well-known biological methods in motif discovery. Thus, we demonstrate that the motifs predicted by MOABC and MO-GSA have an important biological relevance. The biological methods compared in this section are AlignACE (Roth et al., 1998), ANN_Spec (Workman and

Conclusions and future work

In this paper, we have compared two novel swarm intelligence based algorithms, ABC and GSA, for solving the MDP. Moreover, we have adapted these algorithms to the multiobjective context, resulting in two new algorithms named MOABC and MO-GSA. This work differs from previous approaches to the MDP because our new constraints focus on real-world aspects of biology, e.g., the motif complexity concept. In this work, we have also combined computational and biological aspects, demonstrating through several

Acknowledgments

This work was partially funded by the Spanish Ministry of Science and Innovation and ERDF (the European Regional Development Fund), under the contract TIN2008-06491-C04-04 (the M* project). Thanks also to the Fundación Valhondo for the financial support offered to David L. González-Álvarez to carry out this research.

References (49)

M. Kaya. MOGAMOD: Multi-objective genetic algorithm for motif discovery. Expert Syst. Appl. (2009)
N. Mladenovic et al. Variable neighborhood search. Comput. Oper. Res. (1997)
E. Rashedi et al. GSA: a gravitational search algorithm. Inf. Sci. (2009)
R. Rivière et al. Shuffling biological sequences with motif constraints. J. Discrete Algorithms (2008)
J. van Helden et al. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. (1998)
W. Ao et al. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science (2004)
T.L. Bailey et al. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Mach. Learn. (1995)
Che, Y., Song, D., Rasheed, K., 2005. MDGA: Motif discovery using a genetic algorithm. In: Proceedings of the 2005...
K. Deb. Multi-objective Optimization Using Evolutionary Algorithms. (2001)
K. Deb et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evolut. Comput. (2002)
P. D'haeseleer. What are DNA sequence motifs? Nat. Biotechnol. (2006)
E. Eskin et al. Finding composite regulatory patterns in DNA sequences. Bioinformatics (2002)
A.V. Favorov et al. A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length. Bioinformatics (2005)
G.B. Fogel et al. Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucl. Acids Res. (2004)
G.B. Fogel et al. Evolutionary computation for discovery of composite transcription factor binding sites. Nucl. Acids Res. (2008)
M.C. Frith et al. Finding functional sequence elements by multiple local alignment. Nucl. Acids Res. (2004)
F. Glover et al. Handbook of Metaheuristics. (2003)
González-Álvarez, D.L., Vega-Rodríguez, M.A., Gómez-Pulido, J.A., Sánchez-Pérez, J.M., 2010a. Solving the motif...
González-Álvarez, D.L., Vega-Rodríguez, M.A., Gómez-Pulido, J.A., Sánchez-Pérez, J.M., 2010b. A multiobjective variable...
González-Álvarez, D.L., Vega-Rodríguez, M.A., Gómez-Pulido, J.A., Sánchez-Pérez, J.M., 2011a. Finding motifs in DNA...
González-Álvarez, D.L., Vega-Rodríguez, M.A., Gómez-Pulido, J.A., Sánchez-Pérez, J.M., 2011b. Applying a multiobjective...
D.L. González-Álvarez et al. Predicting DNA motifs by using evolutionary multiobjective optimization. IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. (2011)
G.Z. Hertz et al. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (1999)
G.Z. Hertz et al. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. (1990)

