Hybrid multiobjective artificial bee colony for multiple sequence alignment

doi:10.1016/j.asoc.2015.12.034

Applied Soft Computing

Volume 41, April 2016, Pages 157-168

https://doi.org/10.1016/j.asoc.2015.12.034

Highlights

• We propose multiobjective evolutionary computation for multiple sequence alignment.
• We optimize two objective functions widely used in the literature: WSP and TC.
• We hybridize ABC with Kalign2 to obtain more accurate alignments.
• We compare our proposal with 13 aligners from the bioinformatics field.

Abstract

In the bioinformatics community, it is important to find an accurate simultaneous alignment of diverse biological sequences that are assumed to share an evolutionary relationship. From the alignment, sequence homology is inferred, and the shared evolutionary origins of the sequences are extracted through phylogenetic analysis. This problem is known as the multiple sequence alignment (MSA) problem. Several approaches have been proposed in the literature to solve it, such as progressive alignment methods, consistency-based algorithms, and genetic algorithms (GAs). In this work, we propose a hybrid multiobjective evolutionary algorithm based on the behaviour of honey bees for solving the MSA problem: the hybrid multiobjective artificial bee colony (HMOABC) algorithm. HMOABC considers two objective functions with the aim of preserving the quality and consistency of the alignment: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score. To assess the accuracy of HMOABC, we have used the BAliBASE benchmark (version 3.0), which, according to its developers, presents more challenging test cases representing the real problems encountered when aligning large sets of complex sequences. Our multiobjective approach has been compared with 13 well-known methods in the bioinformatics field and with 6 other evolutionary algorithms published in the literature.

Introduction

Any living species is represented by its biological sequence; therefore, an accurate alignment of several biological sequences is critical for finding evolutionary relationships among different species [1], [2]. This problem is known in the literature as the multiple sequence alignment (MSA) problem [3]. MSAs not only allow us to infer phylogenetic relationships among living species, but also provide biological facts about proteins, since the most conserved regions are biologically significant [4]. Furthermore, an accurate MSA is highly valuable for formulating and testing hypotheses about protein 3-D structure and function; that is to say, it helps us to detect which regions of a gene are susceptible to mutation and which can have one residue replaced by another without changing the function.

The natural formulation of the MSA problem, in computational terms, is to define a model of sequence evolution that assigns probabilities to all possible elementary sequence edits and then to seek an optimal directed graph in which edges represent edits and terminal nodes represent the observed sequences [5]. Unfortunately, in biologically realistic models it is not possible to determine an optimal directed graph; therefore, we need to turn to approximate heuristics. A well-known heuristic is to optimize the sum of the alignment scores between each pair of sequences (the SP score).

The MSA problem may be defined as an NP-hard optimization problem [6] which can be solved by using dynamic programming with a time and space complexity of O(k2^kL^k) [7] when aligning k sequences of length L. Although the use of dynamic programming guarantees mathematically optimal alignments, the problem space increases significantly with the number of sequences and with the length. In order to overcome this drawback, several heuristics have been proposed in the literature. We can classify them into two main categories: progressive and iterative alignments.
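To give a feel for how quickly the dynamic-programming cost quoted above grows, the following minimal sketch evaluates the term k·2^k·L^k for a few values of k. The function name `dp_cost` and the choice L = 100 are illustrative assumptions, and constant factors hidden by the O-notation are ignored.

```python
def dp_cost(k: int, length: int) -> int:
    """Return k * 2**k * length**k, the term inside O(k 2^k L^k)."""
    return k * 2**k * length**k


if __name__ == "__main__":
    # L = 100 is an arbitrary illustrative sequence length.
    for k in (2, 3, 4, 5):
        print(f"k={k}: about {dp_cost(k, 100):.2e} elementary steps")
```

Each additional sequence multiplies the term by roughly 2L, which is why exact dynamic programming becomes impractical beyond a handful of sequences.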

Progressive alignment is the most widely used technique for multiple sequence alignment in the literature. It starts by aligning the evolutionarily closest sequences and then continues with the more distant ones until all the sequences are aligned. This method has the advantage of being simple and very fast; however, its accuracy is not guaranteed. Its main disadvantage is that it can become trapped in suboptimal alignments. Among the main multiple sequence aligners published in the literature that make use of progressive alignment are Clustal W [8], Clustal Ω [9], Tree-based Consistency Objective Function For alignment Evaluation (T-Coffee) [10], PRANK [11], Fast Statistical Alignment (FSA) [12], and Kalign [13].

Iterative alignment techniques use one method to produce an initial alignment (such as a progressive method) and then refine this initial alignment over successive iterations until a given stopping criterion is met. The main idea behind this technique is therefore to consider the initial alignment as suboptimal and then refine it until no further improvements can be achieved. In the literature we find several approaches that take advantage of iterative refinement in order to obtain more accurate alignments. Among the main ones are MUltiple Sequence Comparison by Log-Expectation (MUSCLE) [5], Multiple Alignment using Fast Fourier Transform (MAFFT) [14], PROBabilistic CONSistency-based multiple sequence alignment (ProbCons) [15], MSAProbs [16], and MUMMALS [17]. Genetic algorithms and evolutionary computation have also been considered for solving the multiple sequence alignment problem. We find diverse genetic algorithms (GAs) in the literature: Sequence Alignment by Genetic Algorithm (SAGA) [18], Multiple Sequence Alignment Genetic Algorithm (MSA-GA) [19], Rubber Band Technique Genetic Algorithm (RBT-GA) [20], Vertical Decomposition Genetic Algorithm (VDGA) [21], Genetic Algorithm for Multiple Sequence Alignment using Progressive Alignment Method (GAPAM) [22], and Multiobjective Optimizer for Sequence Alignments based on Structural Evaluations (MO-SAStrE) [23]. In addition, we find other single-objective approaches based on swarm intelligence, such as Artificial Bee Colony (ABC) [24], [25], Ant Colony Optimization algorithm (ACO) [26], [27], or Immune Artificial System Algorithm (IMSA) [28].

In recent years, some efforts have been made to incorporate structural information in order to obtain more accurate alignments. Basically, these methods use Protein Data Bank (PDB) structures as templates to guide the alignment of a given set of unaligned sequences using structure-based sequence alignment methods. Two examples of structure-based methods are 3D-COFFEE [29] and MO-SAStrE. The main drawback of these methods is the limited availability of PDB structures.

One of the main contributions of this work is to use multiobjective evolutionary computation to solve the MSA problem. In the literature, we find evolutionary approaches that optimize the sum-of-pairs function (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], ABC [24], [25], ACO [27], or IMSA [28]) or the column score (RBT-GA [20], ACO [26]). In [30], a multiobjective evolutionary algorithm was implemented with the aim of assembling previously aligned sequences, trying to optimize jointly the sum-of-pairs function and the column score.

In this work, we also simultaneously optimize two of the most widely used objective functions in the literature: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score. These two objective functions focus on preserving the quality and the consistency of the alignment, respectively.
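The two objectives named above can be illustrated with a minimal sketch. It is not the paper's exact formulation, which is not reproduced in this excerpt. The names `tc_score`, `pair_score`, and `wsp_score` are hypothetical, and so are the toy match/mismatch scores (a real aligner would use a substitution matrix), the gap penalties, the uniform pair weights, and the convention that the first gap of a run pays the opening penalty while each later gap pays the extension penalty.

```python
from itertools import combinations

GAP = "-"


def tc_score(alignment):
    """Count columns in which every row holds the same non-gap residue."""
    return sum(
        1
        for column in zip(*alignment)
        if column[0] != GAP and all(c == column[0] for c in column)
    )


def pair_score(a, b, match=1.0, mismatch=-1.0, gap_open=2.0, gap_extend=1.0):
    """Score one aligned pair with toy substitution scores and affine gaps."""
    score = 0.0
    gap_row = None  # row (0 or 1) that owns the gap run currently open
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # a gap-gap column says nothing about this pair
        if x == GAP or y == GAP:
            row = 0 if x == GAP else 1
            # Assumed convention: open penalty once, extension afterwards.
            score -= gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


def wsp_score(alignment, weights=None, **penalties):
    """Weighted sum of pair scores; weights default to 1 for every pair."""
    pairs = list(combinations(range(len(alignment)), 2))
    if weights is None:
        weights = {pair: 1.0 for pair in pairs}
    return sum(
        weights[(i, j)] * pair_score(alignment[i], alignment[j], **penalties)
        for i, j in pairs
    )


if __name__ == "__main__":
    toy = ["AC-GT", "ACGGT", "A--GT"]
    print("TC :", tc_score(toy))
    print("WSP:", wsp_score(toy))
```

Both functions are to be maximized in this sketch. Adding gaps can raise one score while lowering the other, which is the kind of trade-off a multiobjective formulation is meant to expose.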

In addition, we apply a well-known swarm intelligence approach, the Artificial Bee Colony (ABC) algorithm [31], adapted to handle multiobjective problems; we refer to it as MOABC. The ABC algorithm was developed by D. Karaboga, inspired by the foraging and dance of honey bee colonies [31]. Swarm algorithms such as ABC have been successfully applied to real-world problems in different domains, such as the design and manufacturing problem [32], selection of cutting parameters in machining operations [33], the structural damage detection problem [34], image segmentation problems [35], image classification [36], abnormal brain detection [37], and the path planning problem [38].

As we have mentioned, several Genetic Algorithms (GAs) have been proposed in the literature for solving the MSA problem (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], or MO-SAStrE [23]). Whereas GAs take the information from 2–3 parents to generate a new solution, algorithms based on swarm intelligence produce new individuals by taking into account information not only from their parents, but also from the rest of the population. The effectiveness of ABC compared with traditional GAs has been widely studied in the literature [39], [40].

In the ABC algorithm, we find three types of bees: employed bees, onlooker bees, and scout bees. In the canonical ABC, an employed bee becomes a scout if it reaches a certain number of iterations with no improvement, which means that this bee is replaced by a new random solution. In our proposal, when an employed bee becomes a scout, its stagnated solution (alignment) is processed by the fast and accurate Kalign [13], which avoids stagnation of the algorithm and promotes the diversity of the population. In this way, the multiobjective ABC algorithm proposed in this paper is hybridized with the progressive, fast, and accurate Kalign to boost the accuracy and effectiveness of the algorithm; we refer to it as the hybrid multiobjective artificial bee colony (HMOABC). In [41], a hybrid multi-objective artificial bee colony is proposed for burdening optimization of copper strip production. The main difference between the approach proposed in [41] and ours lies in the use of a deterministic heuristic (Kalign) in the scout phase of the ABC algorithm.
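The hybrid scout phase described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `scout_phase`, `trials`, `limit`, and `refine` are hypothetical names, and `realign_stub` is only a placeholder standing in for the external aligner (it does not call Kalign and makes no claim about Kalign's interface).

```python
from typing import Callable, List

Alignment = List[str]


def scout_phase(
    colony: List[Alignment],
    trials: List[int],
    limit: int,
    refine: Callable[[Alignment], Alignment],
) -> None:
    """Replace each stagnated solution by the output of `refine`, in place.

    trials[i] counts consecutive iterations without improvement for
    colony[i]; the canonical ABC would draw a random solution here.
    """
    for i, stalled in enumerate(trials):
        if stalled >= limit:
            colony[i] = refine(colony[i])
            trials[i] = 0  # the replaced solution starts a fresh count


def realign_stub(alignment: Alignment) -> Alignment:
    """Placeholder for an external aligner: drop gaps, then right-pad."""
    raw = [row.replace("-", "") for row in alignment]
    width = max(len(row) for row in raw)
    return [row.ljust(width, "-") for row in raw]


if __name__ == "__main__":
    colony = [["AC-GT", "ACGGT"], ["A-CG", "AC-G"]]
    trials = [0, 5]
    scout_phase(colony, trials, limit=5, refine=realign_stub)
    print(colony, trials)
```

Passing the refinement step as a callable keeps the colony logic independent of whichever deterministic heuristic is plugged in.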

The remainder of this paper is organized as follows. Section 2 describes the multiple sequence alignment problem. A detailed description of HMOABC is presented in Section 3. Section 4 is devoted to the analysis of the experiments carried out and to a comparison with other approaches published in the literature. Finally, Section 5 summarizes the conclusions of the paper and discusses possible lines of future work.

Section snippets

Multiple sequence alignment

Multiple sequence alignment (MSA) is simply an alignment of more than two sequences and is considered as an NP-hard optimization problem [42]. The MSA problem can be defined as follows:

Given a set of sequences S: {s₁, s₂, …, s_k} of lengths |s₁|, |s₂|, …, |s_k| defined over an alphabet Σ, for example Σ_DNA = {A, C, G, T} or Σ_protein = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.

A multiple sequence alignment of S is defined as S′: {s′₁, s′₂, …, s′_k}, where the length of all the k
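Because the definition above is cut off in this excerpt, the following sketch falls back on the standard textbook conditions for an alignment rather than on the paper's exact wording: all aligned rows have the same length, and removing the gap symbol from row i recovers the original sequence sᵢ. The function name `is_valid_alignment` and the gap symbol "-" are assumptions.

```python
def is_valid_alignment(sequences, aligned, gap="-"):
    """Check the two standard conditions for an alignment of `sequences`."""
    if len(sequences) != len(aligned):
        return False
    # Condition 1: every aligned row has the same length.
    if len({len(row) for row in aligned}) > 1:
        return False
    # Condition 2: stripping gaps from each row gives back its sequence.
    return all(
        row.replace(gap, "") == seq for seq, row in zip(sequences, aligned)
    )


if __name__ == "__main__":
    originals = ["ACGT", "AGT"]
    print(is_valid_alignment(originals, ["ACGT", "A-GT"]))
    print(is_valid_alignment(originals, ["ACGT", "AGT"]))
```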

Hybrid multiobjective artificial bee colony

The Artificial Bee Colony (ABC) algorithm is a swarm-based evolutionary algorithm created by Dervis Karaboga [31] and it is inspired by the intelligent behaviour of honey bees. In this algorithm, the position of a food source represents a feasible solution, that is to say, a feasible alignment; and the amount of nectar indicates the quality of the food source (fitness value).

In ABC, the population of individuals is known as colony of bees, in which we find three types of bees: employed, onlooker

Experimental results

In our experiments, we have chosen the well-known Benchmark ALignment dataBASE (BAliBASE), which was developed to evaluate and compare multiple alignment methods. In particular, we have used version 3.0 [45], which contains 218 alignments. It is freely available to download. The BAliBASE 3.0 benchmark is divided into six different groups or families: RV11, RV12, RV20, RV30, RV40, and RV50; each group presents different biological characteristics, for further

Conclusions and future work

In this paper, we have introduced a novel multiobjective evolutionary algorithm based on swarm intelligence for the multiple sequence alignment problem. By using our approach (HMOABC), we are able to obtain a set of accurate and consistent solutions by simultaneously maximizing two well-known objective functions: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score.
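When two objectives are maximized at once, the result is usually a set of trade-off solutions rather than a single best one. The sketch below shows a plain Pareto-dominance filter over (WSP, TC) pairs. The names `dominates` and `nondominated` and the sample scores are illustrative assumptions, and the ranking mechanism actually used inside HMOABC is not shown in this excerpt.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(
        x > y for x, y in zip(a, b)
    )


def nondominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Made-up (WSP, TC) scores for four candidate alignments.
    scores = [(10.0, 3), (12.0, 2), (9.0, 3), (12.0, 1)]
    print(nondominated(scores))
```

This quadratic-time filter is adequate for small populations; faster nondominated sorting schemes exist for larger ones.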

The proposed approach is inspired by the foraging behaviour of

Acknowledgments

Álvaro Rubio-Largo is supported by the post-doctoral fellowship SFRH/BPD/100872/2014 granted by Fundação para a Ciência e a Tecnologia (FCT), Portugal. Furthermore, this work has been partially funded by the Spanish Ministry of Science and Innovation and ERDF (the European Regional Development Fund), under contract TIN2012-30685 (BIO project).

References (48)

D.J. Bacon et al. Multiple sequence alignment. J. Mol. Biol. (1986)
M. Waterman et al. Some biological sequence metrics. Adv. Math. (1976)
C. Notredame et al. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. (2000)
O. O'Sullivan et al. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. (2004)
A.R. Yildiz. A new hybrid artificial bee colony algorithm for robust optimal design and manufacturing. Appl. Soft Comput. (2013)
A.R. Yildiz. Optimization of cutting parameters in multi-pass turning using artificial bee colony-based approach. Inf. Sci. (2013)
D. Karaboga et al. On the performance of artificial bee colony (ABC) algorithm. Appl. Soft Comput. (2008)
D. Karaboga et al. A comparative study of artificial bee colony algorithm. Appl. Math. Comput. (2009)
H. Zhang et al. A hybrid multi-objective artificial bee colony algorithm for burdening optimization of copper strip production. Appl. Math. Model. (2012)
R. Doolittle. Similar amino acid sequences: chance or common ancestry? Science (1981)
D. Feng et al. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. (1987)
C. Notredame. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics (2002)
R.C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004)
L. Wang et al. On the complexity of multiple sequence alignment. J. Comput. Biol. (1994)
J.D. Thompson et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994)
F. Sievers et al. Fast, Scalable Generation of High-quality Protein Multiple Sequence Alignments Using (2011)
A. Loytynoja et al. An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. U. S. A. (2005)
R.K. Bradley et al. Fast statistical alignment. PLoS Comput. Biol. (2009)
T. Lassmann et al. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. (2009)
K. Katoh et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. (2002)
C.B. Do et al. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. (2005)
Y. Liu et al. MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics (2010)
J. Pei et al. MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res. (2006)
C. Notredame et al. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. (1996)

