Hybrid multiobjective artificial bee colony for multiple sequence alignment
Introduction
Any living species is represented by its biological sequence; therefore, an accurate alignment of several biological sequences is critical for finding evolutionary relationships among different species [1], [2]. This problem is known in the literature as the multiple sequence alignment (MSA) problem [3]. MSAs not only allow us to infer phylogenetic relationships among living species, but also provide biological facts about proteins, since the most conserved regions are biologically significant [4]. Furthermore, an accurate MSA is highly valuable for formulating and testing hypotheses about protein 3-D structure and function; that is to say, it helps us to detect which regions of a gene are susceptible to mutation and which can have one residue replaced by another without changing the function.
The natural formulation of the MSA problem, in computational terms, is to define a model of sequence evolution that assigns probabilities to all possible elementary sequence edits and then to seek an optimal directed graph in which edges represent edits and terminal nodes represent the observed sequences [5]. Unfortunately, in biologically realistic models it is not possible to determine an optimal directed graph; therefore, we need to turn to approximate heuristics. A well-known heuristic is to optimize the sum-of-pairs (SP) score, that is, the sum of the alignment scores over every pair of sequences.
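The SP score mentioned above can be illustrated with a minimal sketch. The match, mismatch, and gap values below are assumptions chosen for illustration only (real aligners typically use a substitution matrix), and the function names are hypothetical.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one aligned column for one pair of sequences (illustrative values)."""
    if a == "-" and b == "-":
        return 0  # a gap aligned with a gap contributes nothing
    if a == "-" or b == "-":
        return gap  # a residue aligned with a gap
    return match if a == b else mismatch


def sp_score(alignment: list[str]) -> int:
    """Sum the pairwise column scores over every pair of aligned sequences."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(
        pair_score(x, y)
        for row_a, row_b in combinations(alignment, 2)
        for x, y in zip(row_a, row_b)
    )
```

For example, `sp_score(["AC-T", "ACGT", "A-GT"])` adds the scores of the three sequence pairs column by column.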
The MSA problem may be defined as an NP-hard optimization problem [6], which can be solved by dynamic programming with a time and space complexity of $O(k2^{k}L^{k})$ [7] when aligning k sequences of length L. Although dynamic programming guarantees mathematically optimal alignments, the problem space grows rapidly with the number of sequences and with their length. In order to overcome this drawback, several heuristics have been proposed in the literature. We can classify them into two main categories: progressive and iterative alignments.
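To give a sense of how quickly this bound grows, the following worked arithmetic evaluates the term $k2^{k}L^{k}$ for two cases. The value L = 100 and the two values of k are assumptions chosen only for illustration, and constant factors are ignored.

```latex
\begin{align*}
k = 3:&\quad k\,2^{k}L^{k} = 3 \cdot 2^{3} \cdot 100^{3} = 2.4 \times 10^{7},\\
k = 10:&\quad k\,2^{k}L^{k} = 10 \cdot 2^{10} \cdot 100^{10} = 1.024 \times 10^{24}.
\end{align*}
```

Going from 3 to 10 sequences of the same length therefore raises the term by roughly seventeen orders of magnitude, which is why exact dynamic programming is impractical beyond a handful of sequences.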
Progressive alignment is the most widely used technique for multiple sequence alignment in the literature. It starts by aligning the evolutionarily closest sequences and then continues with the more distant ones until all the sequences are aligned. This method has the advantage of being simple and very fast; however, it does not guarantee any particular level of accuracy. Its main disadvantage is that it can become trapped in suboptimal alignments. Among the main multiple sequence aligners published in the literature that make use of progressive alignment are Clustal W [8], Clustal Ω [9], Tree-based Consistency Objective Function For alignment Evaluation (T-Coffee) [10], PRANK [11], Fast Statistical Alignment (FSA) [12], and Kalign [13].
Iterative alignment techniques use one method (such as a progressive method) to produce an initial alignment and then refine it over successive iterations until a given stopping criterion is met. The main idea behind this technique is therefore to consider the initial alignment as suboptimal and then refine it until no further improvements can be achieved. In the literature we find several approaches that take advantage of iterative refinement in order to obtain more accurate alignments; among the main ones are MUltiple Sequence Comparison by Log-Expectation (MUSCLE) [5], Multiple Alignment using Fast Fourier Transform (MAFFT) [14], PROBabilistic CONSistency-based multiple sequence alignment (ProbCons) [15], MSAProbs [16], and MUMMALS [17]. Genetic algorithms and evolutionary computation have also been considered for solving the multiple sequence alignment problem. We find diverse genetic algorithms (GAs) in the literature: Sequence Alignment by Genetic Algorithm (SAGA) [18], Multiple Sequence Alignment Genetic Algorithm (MSA-GA) [19], Rubber Band Technique Genetic Algorithm (RBT-GA) [20], Vertical Decomposition Genetic Algorithm (VDGA) [21], Genetic Algorithm for Multiple Sequence Alignment using Progressive Alignment Method (GAPAM) [22], and Multiobjective Optimizer for Sequence Alignments based on Structural Evaluations (MO-SAStrE) [23]. In addition, we find other single-objective approaches based on swarm intelligence, such as Artificial Bee Colony (ABC) [24], [25], Ant Colony Optimization algorithm (ACO) [26], [27], or Immune Artificial System Algorithm (IMSA) [28].
In recent years, some effort has been devoted to incorporating structural information in order to obtain more accurate alignments. Basically, these methods use Protein Data Bank (PDB) structures as templates to guide the alignment of a given set of unaligned sequences using structure-based sequence alignment methods. Two examples of structure-based methods are 3D-COFFEE [29] and MO-SAStrE. The main drawback of these methods is the limited availability of PDB structures.
One of the main contributions of this work is to use multiobjective evolutionary computation to solve the MSA problem. In the literature, we find evolutionary approaches that optimize the sum-of-pairs function (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], ABC [24], [25], ACO [27], or IMSA [28]) or the column score (RBT-GA [20], ACO [26]). In [30], a multiobjective evolutionary algorithm was implemented with the aim of assembling previously aligned sequences, trying to optimize jointly the sum-of-pairs function and the column score.
In this work, we also optimize at the same time two of the most widely used objective functions in the literature: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score. The two objective functions focus on preserving the quality and the consistency of the alignment, respectively.
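The idea of optimizing two objectives at once can be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes that a totally conserved column is one in which every sequence carries the same residue and no gap, and it uses the standard Pareto dominance test for objectives to be maximized. The function names are hypothetical.

```python
from typing import Sequence


def tc_columns(alignment: Sequence[str]) -> int:
    """Count columns where all sequences share the same residue (no gaps)."""
    return sum(
        1
        for column in zip(*alignment)
        if column[0] != "-" and len(set(column)) == 1
    )


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if objective vector u Pareto-dominates v (maximization).

    u dominates v when it is at least as good in every objective
    and strictly better in at least one.
    """
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))
```

With objective vectors of the form (WSP, TC), `dominates((10, 3), (8, 3))` holds, whereas neither of `(10, 2)` and `(8, 3)` dominates the other, so both would be kept as trade-off solutions.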
In addition, we apply a well-known swarm intelligence approach, the Artificial Bee Colony (ABC) algorithm [31], adapted to handle multiobjective problems; we refer to it as MOABC. The ABC algorithm was developed by D. Karaboga, inspired by the foraging and dance behaviour of honey bee colonies [31]. Swarm algorithms such as ABC have been successfully applied to solve real-world problems in different domains, such as design and manufacturing [32], selection of cutting parameters in machining operations [33], structural damage detection [34], image segmentation [35], image classification [36], abnormal brain detection [37], and path planning [38].
As we have mentioned, several Genetic Algorithms (GAs) have been proposed in the literature for solving the MSA problem (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], or MO-SAStrE [23]). Whereas GAs take information from 2–3 parents to generate a new solution, algorithms based on swarm intelligence produce new individuals by taking into account information not only from their parents, but also from the rest of the population. The effectiveness of ABC compared with traditional GAs has been widely studied in the literature [39], [40].
In the ABC algorithm, we find three types of bees: employed bees, onlooker bees, and scout bees. In the canonical ABC, an employed bee becomes a scout if it reaches a certain number of iterations with no improvement, which means that this bee is replaced by a new random solution. In our proposal, when an employed bee becomes a scout, its stagnated solution (alignment) is processed by the fast and accurate Kalign [13], which avoids the stagnation of the algorithm and promotes the diversity of the population. In this way, the multiobjective ABC algorithm proposed in this paper is hybridized with the progressive, fast, and accurate Kalign to boost the accuracy and effectiveness of the algorithm; we refer to it as hybrid multiobjective artificial bee colony (HMOABC). In [41], a hybrid multi-objective artificial bee colony is proposed for burdening optimization of copper strip production. The main difference between the approach proposed in [41] and ours lies in the use of a deterministic heuristic (Kalign) in the scout phase of the ABC algorithm.
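The scout step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stagnation counters, the `limit` threshold, and the `realign` callable are hypothetical names, and `drop_all_gap_columns` is a trivial stand-in used to keep the example self-contained. It does not call Kalign and does not reproduce what Kalign does.

```python
from typing import Callable, List

Alignment = List[str]


def drop_all_gap_columns(alignment: Alignment) -> Alignment:
    """Stand-in for an external aligner: remove columns that contain only gaps."""
    kept = [col for col in zip(*alignment) if any(c != "-" for c in col)]
    return ["".join(row) for row in zip(*kept)]


def scout_phase(
    colony: List[Alignment],
    trials: List[int],
    limit: int,
    realign: Callable[[Alignment], Alignment],
) -> List[int]:
    """Re-process every stagnated solution instead of replacing it randomly.

    trials[i] counts consecutive iterations in which colony[i] did not improve.
    Solutions whose counter has reached `limit` are passed to `realign`,
    their counter is reset, and their indices are returned.
    """
    replaced = []
    for i, count in enumerate(trials):
        if count >= limit:
            colony[i] = realign(colony[i])
            trials[i] = 0
            replaced.append(i)
    return replaced
```

Passing the aligner as a callable keeps the swarm logic independent of the external tool; in the hybrid described in the text, that callable would wrap Kalign rather than the stand-in used here.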
The remainder of this paper is organized as follows. Section 2 describes the multiple sequence alignment problem. A detailed description of HMOABC is presented in Section 3. Section 4 is devoted to the analysis of the experiments carried out and to a comparison with other approaches published in the literature. Finally, Section 5 summarizes the conclusions of the paper and discusses possible lines of future work.
Multiple sequence alignment
Multiple sequence alignment (MSA) is an alignment of more than two sequences and is considered an NP-hard optimization problem [42]. The MSA problem can be defined as follows:
Given a set of sequences S: {s1, s2, …, sk} of lengths |s1|, |s2|, …, |sk| defined over an alphabet Σ, for example ΣDNA = {A, C, G, T} or Σprotein = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.
A multiple sequence alignment of S is defined as S′: {s′1, s′2, …, s′k}, where the length of all the k
Hybrid multiobjective artificial bee colony
The Artificial Bee Colony (ABC) algorithm is a swarm-based evolutionary algorithm created by Dervis Karaboga [31], inspired by the intelligent behaviour of honey bees. In this algorithm, the position of a food source represents a feasible solution, that is to say, a feasible alignment, and the amount of nectar indicates the quality of the food source (fitness value).
In ABC, the population of individuals is known as a colony of bees, in which we find three types of bees: employed, onlooker
Experimental results
In our experiments, we have chosen the well-known Benchmark ALignment dataBASE (BALiBASE), which was developed to evaluate and compare multiple alignment methods. In particular, we have used version 3.0 [45], which contains 218 alignments. It is freely available to download. The BALiBASE 3.0 benchmark is divided into six different groups or families: RV11, RV12, RV20, RV30, RV40, and RV50; each group presents different biological characteristics, for further
Conclusions and future work
In this paper, we have introduced a novel multiobjective evolutionary algorithm based on swarm intelligence for the multiple sequence alignment problem. By using our approach (HMOABC), we are able to obtain a set of accurate and consistent solutions by simultaneously maximizing two well-known objective functions: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score.
The proposed approach is inspired by the foraging behaviour of
Acknowledgments
Álvaro Rubio-Largo is supported by the post-doctoral fellowship SFRH/BPD/100872/2014 granted by Fundação para a Ciência e a Tecnologia (FCT), Portugal. Furthermore, this work has been partially funded by the Spanish Ministry of Science and Innovation and ERDF (the European Regional Development Fund), under contract TIN2012-30685 (BIO project).
References (48)
et al. Multiple sequence alignment. J. Mol. Biol. (1986)
et al. Some biological sequence metrics. Adv. Math. (1976)
et al. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. (2000)
et al. 3dcoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. (2004)
A new hybrid artificial bee colony algorithm for robust optimal design and manufacturing. Appl. Soft Comput. (2013)
Optimization of cutting parameters in multi-pass turning using artificial bee colony-based approach. Inf. Sci. (2013)
et al. On the performance of artificial bee colony (ABC) algorithm. Appl. Soft Comput. (2008)
et al. A comparative study of artificial bee colony algorithm. Appl. Math. Comput. (2009)
et al. A hybrid multi-objective artificial bee colony algorithm for burdening optimization of copper strip production. Appl. Math. Model. (2012)
Similar amino acid sequences: chance or common ancestry? Science (1981)
Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol.
Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics
MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res.
On the complexity of multiple sequence alignment. J. Comput. Biol.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucleic Acids Res.
Fast, Scalable Generation of High-quality Protein Multiple Sequence Alignments Using
An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. U. S. A.
Fast statistical alignment. PLoS Comput. Biol.
Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res.
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res.
ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res.
MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics
MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res.
Saga: sequence alignment by genetic algorithm. Nucleic Acids Res.