Elsevier

Applied Soft Computing

Volume 41, April 2016, Pages 157-168
Hybrid multiobjective artificial bee colony for multiple sequence alignment

https://doi.org/10.1016/j.asoc.2015.12.034

Highlights

  • We propose multiobjective evolutionary computation for multiple sequence alignment.

  • We optimize two widely-used objective functions in the literature: WSP and TC.

  • Hybridization of ABC and Kalign2 to obtain more accurate alignments.

  • A study between our proposal and 13 aligners from the bioinformatics field.

Abstract

In the bioinformatics community, it is important to find an accurate, simultaneous alignment among diverse biological sequences that are assumed to share an evolutionary relationship. From the alignment, sequence homology is inferred, and the shared evolutionary origins among the sequences are extracted by phylogenetic analysis. This problem is known as the multiple sequence alignment (MSA) problem. In the literature, several approaches have been proposed to solve the MSA problem, such as progressive alignment methods, consistency-based algorithms, and genetic algorithms (GAs). In this work, we propose a hybrid multiobjective evolutionary algorithm based on the behaviour of honey bees for solving the MSA problem: the hybrid multiobjective artificial bee colony (HMOABC) algorithm. HMOABC considers two objective functions with the aim of preserving the quality and consistency of the alignment: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score. To assess the accuracy of HMOABC, we have used the BAliBASE benchmark (version 3.0), which, according to its developers, presents more challenging test cases representing the real problems encountered when aligning large sets of complex sequences. Our multiobjective approach has been compared with 13 well-known methods in the bioinformatics field and with 6 other evolutionary algorithms published in the literature.

Introduction

Any living species is represented by its biological sequence; therefore, an accurate alignment among several biological sequences is critical for finding evolutionary relationships among different species [1], [2]. This problem is known in the literature as the multiple sequence alignment (MSA) problem [3]. However, MSAs not only allow us to infer phylogenetic relationships among living species; they can also provide biological facts about proteins, since the most conserved regions are biologically significant [4]. Furthermore, an accurate MSA is highly valuable in the formulation and testing of hypotheses about protein 3-D structure and function; that is to say, it helps us to detect which regions of a gene are susceptible to mutation and which can have one residue replaced by another without changing the function.

The natural formulation of the MSA problem, in computational terms, is to define a model of sequence evolution that assigns probabilities to all possible elementary sequence edits and then to seek an optimal directed graph in which edges represent edits and terminal nodes represent the observed sequences [5]. Unfortunately, in biologically realistic models it is not possible to determine an optimal directed graph; therefore, we need to turn to approximate heuristics. A well-known heuristic is to optimize the sum of the alignment scores (SP score) between each pair of sequences.
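As a minimal sketch of the SP heuristic just described, the snippet below sums a pairwise column score over every pair of aligned rows. The match, mismatch, and gap values are illustrative assumptions, not the scoring scheme of any particular aligner; real tools typically use a substitution matrix.

```python
from itertools import combinations

GAP = "-"


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one aligned character pair with a toy scheme (values are assumptions)."""
    if a == GAP and b == GAP:
        return 0  # a gap aligned with a gap carries no information
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment: list[str]) -> int:
    """Add up the pairwise scores over every pair of rows and every column."""
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(pair_score(a, b) for a, b in zip(row_a, row_b))
    return total


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```

Because every pair of rows is visited, the cost of one evaluation grows quadratically with the number of sequences and linearly with the alignment length.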

The MSA problem may be defined as an NP-hard optimization problem [6] which can be solved by using dynamic programming with a time and space complexity of O(k 2^k L^k) [7] when aligning k sequences of length L. Although the use of dynamic programming guarantees mathematically optimal alignments, the problem space increases significantly with the number of sequences and with the length. In order to overcome this drawback, several heuristics have been proposed in the literature. We can classify them into two main categories: progressive and iterative alignments.
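To give a feel for how quickly exact dynamic programming becomes impractical, the following sketch evaluates the bound quoted above, read as k · 2^k · L^k, for a few small values of k. The function name is hypothetical and constant factors are ignored, so the numbers indicate orders of magnitude only.

```python
def dp_cost(k: int, length: int) -> int:
    """Evaluate k * 2^k * L^k, ignoring constant factors."""
    return k * (2 ** k) * (length ** k)


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        print(f"k={k}, L=100: about {dp_cost(k, 100):.2e} operations")
```

Under this reading, going from two to three sequences of length 100 raises the count from 8 × 10^4 to 2.4 × 10^7, which is why heuristics are used for realistic inputs.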

Progressive alignment is the most widely used technique for multiple sequence alignment in the literature. It starts by aligning the evolutionarily closest sequences and then continues with the more distant ones until all the sequences are aligned. This method has the advantage of being simple and very fast; however, a certain level of accuracy is not guaranteed. The main disadvantage of this method is that it can become trapped in suboptimal alignments. Among the main multiple sequence aligners published in the literature that make use of progressive alignment are Clustal W [8], Clustal Ω [9], Tree-based Consistency Objective Function For alignment Evaluation (T-Coffee) [10], PRANK [11], Fast Statistical Alignment (FSA) [12], and Kalign [13].

Iterative alignment techniques use one method to produce an initial alignment (such as a progressive method) and then refine this initial alignment over successive iterations until a given stopping criterion is met. The main idea behind this technique is therefore to consider the initial alignment as suboptimal and then refine it until no further improvements can be achieved. In the literature we find several approaches that take advantage of iterative refinement in order to obtain more accurate alignments; among the main ones are MUltiple Sequence Comparison by Log-Expectation (MUSCLE) [5], Multiple Alignment using Fast Fourier Transform (MAFFT) [14], PROBabilistic CONSistency-based multiple sequence alignment (ProbCons) [15], MSAProbs [16], and MUMMALS [17]. Genetic algorithms and evolutionary computation have also been considered for solving the multiple sequence alignment problem. We find diverse genetic algorithms (GAs) in the literature: Sequence Alignment by Genetic Algorithm (SAGA) [18], Multiple Sequence Alignment Genetic Algorithm (MSA-GA) [19], Rubber Band Technique Genetic Algorithm (RBT-GA) [20], Vertical Decomposition Genetic Algorithm (VDGA) [21], Genetic Algorithm for Multiple Sequence Alignment using Progressive Alignment Method (GAPAM) [22], and Multiobjective Optimizer for Sequence Alignments based on Structural Evaluations (MO-SAStrE) [23]. In addition, we find other single-objective approaches based on swarm intelligence, such as Artificial Bee Colony (ABC) [24], [25], Ant Colony Optimization algorithm (ACO) [26], [27], or Immune Artificial System Algorithm (IMSA) [28].

In recent years, some efforts have been made to incorporate structural information in order to obtain more accurate alignments. Basically, these methods use Protein Data Bank (PDB) structures as templates to guide the alignment of a given set of unaligned sequences using structure-based sequence alignment methods. Two examples of structure-based methods are 3D-COFFEE [29] and MO-SAStrE. The main drawback of these methods is the limited availability of PDB structures.

One of the main contributions of this work is to use multiobjective evolutionary computation to solve the MSA problem. In the literature, we find evolutionary approaches that optimize the sum-of-pairs function (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], ABC [24], [25], ACO [27], or IMSA [28]) or the column score (RBT-GA [20], ACO [26]). In [30], a multiobjective evolutionary algorithm was implemented with the aim of assembling previously aligned sequences, trying to optimize jointly the sum-of-pairs function and the column score.

In this work, we also simultaneously optimize two of the most widely used objective functions in the literature: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score. These objective functions focus on preserving the quality and the consistency of the alignment, respectively.
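The two objectives can be illustrated with a short sketch. It is not the exact formulation used by HMOABC: the identity-based residue scoring, the sequence weights, and the gap-open and gap-extend values are assumptions chosen for illustration, and the affine convention assumed here charges each maximal gap run of length n a cost of gap_open + n · gap_extend.

```python
import re
from itertools import combinations

GAP = "-"


def totally_conserved_columns(alignment: list[str]) -> int:
    """Count columns in which every row holds the same residue and no gap."""
    return sum(
        1 for col in zip(*alignment) if col[0] != GAP and len(set(col)) == 1
    )


def affine_gap_penalty(row: str, gap_open: float, gap_extend: float) -> float:
    """Charge gap_open + n * gap_extend for each maximal gap run of length n."""
    return sum(gap_open + gap_extend * len(run) for run in re.findall(r"-+", row))


def weighted_sum_of_pairs(
    alignment: list[str],
    weights: list[float],
    gap_open: float = 6.0,     # assumed value, for illustration only
    gap_extend: float = 0.85,  # assumed value, for illustration only
) -> float:
    """Weighted identity score over all row pairs minus affine gap penalties."""
    score = 0.0
    for (i, row_a), (j, row_b) in combinations(enumerate(alignment), 2):
        matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != GAP)
        score += weights[i] * weights[j] * matches
    penalty = sum(affine_gap_penalty(row, gap_open, gap_extend) for row in alignment)
    return score - penalty
```

Opening a gap costs more than extending one under an affine scheme, so a single long gap is preferred over many scattered short ones, while TC rewards columns that are identical across all sequences.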

In addition, we apply a well-known swarm intelligence approach, the Artificial Bee Colony (ABC) algorithm [31], adapted to handle multiobjective problems; we refer to it as MOABC. The ABC algorithm was developed by D. Karaboga, inspired by the foraging and dance behaviour of honey bee colonies [31]. Swarm algorithms such as ABC have been successfully applied to solve real-world problems in different domains, such as the design and manufacturing problem [32], selection of cutting parameters in machining operations [33], the structural damage detection problem [34], image segmentation problems [35], image classification [36], abnormal brain detection [37], and the path planning problem [38].

As we have mentioned, several Genetic Algorithms (GAs) have been proposed in the literature for solving the MSA problem (SAGA [18], MSA-GA [19], RBT-GA [20], VDGA [21], GAPAM [22], or MO-SAStrE [23]). Whereas GAs take the information from 2–3 parents to generate a new solution, algorithms based on swarm intelligence produce new individuals taking into account information not only from their parents, but also from the rest of the population. The effectiveness of ABC compared with traditional GAs has been widely studied in the literature [39], [40].

In the ABC algorithm, we find three types of bees: employed bees, onlooker bees, and scout bees. In the canonical ABC, an employed bee becomes a scout if it reaches a certain number of iterations with no improvement, which means that this bee is replaced by a new random solution. In our proposal, when an employed bee becomes a scout, its stagnated solution (alignment) is processed by the fast and accurate Kalign [13], which avoids the stagnation of the algorithm and promotes the diversity of the population. In this way, the multiobjective ABC algorithm proposed in this paper is hybridized with the progressive, fast, and accurate Kalign to boost the accuracy and effectiveness of the algorithm; we refer to it as the hybrid multiobjective artificial bee colony (HMOABC). In [41], a hybrid multi-objective artificial bee colony is proposed for burdening optimization of copper strip production. The main difference between the approach proposed in [41] and ours lies in the use of a deterministic heuristic (Kalign) in the scout phase of the ABC algorithm.
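The modified scout phase can be sketched as follows, under stated assumptions. `FoodSource`, `scout_phase`, and the `limit` parameter are hypothetical names, and `strip_and_pad` is only a stand-in for an external aligner such as Kalign so that the example runs on its own; it does not reproduce what Kalign does.

```python
from dataclasses import dataclass
from typing import Callable

Alignment = list[str]


@dataclass
class FoodSource:
    alignment: Alignment
    trials: int = 0  # consecutive iterations without improvement


def scout_phase(
    colony: list[FoodSource],
    limit: int,
    realign: Callable[[Alignment], Alignment],
) -> int:
    """Pass each stagnated solution through `realign` instead of randomizing it.

    Returns the number of food sources that were replaced.
    """
    replaced = 0
    for source in colony:
        if source.trials >= limit:
            source.alignment = realign(source.alignment)
            source.trials = 0  # the refreshed solution starts with a clean counter
            replaced += 1
    return replaced


def strip_and_pad(alignment: Alignment) -> Alignment:
    """Stand-in for an external aligner: drop gaps, then right-pad to equal length."""
    raw = [row.replace("-", "") for row in alignment]
    width = max(len(row) for row in raw)
    return [row.ljust(width, "-") for row in raw]


if __name__ == "__main__":
    colony = [FoodSource(["A-C", "AC-"], trials=5), FoodSource(["AC", "AC"], trials=1)]
    print(scout_phase(colony, limit=3, realign=strip_and_pad), colony)
```

Passing the aligner in as a callable keeps the sketch independent of any specific tool or command-line interface; in a real setting the callable would wrap a call to the chosen aligner.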

The remainder of this paper is organized as follows. Section 2 describes the multiple sequence alignment problem. A detailed description of HMOABC is presented in Section 3. Section 4 is devoted to the analysis of the experiments carried out and to a comparison with other approaches published in the literature. Finally, Section 5 summarizes the conclusions of the paper and discusses possible lines of future work.

Multiple sequence alignment

Multiple sequence alignment (MSA) is simply an alignment of more than two sequences and is considered an NP-hard optimization problem [42]. The MSA problem can be defined as follows:

Given a set of sequences S: {s1, s2, …, sk} of lengths |s1|, |s2|, …, |sk| defined over an alphabet Σ, for example ΣDNA = {A, C, G, T} or Σprotein = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.

A multiple sequence alignment of S is defined as S′: {s1, s2, …, sk}, where the length of the all the k
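A small validity check may help make this definition concrete. The sketch below assumes the usual conventions: all aligned rows have equal length, removing the gap symbol from each row recovers the original sequence, and no column consists only of gaps. The last condition is a common convention assumed here rather than something taken from the text above, and the function name is hypothetical.

```python
def is_valid_alignment(sequences: list[str], alignment: list[str], gap: str = "-") -> bool:
    """Check that `alignment` is a well-formed alignment of `sequences`."""
    if len(sequences) != len(alignment):
        return False
    # Every aligned row must have the same length.
    if len({len(row) for row in alignment}) != 1:
        return False
    # Assumed convention: a column made only of gaps is not allowed.
    if any(all(ch == gap for ch in col) for col in zip(*alignment)):
        return False
    # Removing gaps from each row must give back the original sequence.
    return all(row.replace(gap, "") == seq for seq, row in zip(sequences, alignment))
```

For example, `["AC-T", "ACGT"]` is a valid alignment of `["ACT", "ACGT"]` under these conventions, whereas rows of unequal length are rejected.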

Hybrid multiobjective artificial bee colony

The Artificial Bee Colony (ABC) algorithm is a swarm-based evolutionary algorithm created by Dervis Karaboga [31] and it is inspired by the intelligent behaviour of honey bees. In this algorithm, the position of a food source represents a feasible solution, that is to say, a feasible alignment; and the amount of nectar indicates the quality of the food source (fitness value).

In ABC, the population of individuals is known as colony of bees, in which we find three types of bees: employed, onlooker

Experimental results

In our experiments, we have chosen the well-known Benchmark ALignment dataBASE (BAliBASE), which was developed to evaluate and compare multiple alignment methods. In particular, we have used version 3.0 [45], which contains 218 alignments. It is freely available to download. The BAliBASE 3.0 benchmark is divided into six different groups or families: RV11, RV12, RV20, RV30, RV40, and RV50; each group presents different biological characteristics, for further

Conclusions and future work

In this paper, we have introduced a novel multiobjective evolutionary algorithm based on swarm intelligence for the multiple sequence alignment problem. By using our approach (HMOABC), we are able to obtain a set of accurate and consistent solutions by simultaneously maximizing two well-known objective functions: the weighted sum-of-pairs function with affine gap penalties (WSP) and the number of totally conserved (TC) columns score.
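Because two objectives are maximized at once, the outcome is a set of trade-off solutions rather than a single best alignment. The sketch below shows the standard Pareto-dominance test and a simple filter for non-dominated points; the function names and the (WSP, TC) tuples are illustrative assumptions, not results from the paper.

```python
Objectives = tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: list[Objectives]) -> list[Objectives]:
    """Keep the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Made-up (WSP, TC) pairs used only to exercise the functions.
    print(pareto_front([(10.0, 2), (8.0, 3), (7.0, 1), (9.0, 3)]))
```

This filter compares every pair of points, which is adequate for small populations; larger ones usually call for a faster non-dominated sorting procedure.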

The proposed approach is inspired by the foraging behaviour of

Acknowledgments

Álvaro Rubio-Largo is supported by the post-doctoral fellowship SFRH/BPD/100872/2014 granted by Fundação para a Ciência e a Tecnologia (FCT), Portugal. Furthermore, this work has been partially funded by the Spanish Ministry of Science and Innovation and ERDF (the European Regional Development Fund), under contract TIN2012-30685 (BIO project).

References (48)

  • D. Feng et al., Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. (1987)
  • C. Notredame, Recent progresses in multiple sequence alignment: a survey, Pharmacogenomics (2002)
  • R.C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res. (2004)
  • L. Wang et al., On the complexity of multiple sequence alignment, J. Comput. Biol. (1994)
  • J.D. Thompson et al., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice, Nucleic Acids Res. (1994)
  • F. Sievers et al., Fast, Scalable Generation of High-quality Protein Multiple Sequence Alignments Using (2011)
  • A. Loytynoja et al., An algorithm for progressive multiple alignment of sequences with insertions, Proc. Natl. Acad. Sci. U. S. A. (2005)
  • R.K. Bradley et al., Fast statistical alignment, PLoS Comput. Biol. (2009)
  • T. Lassmann et al., Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features, Nucleic Acids Res. (2009)
  • K. Katoh et al., MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res. (2002)
  • C.B. Do et al., ProbCons: probabilistic consistency-based multiple sequence alignment, Genome Res. (2005)
  • Y. Liu et al., MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities, Bioinformatics (2010)
  • J. Pei et al., MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information, Nucleic Acids Res. (2006)
  • C. Notredame et al., Saga: sequence alignment by genetic algorithm, Nucleic Acids Res. (1996)