Elsevier

Information Sciences

Volume 297, 10 March 2015, Pages 118-139
A cascaded pairwise biomolecular sequence alignment technique using evolutionary algorithm

https://doi.org/10.1016/j.ins.2014.11.009

Abstract

In computational biology, biological sequence alignment is an important and challenging task for sequence analysis. In this paper, we propose a new sequence alignment technique based on a genetic algorithm (GA) for determining the optimal alignment score for a pair of sequences that could be either DNA or protein sequences. The search space requirement of the proposed genetic-based method, named Cascaded Pairwise Alignment with Genetic Algorithm (CPAGA), is reduced by breaking a large space into smaller subspaces. This is performed by decomposing the sequence pair into multiple segments before starting the alignment procedure. Such decomposition enhances the ability of the search process to reach the global or a near-global optimal solution even for the longer sequences. The method was tested using several DNA/protein sequence pairs. We also compared the alignment score of the CPAGA with that of some well-known and relevant alignment techniques. The performance of the CPAGA method and other relevant techniques was assessed by a set of non-parametric statistical approaches, which suggest a superior performance of CPAGA over the other alignment procedures.

Introduction

Biological sequences, such as nucleotide and amino acid sequences, accumulate mutations in the course of evolution. Some specific residues in an amino acid sequence are conserved by natural selection because they play important functional and structural roles. Knowledge of the evolutionary history and structural properties of a sequence can be useful for finding related sequences. Comparing new sequences with existing ones, in order to draw inferences on evolutionary, functional, and/or structural relationships, is one of the primary objectives of bioinformatics. If two sequences from different sources have more than 30% sequence identity, they are considered to be homologous, i.e., to share a common ancestral gene [39].

Sequence alignment is an essential tool in bioinformatics, in which two or more sequences are compared to align their residues (e.g., nucleotide bases of DNA and RNA, or amino acids of a protein). The optimal alignment procedure arranges sequences such that a maximum number of either identical or similar residues are matched [24]. Thus, pairwise alignment arranges two sequences in a way that maximizes their similarities or identities. Multiple Sequence Alignment (MSA) is an extension of the pairwise alignment procedure, in which more than two sequences are considered [39]. The aligned residues arranged in parallel rows can be a match, a mismatch, or a gap. Gaps are often inserted to signify insertion or deletion (indel) events in a sequence alignment, and are hence assigned penalties in the scoring process. For DNA sequences, alignment scoring typically follows a simple identity scheme in which identical bases in both sequences are assigned positive scores. In contrast, for protein sequences, a similarity score can also be computed (in addition to sequence identity), reflecting the alignment of amino acids that show similar physicochemical properties (e.g., Ser and Thr). Substitution matrices are then consulted to measure similarity. The substitution matrices frequently used for protein sequence alignment are Point Accepted Mutation (PAM) [3] and BLOcks SUbstitution Matrix (BLOSUM) [15].
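The scoring process described above can be illustrated with a minimal sketch. The function name `alignment_score`, the default scores (match +1, mismatch -1, gap -2), and the two-letter toy substitution table are assumptions chosen for illustration; they are not the scoring parameters used in the paper, and the table is not a complete PAM or BLOSUM matrix.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2, substitution=None):
    """Score two already-aligned rows of equal length; '-' marks a gap.

    If `substitution` is given (a dict keyed by residue pairs), it replaces
    the simple match/mismatch scheme, as a substitution matrix would for
    protein sequences.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap                      # indel: apply the gap penalty
        elif substitution is not None:
            # Look the pair up in either order, since the table is symmetric.
            total += substitution.get((a, b), substitution.get((b, a)))
        elif a == b:
            total += match                    # identical residues
        else:
            total += mismatch                 # substitution
    return total


# Toy similarity table (illustrative values only, hypothetical).
TOY_MATRIX = {("S", "S"): 4, ("T", "T"): 5, ("S", "T"): 1}

dna_score = alignment_score("GATTACA", "GA-TACA")                        # six matches, one gap
protein_score = alignment_score("ST", "TT", substitution=TOY_MATRIX)    # S/T similar, T/T identical
```

With the identity scheme, the DNA pair scores six matches minus one gap penalty; with the toy table, the Ser/Thr column still contributes a positive score even though the residues differ.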

There are two main types of sequence alignment: global and local. Global alignment considers the total length of each sequence, whereas local alignment tries to align the locally highest scoring region of densely similar characters regardless of the remaining sequence length.

For a pairwise alignment problem, Dynamic Programming (DP) constructs a two-dimensional matrix whose two axes represent the two sequences. DP attempts to exhaustively align all possible pairs of residues according to a scoring scheme for matches, mismatches, and gaps. The highest score is ultimately obtained by determining the optimal path through the matrix by back-tracing [6], [40]. The DP-based Needleman–Wunsch algorithm [26] is used for global alignment, whereas the Smith–Waterman algorithm [35] is used for local alignment. DP has been shown to produce an optimal alignment [33]. An alignment that optimizes the scoring function is also taken to be biologically optimal, but computing it is rarely feasible when more than three sequences are considered [27]. The computation itself is also complex and demands substantial computing resources when applied to MSA [37].
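The DP recurrence can be sketched in a few lines. The function below is an illustrative textbook version with a linear gap penalty, not the implementation used in the paper: the name `dp_alignment_score` and the default scores are assumptions, and back-tracing is omitted so that only the optimal score is returned. With `local=False` it follows the Needleman–Wunsch scheme; with `local=True` it follows the Smith–Waterman scheme, in which cell values are floored at zero and the best cell anywhere in the matrix is reported.

```python
def dp_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal global (default) or local alignment score."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized along both borders.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,   # align the two residues
                score[i - 1][j] + gap,        # gap in seq_b
                score[i][j - 1] + gap,        # gap in seq_a
            )
            if local:
                cell = max(cell, 0)           # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell

    # Global: bottom-right cell covers both full sequences. Local: best cell.
    return best if local else score[-1][-1]


global_score = dp_alignment_score("GATTACA", "GATACA")
local_score = dp_alignment_score("TTTACGTTT", "GGGACGGGG", local=True)
```

The matrix has (m + 1) x (n + 1) cells for sequences of lengths m and n, which is the quadratic time and memory cost that motivates heuristic and segmented approaches for long sequences.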

Therefore, to reduce this computational complexity, various heuristic approaches have been developed that can provide a solution to a problem; however, they do not guarantee finding the global optimum. BLAST [1] and FASTA [31] are the most commonly used heuristic algorithms for pairwise alignment. For MSA, three types of heuristic algorithms are generally used: progressive, iterative, and consistency-based algorithms [22], [27]. Clustal W [36] is a well-known progressive alignment algorithm that is widely used for MSA. It starts with determining every possible global pairwise alignment [26] of the input sequences and then produces a distance matrix. Finally, it generates a consensus alignment by gradually adding sequences following a guide tree based on the distance matrix. However, the main drawback of this method is its so-called “greediness,” in that once a sequence is aligned it cannot be altered. On the other hand, in an iterative alignment technique, the optimal solution is achieved by iteratively modifying the suboptimal solutions in the intermediate stages, which helps to solve the “greedy” nature of the progressive approaches. MUSCLE [7] is an iterative alignment method that solves the alignment problem using a profile function called the log expectation score. Another heuristic method is progressive alignment with a consistency-based scoring scheme. This scoring scheme depends on the collection of methods that simultaneously align two sequences. In the T-Coffee package [29], the collection of pairwise alignments is the combination of global (produced with Clustal W) and local (produced with Lalign [19]) alignments. For aligning every pair of residues in a sequence pair, a consistency score is estimated from the collection of methods. Since an optimal initial alignment is progressively chosen from many alternative alignments during alignment construction, T-Coffee overcomes the greedy nature of the progressive approach.

In this paper, we propose a cascaded genetic algorithm (GA)-based approach to solve the pairwise sequence alignment problem. The technique is named Cascaded Pairwise Alignment with Genetic Algorithm (CPAGA). The novelty of this algorithm for sequence alignment is that it can be applied as an optimization tool in a large and complex search space. Its cascaded nature first breaks the large search space into several smaller subspaces by decomposing the sequence pair into multiple segments. Then it starts searching for solutions over the subspaces. As a result, the proposed technique shows a good ability to escape local optima. The final alignment score is the summation of the best scores found in the subspaces.
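The decompose-then-sum idea can be sketched as follows. This is a rough illustration under stated assumptions, not the CPAGA procedure itself: this excerpt does not specify how segment boundaries are chosen, so the sketch simply cuts both sequences into the same number of near-equal contiguous pieces. The per-segment GA search is replaced by a pluggable callable, and the default `gapless_score` is a hypothetical stand-in that performs no search at all. All function names and score values are assumptions.

```python
def split_into_segments(seq, n_segments):
    """Cut seq into n_segments contiguous pieces of near-equal length."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    base, extra = divmod(len(seq), n_segments)
    segments, start = [], 0
    for k in range(n_segments):
        end = start + base + (1 if k < extra else 0)  # first `extra` pieces get one more
        segments.append(seq[start:end])
        start = end
    return segments


def gapless_score(seg_a, seg_b, match=1, mismatch=-1, gap=-2):
    """Stand-in for the per-segment search (assumption, not a GA).

    Pairs residues position by position and charges any length difference
    as trailing gaps.
    """
    paired = sum(match if a == b else mismatch for a, b in zip(seg_a, seg_b))
    return paired + gap * abs(len(seg_a) - len(seg_b))


def cascaded_score(seq_a, seq_b, n_segments, segment_search=gapless_score):
    """Sum the best score found for each pair of corresponding segments."""
    segs_a = split_into_segments(seq_a, n_segments)
    segs_b = split_into_segments(seq_b, n_segments)
    return sum(segment_search(a, b) for a, b in zip(segs_a, segs_b))


pieces = split_into_segments("ACGTACGTAC", 3)
total = cascaded_score("ACGTACGT", "ACGAACGT", n_segments=2)
```

Any function that returns the best score for one segment pair, such as a GA run over that subspace, can be passed as `segment_search`; each call then explores a much smaller space than a search over the full sequence pair.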

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the conventional genetic method and its basic steps algorithmically. Section 4 elaborates the proposed technique with examples. The experimental results are reported in Section 5, where they are compared analytically and statistically with some well-known and relevant algorithms. Finally, Section 6 concludes the paper.

Section snippets

Related works

To solve the sequence alignment problem in an optimized way, beyond the conventional approaches, several researchers have applied various evolutionary techniques, which are heuristic in nature. Cutello et al. [2] developed a hybrid bio-inspired algorithm known as the Immunological MSA algorithm (IMSA), which acts as an improver to refine the best initial alignment produced by Clustal W. Wang and Li [38] proposed an iterative optimized algorithm to move blocks of gaps in MSA to refine the aligned

Simple genetic algorithm

The GA, invented by Holland [17], is a useful tool for optimization problems. It is a class of stochastic search methods inspired by the Darwinian concept of survival of the fittest through natural selection, for which reason it is categorized as a class of evolutionary algorithms. GA iteratively tries to find the globally optimal solution in a search space where each point represents a solution. Each solution is encoded as a fixed-length string called an individual or a chromosome. Chromosomes are

Cascaded Pairwise Alignment with Genetic Algorithm (CPAGA)

The objective of pairwise sequence alignment is to find maximum matches between two sequences. Gaps may be introduced in an alignment, and sometimes they are completely absent. In an aligned sequence pair, an insertion of a gap in a sequence can always be seen as a deletion. Gap insertion and deletion are relatively less frequent compared to substitution. Therefore, the sequence alignment can be done optimally in two ways, either with or without introducing gaps. In our earlier approach [10], a

Experimental results

In the experiment, we evaluated the performance of the proposed method on DNA and protein sequence pairs. The results of CPAGA were compared with some well-known sequence matching approaches. The comparison was also performed using non-parametric statistical approaches such as the Friedman test [9], the Multiple sign-test [32], and Contrast estimation [5]. The proposed genetic-based method was implemented on an IBM Power 6 system with 8 GB RAM per core. The operating system is AIX (an

Conclusion

The novel approach used in the proposed method is decomposition or segmentation of the sequences. This approach provides several advantages in the computation of the CPAGA. Decomposition of the test sequence pairs reduces the search space, which ultimately diminishes the possibility of falling into local optima by reducing the space complexity. The decomposition process also enables computation to be performed with less memory allocation (relative to the non-segmentation process). Thus, the

Acknowledgements

The authors would like to sincerely thank the reviewers for their helpful and constructive suggestions and comments to improve the quality of the paper. The authors would also like to thank Mr. R. Banerjee, Dr. Abhijit Sinha Roy, S. Majumdar and Mr. N. Sanpui for their helpful support in preparation of the manuscript.

References (40)

  • V. Cutello et al., Protein multiple sequence alignment by hybrid bioinspired algorithms, Nucl. Acids Res. (2011)
  • M.O. Dayhoff et al., A model of evolutionary change in proteins
  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006)
  • K. Doksum, Robust procedures for some linear models with one observation per cell, Ann. Math. Stat. (1967)
  • R. Durbin et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1998)
  • R.C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucl. Acids Res. (2004)
  • R.A. Fisher, Statistical Methods and Scientific Inference (1959)
  • M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J. Am. Stat. Assoc. (1937)
  • G. Garai et al., A novel genetic approach for optimized biological sequence alignment, J. Biophys. Chem. (2012)
  • D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning (1989)