A cascaded pairwise biomolecular sequence alignment technique using evolutionary algorithm

doi:10.1016/j.ins.2014.11.009

Information Sciences

Volume 297, 10 March 2015, Pages 118-139

https://doi.org/10.1016/j.ins.2014.11.009

Abstract

In computational biology, biological sequence alignment is an important and challenging task for sequence analysis. In this paper, we propose a new sequence alignment technique based on a genetic algorithm (GA) for determining the optimal alignment score for a pair of sequences that could be either DNA or protein sequences. The search space requirement of the proposed genetic-based method, named Cascaded Pairwise Alignment with Genetic Algorithm (CPAGA), is reduced by breaking a large space into smaller subspaces. This is performed by decomposing the sequence pair into multiple segments before starting the alignment procedure. Such decomposition enhances the ability of the search process to reach the global or a near-global optimal solution even for the longer sequences. The method was tested using several DNA/protein sequence pairs. We also compared the alignment score of the CPAGA with that of some well-known and relevant alignment techniques. The performance of the CPAGA method and other relevant techniques was assessed by a set of non-parametric statistical approaches, which suggest a superior performance of CPAGA over the other alignment procedures.

Introduction

Biological sequences, such as nucleotide and amino acid sequences, accumulate mutations in the course of evolution. Some specific residues in an amino acid sequence are conserved by natural selection because they play important functional and structural roles. Knowledge of the evolutionary history and structural properties of a sequence can help in finding related sequences. Comparing new sequences with existing ones to draw inferences about evolutionary, functional, and/or structural relationships is one of the primary objectives of bioinformatics. If two sequences from different sources have more than 30% sequence identity, they are considered to be homologous, i.e., to share a common ancestral gene [39].

Sequence alignment is an essential tool in bioinformatics, in which two or more sequences are compared to align their residues (e.g., nucleotide bases of DNA and RNA, or amino acids of a protein). The optimal alignment procedure arranges sequences such that a maximum number of either identical or similar residues are matched [24]. Thus, pairwise alignment arranges two sequences in a way that maximizes their similarities or identities. Multiple Sequence Alignment (MSA) is an extension of the pairwise alignment procedure, in which more than two sequences are considered [39]. Each column of aligned residues, arranged in parallel rows, can be a match, a mismatch, or a gap. Gaps are inserted to signify insertion or deletion (indel) events in a sequence alignment, and are therefore assigned penalties in the scoring process. For DNA sequences, alignment scoring uses a simple identity scheme in which identical bases in both sequences are assigned positive scores. In contrast, for protein sequences, a similarity score can also be computed (in addition to sequence identity) to reflect the alignment of amino acids that show similar physicochemical properties (e.g., Ser and Thr). Substitution matrices are consulted to measure this similarity. The substitution matrices most frequently used for protein sequence alignment are Point Accepted Mutation (PAM) [3] and BLOcks SUbstitution Matrix (BLOSUM) [15].
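The column-wise scoring described above can be illustrated with a minimal sketch. The function names and the score values (match = 1, mismatch = -1, gap = -2, with a linear gap penalty) are assumptions chosen for illustration and are not taken from the paper; percent identity is computed here over the full alignment length, which is only one of several conventions in use.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two equal-length aligned rows ('-' marks a gap).

    The default score values are illustrative assumptions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


def percent_identity(row_a, row_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    same = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * same / len(row_a)


# Five matches, one mismatch and one gap: 5*1 + (-1) + (-2) = 2
print(score_alignment("GATTACA", "GA-TGCA"))
print(round(percent_identity("GATTACA", "GA-TGCA"), 1))
```

For protein sequences, the match/mismatch branch would instead look up the residue pair in a substitution matrix such as PAM or BLOSUM.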

There are two main types of sequence alignment: global and local. Global alignment considers the total length of each sequence, whereas local alignment tries to align the locally highest scoring region of densely similar characters regardless of the remaining sequence length.

For a pairwise alignment problem, Dynamic Programming (DP) constructs a two-dimensional matrix whose two axes represent the two sequences. DP exhaustively evaluates all possible pairings of residues according to a scoring scheme for matches, mismatches, and gaps. The highest score is ultimately obtained by determining the optimal path through the matrix by traceback [6], [40]. The DP-based Needleman–Wunsch algorithm [26] is used for global alignment, whereas the Smith–Waterman algorithm [35] is used for local alignment. DP has been demonstrated to produce the optimal alignment [33]. The alignment that optimizes the scoring function is also regarded as the biologically optimal alignment, but computing it is rarely feasible when more than three sequences are considered [27]. The computation is also complex and demands substantial computer resources when applied to MSA [37].
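As a minimal sketch of the two DP recurrences named above, the following functions compute only the optimal score (the traceback that recovers the alignment itself is omitted). The function names and the linear gap penalty with match = 1, mismatch = -1, gap = -2 are illustrative assumptions, not the scoring scheme used in the paper.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (full length of both sequences)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        H[i][0] = i * gap            # prefix of s aligned against gaps only
    for j in range(1, cols):
        H[0][j] = j * gap            # prefix of t aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + sub,   # align s[i-1] with t[j-1]
                          H[i - 1][j] + gap,       # gap in t
                          H[i][j - 1] + gap)       # gap in s
    return H[-1][-1]


def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score (best-scoring pair of substrings)."""
    H = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets a local alignment restart at any cell.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(needleman_wunsch_score("ACGT", "ACT"))
print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Both functions fill a matrix of size (len(s)+1) by (len(t)+1), so time and memory grow with the product of the sequence lengths.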

Therefore, to reduce this computational complexity, various heuristic approaches have been developed that provide good solutions but do not guarantee finding the global optimum. BLAST [1] and FASTA [31] are the most commonly used heuristic algorithms for pairwise alignment. For MSA, three types of heuristic algorithms are generally used: progressive, iterative, and consistency-based algorithms [22], [27]. Clustal W [36] is a well-known progressive alignment algorithm that is widely used for MSA. It starts by computing every possible global pairwise alignment [26] of the input sequences and then produces a distance matrix. Finally, it generates a consensus alignment by gradually adding sequences following a guide tree based on the distance matrix. However, the main drawback of this method is its so-called “greediness,” in that once a sequence is aligned it cannot be altered. On the other hand, in an iterative alignment technique, the optimal solution is approached by iteratively modifying the suboptimal solutions of the intermediate stages, which helps to overcome the “greedy” nature of the progressive approaches. MUSCLE [7] is an iterative alignment method that solves the alignment problem using a profile function called the log expectation score. Another heuristic method is progressive alignment with a consistency-based scoring scheme. This scoring scheme depends on a collection of methods that simultaneously align two sequences. In the T-Coffee package [29], the collection of pairwise alignments is the combination of global (produced with Clustal W) and local (produced with Lalign [19]) alignments. For aligning every pair of residues in a sequence pair, a consistency score is estimated from the collection of methods. Since an optimal initial alignment is progressively chosen from many alternative alignments during alignment construction, T-Coffee overcomes the greedy nature of the progressive approach.

In this paper, we propose a cascaded genetic algorithm (GA)-based approach to the pairwise sequence alignment problem. The technique is named Cascaded Pairwise Alignment with Genetic Algorithm (CPAGA). The novelty of this algorithm for sequence alignment is that it can be applied as an optimization tool in a large and complex search space. Its cascaded nature first breaks the large search space into several smaller subspaces by decomposing the sequence pair into multiple segments. It then searches for solutions over the subspaces. As a result, the proposed technique shows a good ability to escape local optima. The final alignment score is the summation of the best scores found in the subspaces.
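The decompose-then-sum idea can be sketched as follows. This is not the CPAGA implementation: this excerpt does not specify how segment boundaries are chosen, so the equal-length cuts in `split_even` are an assumption, and a small DP scorer (`global_score`, with assumed scores match = 1, mismatch = -1, gap = -2) stands in for the per-segment GA search. All names are hypothetical.

```python
def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Stand-in per-segment optimizer: global DP score with a linear gap penalty."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [i * gap]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def split_even(seq, k):
    """Cut seq into k contiguous pieces whose lengths differ by at most one.

    Equal-length cuts are an assumption; the excerpt does not give the rule.
    """
    bounds = [i * len(seq) // k for i in range(k + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(k)]


def cascaded_score(s, t, k, segment_scorer=global_score):
    """Align the i-th segment of s with the i-th segment of t and sum the scores."""
    return sum(segment_scorer(a, b)
               for a, b in zip(split_even(s, k), split_even(t, k)))


s, t = "ACGTACGTAC", "ACGTTCGTAC"
print(cascaded_score(s, t, k=2), global_score(s, t))
```

With a linear gap penalty, concatenating the segment alignments yields a valid alignment of the full pair, so the summed score can never exceed the full-length optimum; each subproblem, however, has a much smaller search space.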

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the conventional genetic algorithm and its basic steps. Section 4 elaborates the proposed technique with examples. The experimental results are reported in Section 5, where they are compared analytically and statistically with those of some well-known and relevant algorithms. Finally, Section 6 concludes the paper.

Related works

To solve the sequence alignment problem in an optimized way other than by conventional approaches, several researchers have applied various evolutionary techniques, which are heuristic in nature. Cutello et al. [2] developed a hybrid bio-inspired algorithm known as the Immunological MSA algorithm (IMSA), which acts as an improver that refines the best initial alignment produced by Clustal W. Wang and Li [38] proposed an iterative optimization algorithm that moves blocks of gaps in an MSA to refine the aligned

Simple genetic algorithm

The GA, invented by Holland [17], is a useful tool for optimization problems. It belongs to a class of stochastic search methods inspired by the Darwinian concept of survival of the fittest through natural selection, and is therefore categorized as an evolutionary algorithm. A GA iteratively searches for the globally optimal solution in a search space in which each point represents a solution. Each solution is encoded as a fixed-length string called an individual or a chromosome. Chromosomes are
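A generic GA loop over fixed-length chromosomes can be sketched as below. This is a textbook-style illustration, not the configuration used in the paper: the binary encoding, tournament selection, one-point crossover, bit-flip mutation, single-individual elitism, and all parameter values are assumptions, and `run_ga` is a hypothetical name. The toy fitness function simply counts the 1-bits in the chromosome.

```python
import random


def run_ga(fitness, length, pop_size=30, generations=60,
           p_cross=0.8, p_mut=0.02, seed=0):
    """Maximize `fitness` over bit lists of the given length (length >= 2)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]

    def pick():
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]                      # elitism: keep the current best
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < p_cross:            # one-point crossover
                cut = rng.randint(1, length - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history


best, history = run_ga(sum, length=20)
print(sum(best), history[0], history[-1])
```

Because the best individual is copied unchanged into each new generation, the recorded best fitness never decreases; reaching the global optimum is still not guaranteed.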

Cascaded Pairwise Alignment with Genetic Algorithm (CPAGA)

The objective of pairwise sequence alignment is to find maximum matches between two sequences. Gaps may be introduced in an alignment, and sometimes they are completely absent. In an aligned sequence pair, an insertion of a gap in a sequence can always be seen as a deletion. Gap insertion and deletion are relatively less frequent compared to substitution. Therefore, the sequence alignment can be done optimally in two ways, either with or without introducing gaps. In our earlier approach [10], a

Experimental results

In the experiments, we evaluated the performance of the proposed method on DNA and protein sequence pairs. The results of CPAGA were compared with those of some well-known sequence matching approaches. The comparison was also performed using non-parametric statistical approaches such as the Friedman test [9], the multiple sign test [32], and contrast estimation [5]. The proposed genetic-based method was implemented on an IBM Power 6 system with 8 GB RAM per core. The operating system is AIX (an

Conclusion

The novel approach used in the proposed method is decomposition or segmentation of the sequences. This approach provides several advantages in the computation of the CPAGA. Decomposition of the test sequence pairs reduces the search space, which ultimately diminishes the possibility of falling into local optima by reducing the space complexity. The decomposition process also enables computation to be performed with less memory allocation (relative to the non-segmentation process). Thus, the

Acknowledgements

The authors would like to sincerely thank the reviewers for their helpful and constructive suggestions and comments to improve the quality of the paper. The authors would also like to thank Mr. R. Banerjee, Dr. Abhijit Sinha Roy, S. Majumdar and Mr. N. Sanpui for their helpful support in preparation of the manuscript.

References (40)

S.F. Altschul et al. Basic local alignment search tool. J. Mol. Biol. (1990)

S. García et al. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf. Sci. (2010)

X. Huang et al. A time-efficient, linear-space local similarity algorithm. Adv. Appl. Math. (1991)

S.R. Jangam et al. A novel method for alignment of two nucleic acid sequences using ant colony optimization and genetic algorithms. Appl. Soft Comput. (2007)

Z.J. Lee et al. Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment. Appl. Soft Comput. (2008)

S.B. Needleman et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. (1970)

C. Notredame et al. T-coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. (2000)

W.R. Pearson. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. (1990)

T.F. Smith et al. Identification of common molecular subsequences. J. Mol. Biol. (1981)

Y. Wang et al. An adaptive and iterative algorithm for refining multiple sequence alignment. Comput. Biol. Chem. (2004)

V. Cutello et al. Protein multiple sequence alignment by hybrid bioinspired algorithms. Nucl. Acids Res. (2011)

M.O. Dayhoff et al. A model of evolutionary change in proteins

J. Demšar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. (2006)

K. Doksum. Robust procedures for some linear models with one observation per cell. Ann. Math. Stat. (1967)

R. Durbin et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1998)

R.C. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res. (2004)

R.A. Fisher. Statistical Methods and Scientific Inference (1959)

M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. (1937)

G. Garai et al. A novel genetic approach for optimized biological sequence alignment. J. Biophys. Chem. (2012)

D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning (1989)
