Elsevier

Information Sciences

Volume 420, December 2017, Pages 278-298
Using biological knowledge for multiple sequence aligner decision making

https://doi.org/10.1016/j.ins.2017.08.069

Abstract

Multiple Sequence Alignment (MSA) is the simultaneous alignment of three or more biological sequences (nucleotides or amino acids). In recent years, considerable effort has been devoted to the development of MSA approaches. In this work, we propose a framework that extracts the biological characteristics of an input set of unaligned sequences and uses this knowledge to decide which aligner and parameter configuration are most suitable. We refer to it as the Multiple Aligner Framework (MAF). The tuple {Aligner, Configuration} is selected by searching a pre-computed file for the best tuple found for a dataset with similar biological characteristics. To create this file, we use multiobjective optimization; specifically, three well-known multiobjective evolutionary algorithms (NSGA-II, IBEA and MOEA/D) have been used. To validate the framework, we have used five popular benchmark suites: BAliBASE 3.0, PREFAB 4.0, SABmark 1.65, OX-Bench and CDD 3.14. After comparing with well-known aligners published in the literature, such as Kalign2, MUSCLE, MAFFT, T-Coffee, MSAProbs, ProbCons, Clustal Ω and MUMMALS, we conclude that the Multiple Aligner Framework is, on average, the method with the best balance between alignment accuracy/conservation and required runtime.

Introduction

In molecular biology, the problem of simultaneously aligning three or more biological sequences is known as Multiple Sequence Alignment (MSA) [1]. MSA is an important step in inferring phylogenetic relationships among the input sequences [6], [8]. An accurate and conservative MSA leads to strong biological significance, which is critical in the study of proteins and nucleotides.

Given a set of $k$ unaligned sequences $S = \{s_1, s_2, \ldots, s_k\}$ defined over an alphabet $\Sigma$ (the amino acid or nucleotide alphabet), the multiple sequence alignment of this set is defined as $S' = \{s_1', s_2', \ldots, s_k'\}$, where all the sequences are of equal length. Therefore, the produced alignment ($S'$) will be defined over the alphabet $\Sigma \cup \{-\}$, where $-$ refers to the gap symbol.

For example, given the following set of unaligned sequences (S):

a possible alignment (S′) would be:

As we can see, the alignment S′ is represented by a matrix, where rows refer to the sequences and the columns to the aligned symbols. In addition, each column must contain at least one symbol of the alphabet; therefore, columns with all gap symbols are not allowed. According to [1], the MSA problem is an NP-hard optimization problem where the time complexity depends on the maximum length (L) and on the number of unaligned sequences (k); therefore, it has order of $O(k\,2^k L^k)$.
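The constraints above can be checked mechanically. The following is a minimal sketch, not code from the paper: it tests that all rows have equal length, that no column consists only of gaps, and (an implicit requirement added here) that removing the gaps recovers each input sequence. The toy sequences and the name `is_valid_alignment` are hypothetical.

```python
GAP = "-"


def is_valid_alignment(unaligned, aligned):
    """Check a candidate alignment against the constraints described above."""
    if not aligned or len(unaligned) != len(aligned):
        return False
    # All rows of the alignment matrix must have the same length.
    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        return False
    # No column may consist only of gap symbols.
    for column in zip(*aligned):
        if all(symbol == GAP for symbol in column):
            return False
    # Removing the gaps must give back the original sequences.
    return all(row.replace(GAP, "") == seq
               for seq, row in zip(unaligned, aligned))


# Hypothetical toy sequences (not the example used in the paper).
S = ["ACGT", "AGT", "ACT"]
S_aligned = ["ACGT", "A-GT", "AC-T"]
print(is_valid_alignment(S, S_aligned))  # True
```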

In recent years, considerable effort has been devoted to the development of MSA approaches. The first approaches were based on dynamic programming, which provides optimal alignments; however, its required runtime increases exponentially with the number of sequences to align. Other MSA algorithms, proposed to find pseudo-optimal alignments, may be classified into three groups: progressive, consistency-based and iterative.

Progressive aligners follow three steps: (i) compute the pairwise distances between all pairs of sequences to determine the similarity of each pair, (ii) build a guide tree based on the resulting distance matrix and (iii) align the sequences following the order determined by the guide tree.
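Steps (i) and (ii) can be illustrated with a short sketch. It rests on assumptions that are not taken from any specific aligner: the distance is one minus the Jaccard similarity of the k-mer sets, and the guide tree is built by a greedy average-linkage clustering. All function names are hypothetical, and step (iii) is only marked by a comment.

```python
from itertools import combinations


def kmer_distance(a, b, k=2):
    """Assumed distance: 1 - Jaccard similarity of the two k-mer sets."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    return 1.0 - len(kmers_a & kmers_b) / len(union) if union else 0.0


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((i, j))] for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def guide_tree_merges(seqs):
    """Return the order in which groups of sequence indices would be aligned."""
    # Step (i): distances between all pairs of sequences.
    dist = {frozenset((i, j)): kmer_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
    # Step (ii): greedy average-linkage clustering acts as the guide tree.
    clusters = [(i,) for i in range(len(seqs))]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(pair[0], pair[1], dist))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        # Step (iii) would align the two groups at this point.
        merges.append((c1, c2))
    return merges


# Hypothetical toy input: the first two sequences are the most similar.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
merges = guide_tree_merges(seqs)
print(merges[0])  # ((0,), (1,))
```

The two most similar sequences are joined first, and the remaining sequence is added to that group last, which is the order a progressive aligner would follow.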

Consistency-based aligners are driven by two libraries of pairwise alignments, one global and one local. These approaches involve four steps: (i) compute the global pairwise alignments and construct the global library, (ii) compute the local pairwise alignments and construct the local library, (iii) assign a weight to each pair of aligned sequences taking into account the information in the global and local libraries and (iv) construct the alignment using this information.

The last group corresponds to the iterative approaches. The main steps are: (i) compute a guide tree and build a preliminary alignment by using any progressive algorithm, (ii) divide the guide tree into two subtrees that are re-aligned to obtain an improved alignment and (iii) if the maximum number of iterations has been reached, output the alignment; otherwise, repeat step (ii).

In Table 1, we present the most important aligners published in the literature, organized into the three groups. Additionally, some evolutionary and genetic algorithm techniques have been proposed for the MSA problem [21], [22].

As we can see, the literature offers diverse approaches for dealing with the MSA problem. The vast majority of them use flags to modify certain alignment parameters. Different values for these parameters lead to different alignments; therefore, a proper parameter configuration of the aligner is critical to obtain an accurate output. If no flag is given, the aligner uses a default parameter configuration proposed by its developers.
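To make the role of flags concrete, the sketch below builds command lines for a few {aligner, configuration} tuples. It is an illustration under assumptions: the configuration names and numeric values are hypothetical and not taken from the paper, and the flags shown (`--op`, `--ep` and `--maxiterate` for MAFFT; `-in`, `-out` and `-maxiters` for MUSCLE v3.8) should be checked against the documentation of the installed versions. The code only assembles argument lists and does not run any aligner.

```python
# Hypothetical {aligner, configuration} tuples; the numeric values are
# illustrative and are not taken from the paper.
CONFIGURATIONS = {
    ("mafft", "default"): [],
    ("mafft", "custom_gaps"): ["--op", "2.0", "--ep", "0.1",
                               "--maxiterate", "100"],
    ("muscle", "default"): [],
    ("muscle", "two_iterations"): ["-maxiters", "2"],
}


def build_command(aligner, configuration, input_path, output_path):
    """Build the argument list for one {aligner, configuration} tuple."""
    flags = CONFIGURATIONS[(aligner, configuration)]
    if aligner == "mafft":
        # MAFFT prints the alignment to standard output, so the caller is
        # expected to redirect stdout to output_path.
        return ["mafft", *flags, input_path]
    if aligner == "muscle":
        # MUSCLE v3.8 takes explicit input and output files.
        return ["muscle", "-in", input_path, "-out", output_path, *flags]
    raise ValueError(f"unknown aligner: {aligner}")


print(" ".join(build_command("muscle", "two_iterations", "in.fa", "out.fa")))
# muscle -in in.fa -out out.fa -maxiters 2
```

An empty flag list corresponds to the default configuration described above.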

Unfortunately, the alignment produced with the default parameter configuration is not always the best choice. The main reason is that the default parameters are those that gave the developers the best average accuracy on their own training sets, which differ for each aligner. The parameter configuration affects not only the alignment accuracy but also the required runtime. To address this drawback, we propose a framework that, depending on the biological characteristics of the input set of sequences, searches a pre-computed file for the most suitable aligner and parameter configuration found for another set of sequences with similar biological characteristics. A preliminary version appears in [19]. In this way, we are able to improve not only the accuracy and conservation of the final alignment [20], but also the required runtime.
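The lookup idea can be sketched as a nearest-neighbour search. Everything specific here is an assumption made for illustration: the three features (number of sequences, mean length, length standard deviation), the table entries, the configuration labels and the scaled Euclidean distance are hypothetical and are not the characteristics or the file format used by the framework.

```python
import math
from statistics import mean, pstdev


def extract_features(seqs):
    """Hypothetical descriptors: sequence count, mean length, length spread."""
    lengths = [len(s) for s in seqs]
    return (len(seqs), mean(lengths), pstdev(lengths))


# Hypothetical pre-computed entries: features -> best {aligner, configuration}.
PRECOMPUTED = [
    ((5, 100.0, 5.0), ("MUSCLE", "config_a")),
    ((50, 400.0, 60.0), ("MAFFT", "config_b")),
    ((200, 250.0, 20.0), ("Kalign2", "config_c")),
]


def select_tuple(seqs, table=PRECOMPUTED):
    """Return the tuple stored for the most similar pre-computed dataset."""
    query = extract_features(seqs)
    # Scale each feature by its largest value in the table so that no single
    # feature dominates the distance.
    scales = [max(row[0][i] for row in table) or 1.0
              for i in range(len(query))]

    def distance(features):
        return math.dist([q / s for q, s in zip(query, scales)],
                         [f / s for f, s in zip(features, scales)])

    return min(table, key=lambda row: distance(row[0]))[1]


# Hypothetical input: five short sequences of similar length.
toy_input = ["A" * 100] * 4 + ["A" * 110]
print(select_tuple(toy_input))  # ('MUSCLE', 'config_a')
```

Since the selection only requires a feature extraction and a table search, its cost is small compared with running several aligners and keeping the best result.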

Multiobjective Optimization has drawn much attention in recent years, providing a number of successful results [9], [10], [24], [25]. Therefore, to create the pre-computed file used by the framework, we have used Multiobjective Optimization and Evolutionary Computation jointly. Three well-known multiobjective approaches have been studied: Fast Non-dominated Sorting Genetic Algorithm (NSGA-II, [4]), Multiobjective Evolutionary Algorithm based on Decomposition (MOEA/D, [32]) and the Indicator-Based Evolutionary Algorithm (IBEA, [34]). Given a set of unaligned sequences, we obtain a set of non-dominated tuples {aligner, configuration} that simultaneously optimize the alignment accuracy (quality and conservation) and the runtime required to obtain it. In summary, the major contributions of this work are:

  • Given a set of unaligned sequences, we propose a framework that makes use of the most suitable aligner with its related parameters configuration found for another set of sequences with similar biological characteristics.

  • A multiobjective study comparing the most representative algorithms in the multiobjective domain (NSGA-II, MOEA/D and IBEA).

  • A comparative study between the proposed framework and the most important aligners published in the literature. In the comparative study, we use a total of five benchmark suites: BAliBASE 3.0, PREFAB 4.0, SABmark 1.65, OX-Bench and CDD 3.14.

The paper is organized as follows. Section 2 explains the biological characteristics of an input set of unaligned sequences, the multiobjective optimization problem of obtaining the best aligner and parameter configuration for a given set of unaligned sequences, and the framework itself. Section 3 presents the experimental results, which are divided into two studies: a multiobjective optimization study and a comparative study with other approaches published in the literature. Finally, conclusions and future lines of work appear in Section 4.

Method

This section is divided into three subsections. First, we define and explain the biological characteristics used to biologically describe any input set of unaligned sequences. Second, we present the multiobjective optimization problem of selecting a set of tuples {Aligner, Configuration} that optimize the alignment of an input set of unaligned sequences. Finally, we describe the flowchart of the proposed framework.
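The notion of a set of tuples that optimize the alignment can be illustrated with a non-dominated filter. This is a simplified sketch under assumptions: it uses only two objectives (a single accuracy score to maximize and a runtime to minimize), and the candidate tuples and their values are hypothetical. It shows the Pareto dominance relation only, not NSGA-II, MOEA/D or IBEA themselves.

```python
def dominates(a, b):
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a["accuracy"] >= b["accuracy"] and a["runtime"] <= b["runtime"]
    better = a["accuracy"] > b["accuracy"] or a["runtime"] < b["runtime"]
    return no_worse and better


def non_dominated(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical evaluations of four {aligner, configuration} tuples.
candidates = [
    {"tuple": ("MAFFT", "cfg1"), "accuracy": 0.90, "runtime": 10.0},
    {"tuple": ("MUSCLE", "cfg2"), "accuracy": 0.85, "runtime": 4.0},
    {"tuple": ("MUSCLE", "cfg3"), "accuracy": 0.84, "runtime": 6.0},
    {"tuple": ("Kalign2", "cfg4"), "accuracy": 0.70, "runtime": 1.0},
]

front = non_dominated(candidates)
print([c["tuple"] for c in front])
# [('MAFFT', 'cfg1'), ('MUSCLE', 'cfg2'), ('Kalign2', 'cfg4')]
```

The third candidate is discarded because the second one is both more accurate and faster; the remaining three represent different trade-offs between accuracy and runtime.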

Experimental results

In this section we present different experiments in order to show the advantages of using the proposed framework for solving the MSA problem. The section is divided into two main subsections.

To assess the quality of the framework, we have used three aligners: Kalign2 [13] (v2.03), MAFFT [11] (v7.215) and MUSCLE [7] (v3.8). We have selected these aligners because they present a well-balanced behavior in terms of alignment accuracy and runtime. In Table 2, we present, for each aligner, a

Conclusions and future work

In this work, we propose a framework that extracts the biological characteristics of an input set of unaligned sequences and uses this knowledge to decide which is the most suitable aligner and parameter configuration. We refer to it as Multiple Aligner Framework (MAF). The selection of the tuple {Aligner, Configuration} is based on searching, in a pre-computed file, the best tuple for a dataset with similar biological characteristics. To create this file, we have used multiobjective

Acknowledgment

This work was partially funded by the Spanish Ministry of Economy and Competitiveness and the ERDF (European Regional Development Fund), under the contract TIN2016-76259-P (PROTEIN project). Álvaro Rubio-Largo is supported by the post-doctoral fellowship SFRH/BPD/100872/2014 granted by Fundação para a Ciência e a Tecnologia (FCT), Portugal.

References (36)

  • C.B. Do et al., ProbCons: probabilistic consistency-based multiple sequence alignment, Genome Res. (2005)
  • R. Doolittle, Similar amino acid sequences: chance or common ancestry?, Science (1981)
  • R.C. Edgar, MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res. (2004)
  • D. Feng et al., Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. (1987)
  • K. Katoh et al., MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res. (2002)
  • M. Kimura, A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences, J. Mol. Evol. (1980)
  • T. Lassmann et al., Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features, Nucleic Acids Res. (2009)
  • Y. Liu et al., MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities, Bioinformatics (2010)