Handling biological sequence alignments on networked computing systems: A divide-and-conquer approach

https://doi.org/10.1016/j.jpdc.2009.04.014

Abstract

In this paper, we address the biological sequence alignment problem, one of the most commonly used steps in several bioinformatics applications. We employ the Divisible Load Theory (DLT) paradigm, which is suitable for handling large-scale processing on network-based systems, to achieve a high degree of parallelism. Using the DLT paradigm, we propose a strategy in which we carefully partition the computational workload among the processors in the system so as to minimize the overall time for determining the maximum similarity between DNA/protein sequences. We consider handling such a computational problem on networked computing platforms connected as a linear daisy chain. We derive the individual load quantum to be assigned to each processor according to the computation and communication link speeds along the chain. We consider two cases of sequence alignment in which post-processing, i.e., the trace-back required to determine an optimal alignment, may or may not be done at the individual processors in the system. We derive critical conditions that determine whether our strategies can yield an optimal processing time. When these conditions cannot be satisfied, we apply three different heuristic strategies proposed in the literature to generate sub-optimal processing times. To test the proposed schemes, we use real-life DNA samples of the house mouse mitochondrion and the human mitochondrion, obtained from the public database GenBank [GenBank, http://www.ncbi.nlm.nih.gov], in our simulation experiments. Through this study, we conclusively demonstrate the applicability and potential of the DLT paradigm for such biological sequence related computational problems.

Introduction

Designing a scheduler for network-based systems is often a challenging task. More specifically, when the underlying network is heterogeneous in terms of the resources used for computing, storage, and communication, issues such as processor computational speed/power and memory capacity play a crucial role. We consider the problem of aligning biological sequences, a problem that is often an imperative step in several bioinformatics applications. The process of aligning two or more sequences is often an unavoidable step in quantifying the quality of the samples under consideration. For instance, in protein structure prediction and structure comparison methods, sequence alignment for maximum similarity score is often one of the crucial steps [15], [3]. In aligning biological sequences, residues can be inserted, deleted, or substituted in either of the two sequences to obtain the optimum alignment [20].

Since sequence alignment has been shown to be a computationally complex problem, there have been several attempts to improve its time performance. In an early work by Needleman and Wunsch [23], the algorithm is shown to have a complexity of O(x²), where x is the length of the sequences. This algorithm was subsequently improved by Sellers [27] and generalized by Smith and Waterman [29], [28]. The Smith–Waterman (SW) algorithm, on the other hand, has a complexity of O(x³) but was later improved by Gotoh [19] to just O(x²). Although this is acceptable for a two-sequence problem, the SW algorithm often needs to be run multiple times, and this clearly discourages its use in handling the current-day volume of data from public databases such as [18], [16], [12].
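As a concrete illustration of the O(x²) dynamic-programming recurrence discussed above, the following minimal Python sketch fills a Smith–Waterman scoring matrix with a linear gap penalty and returns the maximum local-alignment score. The scoring parameters (match = 2, mismatch = −1, gap = −1) are illustrative choices, not values taken from this paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the maximum local-alignment score between strings a and b,
    filling an O(len(a) * len(b)) scoring matrix (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # S[i][j]: best score of a local alignment ending at a[i-1], b[j-1];
    # the first row and column stay 0 (local alignment can start anywhere).
    S = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are clamped at 0, which is what makes the alignment local.
            S[i][j] = max(0, diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
            best = max(best, S[i][j])
    return best
```

A trace-back over the same matrix, starting from the cell holding `best`, would recover the optimal local alignment itself.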

To reduce the computational complexity further, several heuristics were proposed. These include FASTP [21], FASTA [26], [25], BLAST [2], and variants of BLAST such as mpiBLAST [17] and FLASH [8], to name a few. Further, some of the variants of these methods do not generate the complete SW matrix, which can be used to detect multiple subsequence similarities.

In [33], a speculative strategy was presented for multi-sequence alignment. The strategy exploits the independence between alignment group pairs in the Berger–Munson [6] algorithm. It achieves speed-up by processing multiple iterations in parallel with the speculation that the current iteration is independent of the previous one. Nevertheless, owing to the way this strategy works, the number of processors that can be utilized is limited, and the speed-up depends on the similarities of the sequences.

In general, much of the challenge lies in handling multiple sequence alignments (MSAs). In order to handle MSAs, several clustering strategies have been proposed [32], while a pair-wise sequence alignment process takes place at a fundamental layer before the clustering strategies are applied. Over the last few years, several new scoring models, which are key to assessing the quality of an alignment, have been proposed. These include the segment-based evaluation implemented in Dialign [22]; the consistency-based objective function of T-Coffee [24]; and methods based on Hidden Markov Models (HMMs) [13]. The T-Coffee method has recently been parallelized [34]. In the parallel T-Coffee approach, the important steps, library generation and progressive alignment, have been parallelized and can be executed in a distributed manner. To derive an efficient solution, a hybrid strategy, referred to as MUSCLE, is proposed which combines both a progressive and an iterative search [14]. MUSCLE is shown to differ significantly from the working style of typical progressive methods. With current-day multi-core microprocessors, MSA problems are expected to gain enormous speed-ups, although inter-core communication needs to be handled carefully so as not to sacrifice the gain. A good example is the IBM Cell Broadband Engine (CBE), which has 8 cores to handle such computationally intensive tasks. The work in [1] reports a multi-core implementation using two approaches: SIMD and wavefront parallelization.

A very recent attempt at using the DLT paradigm for the biological sequence alignment problem was first demonstrated in [30] on bus-based systems, one of the most common small-scale network infrastructures found in organizations. In this paper, we extend this approach to the linear network case. Since the Grid computing paradigm offers enormous computing resources, a compilation in [4] describes how Grid computing technology handles computationally intensive biological problems.

Our contributions in this paper are novel to the literature of bioinformatics as well as to the domain of Divisible Load Theory (DLT). In this paper, our interest lies in exploring the possibility of utilizing computing resources available on networked computing platforms, such as public domains like the Internet or dedicated cooperative networks, for handling large-scale data processing. To this end, we present a network-based strategy employing a chain of processors and utilize the DLT paradigm [7] to achieve a high degree of parallelism by pipelining the computational process. As a case study, we employ an improved version of the popular Smith–Waterman algorithm with trace-back in the design of our strategy for sequence comparisons. In our strategy, we exploit the advantages of the independent links on such a linear chain network and also partition the computational space to achieve a high degree of parallelism. We derive the amount of computational load quantum to be assigned to each processor according to its computation and communication speeds, in accordance with the DLT. We also derive an important condition to check whether an optimal distribution strategy can be guaranteed. In the case where optimal distribution is infeasible, we utilize three different heuristic strategies introduced in [30] to distribute the load in a sub-optimal manner. Thus the scope of the work is in designing efficient strategies for handling such compute-intensive biological workloads on public networks and in performing rigorous simulation experiments to quantify the effectiveness of our strategies in various situations.
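To give a flavour of how DLT-style load fractions are derived, the sketch below uses a simplified pairwise timing balance, α_i·w_i·T_cp = α_{i+1}·(z_{i+1}·T_cm + w_{i+1}·T_cp), so that adjacent processors stop computing at the same instant. This is an assumed single-installment model for illustration only; the paper's actual recursion for a pipelined linear chain (with or without trace-back) differs in how communication delays accumulate along the chain.

```python
def dlt_fractions(w, z, Tcp=1.0, Tcm=1.0):
    """Compute normalized load fractions alpha_i for m processors.

    w[i]: computation time per unit load on processor i.
    z[i]: communication time per unit load on the link feeding processor i
          (z[0] is unused: processor 1 holds the load initially).
    Tcp, Tcm: computation and communication intensity constants.

    Uses the pairwise balance
        alpha_i * w_i * Tcp = alpha_{i+1} * (z_{i+1} * Tcm + w_{i+1} * Tcp)
    so adjacent processors finish computing simultaneously.
    """
    m = len(w)
    alpha = [1.0]  # unnormalized fraction for processor 1
    for i in range(m - 1):
        ratio = (w[i] * Tcp) / (z[i + 1] * Tcm + w[i + 1] * Tcp)
        alpha.append(alpha[-1] * ratio)
    total = sum(alpha)  # normalize so the fractions sum to the whole load
    return [a / total for a in alpha]
```

With zero communication cost and identical processors, the fractions collapse to an equal split, which is the expected sanity check for any DLT closed form.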

It may be noted that our strategies can be easily tuned for use with approaches such as the Needleman–Wunsch algorithm or other similar algorithms. One of the advantages of our approach is that we are able to efficiently utilize heterogeneous processors, as we determine the amount of load to be assigned to each processor such that the degree of parallelism does not deteriorate. To test the proposed schemes, we use real-life DNA samples of the house mouse mitochondrion (Mus musculus mitochondrion, NC_001569), consisting of 16,295 residues, and the DNA of the human mitochondrion (Homo sapiens mitochondrion, NC_001807), consisting of 16,571 residues, obtainable from GenBank [18], in our rigorous simulation experiments. Thus we conclusively demonstrate the applicability and usefulness of the DLT paradigm for such biological sequence related problems.

The organization of the paper is as follows. In Section 2, we briefly introduce some preliminary knowledge and formally define the problem we address. We then present our strategy for the problem in Section 3; in that section, we also derive certain conditions that need to be satisfied in order to guarantee an optimal solution. In cases where these conditions cannot be satisfied, we resort to heuristic strategies, and we present three such strategies in Section 4. In Section 5, we discuss the performance of our strategy using a rigorous simulation study. Finally, we conclude the paper in Section 6 and discuss some possible future extensions of this work.

Section snippets

Preliminaries and problem formulation

In this section we provide a brief description of the background material so that non-bioinformatics researchers can quickly understand the techniques used in this study. Parts of the following background material also appear in our earlier publication [30]; for the sake of continuity, we present them here.

Design and analysis of parallel processing strategy

In this section, we present our multiprocessor strategy for linear networks. Firstly, let us consider the distribution of the task of generating the S matrix. It may be noted that when we mention generating the S matrix, we also take into account the generation of the other two required matrices, h and f, since generating S_{x,y} demands computing the entries h_{x,y} and f_{x,y} as well. The S matrix is partitioned into sub-matrices L_{i,k}, i = 1, …, m, k = 1, …, Φ, where each sub-matrix comprises a
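A minimal sketch of the partitioning idea, assuming the S matrix is split into contiguous vertical strips whose widths are proportional to precomputed load fractions. The function name and the proportional-rounding scheme are ours, for illustration; the paper's L_{i,k} partitioning additionally tiles each processor's strip into Φ installments to enable pipelining.

```python
def partition_columns(n_cols, fractions):
    """Split n_cols columns of the similarity matrix into contiguous
    half-open blocks (start, end), one per processor, with block widths
    proportional to the given load fractions (which must sum to 1)."""
    bounds, start, acc = [], 0, 0.0
    for f in fractions[:-1]:
        acc += f
        end = round(acc * n_cols)  # cumulative rounding keeps widths fair
        bounds.append((start, end))
        start = end
    bounds.append((start, n_cols))  # last processor absorbs any remainder
    return bounds
```

Because the DP recurrence for column j depends only on columns j and j−1, a processor that owns one strip only needs the boundary column of its left neighbour before it can start, which is what makes the pipelined distribution along the chain possible.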

Heuristic strategies

In real life, depending on the values of the processor and link speed parameters, conditions (9) and (13) may not be satisfied. In such cases, we apply three heuristic strategies introduced in our earlier work [30]; of course, we rederive the respective conditions for the linear network case. These heuristics generate sub-optimal solutions for the processing time.

Performance evaluation and discussions

In order to evaluate the performance of our strategy, we performed rigorous simulation experiments comparing the processing time of our strategy with a direct single-machine (non-parallel) implementation of the Smith–Waterman algorithm. We define the speed-up as Speed-up = T(1)/T(m), where T(m) is the processing time of our strategy on a system using m processors and T(1) is the processing time using a single processor, given by T(1) = αβE₁. As mentioned in Section 2.4, in this
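The speed-up metric above can be computed directly. The helpers below assume that α and β denote the two sequence lengths and E₁ the per-matrix-entry computation time on the single reference processor, which is our reading of T(1) = αβE₁; the symbol names are carried over from the text.

```python
def single_processor_time(alpha, beta, E1):
    """T(1) = alpha * beta * E1: time to fill an alpha x beta similarity
    matrix at E1 seconds per entry on one processor (assumed reading)."""
    return alpha * beta * E1

def speedup(T1, Tm):
    """Speed-up = T(1) / T(m)."""
    return T1 / Tm
```

For the mitochondrial sequences used in the experiments (16,295 × 16,571 entries), even a modest per-entry cost makes the single-processor baseline large, which is what motivates the multiprocessor distribution.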

Conclusions

In this paper, we considered the problem of aligning biological sequences, an important step in several bioinformatics applications. The process of aligning two or more sequences is often an imperative step in quantifying the quality of the samples under consideration. We proposed an efficient multiprocessor solution on loosely coupled networked computing platforms wherein a chain of processors is involved. We have fully exploited the advantages of the independent links in such chained

References (35)

  • Dennis A. Benson et al., GenBank, Nucleic Acids Research (2000)
  • M.P. Berger et al., A novel randomized iteration strategy for aligning multiple protein sequences, Computer Applications in the Biosciences (1991)
  • V. Bharadwaj et al., Scheduling Divisible Loads in Parallel and Distributed Systems (1996)
  • A. Califano, I. Rigoutsos, FLASH: A fast look-up algorithm for string homology, in: Proceedings of the First...
  • M. Dayhoff et al., A model of evolutionary change in proteins, Atlas of Protein Sequences and Structure (1978)
  • D. Debasish et al., Load partitioning and trade-off study for large matrix-vector computations in multicast bus networks with communication delays, Journal of Parallel and Distributed Computing (1998)
  • L.H. Diana et al., On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures, Journal of Parallel and Distributed Computing (2007)

Veeravalli Bharadwaj, Member, IEEE & IEEE-CS, received his B.Sc. in Physics from Madurai-Kamaraj University, India in 1987, Master's in Electrical Communication Engineering from the Indian Institute of Science, Bangalore, India in 1991, and Ph.D. from the Department of Aerospace Engineering, Indian Institute of Science, Bangalore, India in 1994. He did his post-doctoral research in the Department of Computer Science, Concordia University, Montreal, Canada, in 1996. He is currently with the Department of Electrical and Computer Engineering, Communications and Information Engineering (CIE) division, at The National University of Singapore, Singapore, as a tenured Associate Professor. His mainstream research interests include multiprocessor systems, Cluster/Grid/Cloud computing, scheduling in parallel and distributed systems, bioinformatics and computational biology, and multimedia computing. He is one of the earliest researchers in the field of divisible load theory (DLT). He has published over 100 papers in high-quality international journals and at conferences. He has successfully secured several externally funded projects. He has co-authored three research monographs in the areas of PDS, Distributed Databases (competitive algorithms), and Networked Multimedia Systems, in the years 1996, 2003, and 2005, respectively. He guest edited a special issue on Cluster/Grid Computing for IJCA in 2004. He has served as a program committee member and as a Session Chair at several international conferences. He is currently serving on the Editorial Board of IEEE Transactions on Computers, IEEE Transactions on SMC-A, and International Journal of Computers & Applications, USA, as an Associate Editor. Bharadwaj Veeravalli's complete academic career profile can be found at http://cnds.ece.nus.edu.sg/elebv.

Wong Han Min received his B.Eng. (Hons) in electronics engineering from the University of Nottingham, UK, in 2000, and his Ph.D. from the Department of Electrical and Computer Engineering at the National University of Singapore, in 2004. He did his first post-doctoral research in the School of Biosciences at the University of Exeter, UK, in 2007. He is currently a research associate at the Cavendish Laboratory at the University of Cambridge. His research interests include algorithm design, databases, bioinformatics, and nucleic acids.

