
1 Introduction

High performance computation has long been in demand in bioinformatics and computational biology, since the volume of biological databases has been growing explosively in recent years. To achieve high performance in bioinformatics applications, researchers have proposed diverse approaches to accelerating widely used tools, including software- and hardware-oriented approaches [1,2,3,4,5,6,7]. In addition, recent clustered and cloud systems allow non-expert users to attain the power of parallel systems without writing specialized parallel code (e.g., OpenMP, MPI) by simply running multiple copies of an application on divided data through a job submission queuing system such as Condor [8, 9]. Regardless of the system environment, whether stand-alone, clustered/cloud, or specialized, developing high performance versions of applications remains in demand and beneficial.

In this paper, we propose and describe an efficient high performance approach to accelerating the RNA sequence searching tool Infernal [10], more specifically the cmsearch module in the Infernal package, based on task parallel and pipelined parallel processing methodologies. We call this the ‘pipelined task parallel’ computation model. The Infernal cmsearch module is implemented as a filter pipeline consisting of seven major filters [10]. Each stage of the pipeline is designed to reduce the data to be processed by the following stages. Even with this series of filtering operations, however, the execution time of the cmsearch module remains considerable in many cases. To address this, Infernal cmsearch provides MPI and Pthread mode options, which are based on a straightforward data parallel methodology, i.e., providing multiple copies (paths) of the serial pipeline with divided input data. This simple and static data parallel approach cannot exploit highly efficient parallelism, however, since the code in each pipeline path is the same as the original serial code. This is the motivation of our research: our aim is to parallelize the serial operations within the filtering pipeline. The proposed ‘pipelined task parallel’ computation model can be deployed either on a stand-alone multicore workstation/server or on a clustered/cloud system in which each node is a multicore machine. The proposed approach is also suitable for integration with the straightforward data parallel approach used in the cmsearch code. In our current practice, the proposed approach is implemented with Pthreads and OpenMP, and sophisticated load balancing methodologies are developed and used. The stages of the filtering pipeline are executed in a pipelined parallel manner, i.e., after the initial trigger time slices, all stages of the pipeline work in parallel, each on a different segment of the input data, a kind of systolic processing.

The rest of this paper is organized as follows. In Sect. 2, we briefly review related work on the Infernal tool and background knowledge. In Sect. 3, a detailed description of the cmsearch program is provided. In Sect. 4, our proposed approaches to achieving maximum parallelism are described in detail. In Sect. 5, experimental results and discussion are provided, and Sect. 6 concludes the paper.

2 Related Work and Background

Before the emergence of Infernal, heuristic sequence homology searching tools such as FASTA, BLAST and BLAT had already appeared, but their alignment processing is based on the RNA primary structure, which cannot detect homology in a group of samples that share only homologous secondary structures. This became the motivation for developing the Infernal tool [10]. Inspired by the method of formalizing linear sequence alignment proposed in [11], Infernal uses a structural RNA model as the query.

The first version of the Infernal package was released in 2002 [12]. In this tool, a memory-efficient dynamic programming algorithm is used with a divide-and-conquer strategy [12]. To achieve a better time complexity, a query dependent banded (QDB) method was introduced in Infernal in 2007, reducing the average time complexity from \(O(LN^{2.4})\) to \(O(LN^{1.3})\), where L and N are the numbers of base pairs in the query and database sequences, respectively [13]. Researchers considered Infernal 1.0 (released in 2009) the first reasonably complete version of the tool [14, 15]. In this version, two levels of filtering operations are used to reduce the data to be processed. The HMM (Hidden Markov Model) based filtering described in [16] is used as the primary filter and the QDB CYK maximum likelihood search algorithm as the secondary filter. Subsequences that pass through the two filters are searched with the Inside algorithm [17] for the final alignment. Although later versions provide considerably improved performance, the Infernal cmsearch program remains a computationally expensive module. The current version of Infernal cmsearch (version 1.1) is claimed to be 100-fold faster than the previous version [10, 18]. This improvement is achieved by a new filtering pipeline, which takes advantage of profile HMM based filtering operations originating from the HMMER3 project [19] together with banded covariance model based dynamic programming algorithms [20].

2.1 Covariance Model Based Alignment

The Infernal cmsearch module processes two input files: a CM (covariance model) file representing an RNA secondary structure and a database sequence file in FASTA format. A CM file consists of both profile HMM and CM representations derived from the consensus structure of the multiple RNA sequence alignments found in an RNA family (refer to the Rfam CM models [21]). The Infernal package provides a module named ‘cmbuild’ for this purpose, and how a CM file is constructed is described in detail in [10]. The cmsearch program scans the database sequence file and aligns it to the profile HMM and the CM provided in the CM file using multiple dynamic programming algorithms. In this subsection, we briefly describe the structure of a CM file.

A CM is a specific repetitive profile SCFG architecture that models the consensus secondary structure of an RNA family. Each CM has a group of states that are mapped to consensus and non-consensus positions of the RNA secondary structure derived from a given set of multiple sequence alignments in an RNA family. The final CM is a list of states connected in a directed graph in which each state has both position-specific emission and transition probability scores [10]. A profile HMM is a repetitive HMM architecture that associates each consensus column of a multiple alignment with a single type of model node. A profile HMM can be thought of as a special case of a CM.

Each CM file built by the cmbuild module consists of both CM and profile HMM sections, each of which contains header and states parts, as illustrated in Fig. 1. The CM header part includes metadata such as the model name, RNA family, number of nodes, number of states, etc. The CM states part contains the representation of the entire model: each node of the CM is denoted by a line with the node type and node index, followed by several lines of state information, where each state line consists of the state type, state index, indices of the parent/child states it can transition to/from, bit scores, etc. Analogously, the profile HMM header includes HMM related metadata. Finally, in the profile HMM states part, each node has a line of match emission scores and a line of insertion emission scores and state transition scores. The cmsearch program code is tightly coupled to the node and state data in the CM file.

Fig. 1. CM file structure
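To make the layout above concrete, the following is a minimal, hypothetical sketch of how the two sections of a CM file could be represented in memory. The field names simply mirror the description above; the actual Infernal parser is written in C and its data structures differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CMState:
    state_type: str                  # e.g., MP, ML, MR, D, IL, IR, S, B, E
    state_index: int
    parent_indices: List[int]        # states that can transition into this state
    child_indices: List[int]         # states this state can transition to
    transition_scores: List[float]   # bit scores
    emission_scores: List[float]     # bit scores

@dataclass
class CMNode:
    node_type: str                   # e.g., MATP, MATL, MATR, BIF, ROOT, END
    node_index: int
    states: List[CMState] = field(default_factory=list)

@dataclass
class ProfileHMMNode:
    match_emissions: List[float]     # one line per node in the HMM states part
    insert_emissions: List[float]
    transition_scores: List[float]

@dataclass
class CMFile:
    cm_header: Dict[str, str]        # model name, RNA family, #nodes, #states, ...
    cm_nodes: List[CMNode]
    hmm_header: Dict[str, str]       # profile HMM related metadata
    hmm_nodes: List[ProfileHMMNode]
```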

3 Cmsearch Filtering Pipeline

The filtering pipeline used in the cmsearch program consists of profile HMM based filtering stages and CM based filtering stages. The profile HMM stages use a combination of the Viterbi, forward and backward algorithms to align the sequence with the profile HMM and find qualified sequence fragments, i.e., fragments that pass the threshold values set by the program. All profile HMM stages are considerably more efficient than the CM based stages since a profile HMM structure is much simpler than its equivalent CM structure. Thus, the profile HMM stages are positioned in the front part of the pipeline to prune a relatively large volume of subsequences from the target sequences and thereby reduce the processing time of the following, more complex CM based stages. Figure 2 illustrates the filtering pipeline, in which seven filters are organized into four stages. We briefly describe the operation performed by each filter in the pipeline (a generic sketch of such a data-reducing stage follows Fig. 2); more detailed descriptions can be found in [10].

Fig. 2. Infernal cmsearch filtering pipeline
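As a minimal illustration of the data-reducing behaviour shared by all of the filters, the sketch below wraps a hypothetical per-fragment scoring function and a threshold into a stage function; stage functions of this shape are what the pipelined model of Sect. 4 wires together. The names are illustrative only and do not correspond to Infernal's C code.

```python
from typing import Callable, List, Tuple

# A fragment is whatever unit a stage operates on (a window, an envelope, ...),
# represented abstractly here as a (sequence, start, end) tuple.
Fragment = Tuple[str, int, int]

def make_stage(score_fn: Callable[[Fragment], float],
               threshold: float) -> Callable[[List[Fragment]], List[Fragment]]:
    """Build a filter stage: score every incoming fragment and keep only those
    whose bit score passes the stage's threshold, so each stage hands a
    reduced amount of data to the next one."""
    def stage(fragments: List[Fragment]) -> List[Fragment]:
        return [f for f in fragments if score_fn(f) >= threshold]
    return stage
```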

3.1 Stage 1

Stage 1 consists of the front three filters, which are implemented with vectorized operations [10, 22].

Local Scanning SSV Filter. A local un-gapped Viterbi algorithm is used in this filter to find a set of high-scoring subsequences, called ‘windows’, in a given input sequence. Each alignment can start at any position in the model and the sequence, and overlapping subsequences are merged (SSV stands for single-segment Viterbi). Very long windows are split into multiple windows [10] and all qualified windows are passed to the next filter.

Local Viterbi Filter. In this filter, each forwarded window is aligned with the profile HMM using the local Viterbi algorithm to find the optimal gapped alignment, i.e., the maximum likelihood state path through the profile HMM. Note that this filter runs in local mode since each alignment can start/end at any position in the model. The same applies to the next filter.

Local Forward Filter. Windows that passed the local Viterbi filter are aligned with the profile HMM via a local forward algorithm in this filter. With the forward algorithm, the likelihood score of a given sequence under the profile HMM is computed by summing the probabilities of all possible paths through the model.
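For reference, the recursion behind the forward algorithm is sketched below for a generic HMM; this is the textbook formulation, not the vectorized profile HMM implementation used by Infernal. It sums over all state paths, whereas the Viterbi filter above takes the maximum.

```python
import numpy as np

def forward_likelihood(obs, init, trans, emit):
    """Textbook forward algorithm for a generic HMM with S states, alphabet size A.

    obs   : observation indices, length T
    init  : initial state probabilities, shape (S,)
    trans : transition probabilities, shape (S, S)
    emit  : emission probabilities, shape (S, A)
    Returns P(obs | model), summed over all state paths.
    """
    alpha = init * emit[:, obs[0]]               # alpha_1(i) = pi_i * e_i(o_1)
    for t in range(1, len(obs)):
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * e_j(o_t)
        alpha = (alpha @ trans) * emit[:, obs[t]]
    return float(alpha.sum())                    # Viterbi would take max instead of sum
```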

3.2 Stage 2

In this stage, forward and backward algorithms based on 2-D dynamic programming are used to detect shorter high-scoring subsequences within the windows.

Glocal Forward Parser Filter. The full forward algorithm is used to align the remaining (qualified) windows to the profile HMM in ‘glocal’ mode, i.e., global with respect to the profile HMM but local to the window: each alignment can start/end at any position in the sequence window.

Glocal Envelope Definition Filter. From the windows that passed the glocal forward parser filter, shorter hits are identified in this filter. First, the glocal backward algorithm is applied to the windows. Second, by combining its results with those of the glocal forward algorithm, the posterior probability that a hit starts or ends at each position of a target window is computed. Lastly, shorter hits carrying the most significant probability mass, called ‘envelopes’, are extracted from the windows and passed to the next stage.

3.3 Stage 3

HMM Banded CM CYK Filter. A banded version of the CYK algorithm [21], based on 3-D dynamic programming, is used to determine the bit score of the maximum likelihood alignment, consistent with the HMM-derived bands, of any subsequence within an envelope to the CM. This bit score is checked against a threshold value and only qualifying envelopes are passed to the next stage. The time complexity of this stage (and the next) is relatively high since the CM model is processed, but the number of input envelopes has been greatly reduced by the preceding light-weight HMM based filters.

3.4 Stage 4

HMM Banded CM Inside Filter. The full likelihood of each profile/sequence envelope is evaluated in this filter using the CM Inside algorithm [17], which sums over all possible alignments of the envelope to the CM. As in the CYK filter, HMM bands are used to constrain the CM dynamic programming calculations. This filter identifies non-overlapping high-scoring alignments and adds them to the final output.

4 Proposed Approaches: Pipelined Task Parallel Computation Models

In this section, we describe our proposed approaches to exploiting maximum parallelism within the filtering pipeline using multiple threads. Since the proposed approaches accelerate the operation of the filtering pipeline internally, i.e., within the pipeline, they can be integrated without difficulty with the original cmsearch data parallel approach, or any other data parallel approach, for further performance gain.

4.1 Pipelined Task Parallel (PTP) Computation Model

Our first step is to develop a ‘pipelined parallel’ computation model within the filtering pipeline of the cmsearch program. This strategy is inspired by the hardware pipelining used in the datapath of modern processors. In this parallel computation model, each stage of the filtering pipeline is assigned to a thread and all threads work in parallel after the initial trigger time of the depth-4 pipeline (refer to Fig. 2). Since the task of each pipeline stage differs from the others’, we call this model the ‘pipelined task parallel’ computation model, hereafter the ‘PTP computation model’. Input database sequences are divided into small blocks and the divided blocks enter the pipeline in a systolic manner. In the ideal case in which the processing times of the four stages are identical, this model can achieve a 4-fold performance gain. In reality, however, the time complexities of the four stages differ and the actual speedup depends tightly on the most time consuming stage. The graph in Fig. 3 reveals this. In our preliminary study with 200 randomly selected cmsearch jobs (200 CM files with a fixed database), stage 3 (the HMM banded CM CYK filter) usually takes the longest time, as shown in the figure, where 100 of the 200 cmsearch jobs are plotted in ascending order of total execution time. Thus, we need to employ a load balancing mechanism in the pipeline.
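The dataflow of the PTP model can be sketched as follows. This is a minimal, hypothetical illustration using Python threads and queues; the actual implementation is in C with Pthreads/OpenMP (Python threads would not give CPU-bound speedup because of the interpreter lock), and the stage functions stand in for the four stages of Fig. 2.

```python
import queue
import threading

SENTINEL = object()   # marks the end of a stage's input stream

def stage_worker(stage_fn, in_q, out_q):
    """One pipeline stage: consume data blocks, filter them, forward survivors."""
    while True:
        block = in_q.get()
        if block is SENTINEL:
            break
        filtered = stage_fn(block)     # stand-in for the SSV/Viterbi/CYK/Inside work
        if filtered:                   # only surviving subsequences move on
            out_q.put(filtered)

def run_ptp(stage_fns, input_blocks):
    """Wire the stages together with queues and stream the divided input through."""
    qs = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = [threading.Thread(target=stage_worker, args=(fn, qs[i], qs[i + 1]))
               for i, fn in enumerate(stage_fns)]
    for t in threads:
        t.start()
    for block in input_blocks:         # systolic feeding of small sequence blocks
        qs[0].put(block)
    for i, t in enumerate(threads):    # shut the stages down in pipeline order
        qs[i].put(SENTINEL)
        t.join()
    return list(qs[-1].queue)          # hits that survived all four stages
```

With identical per-stage costs this dataflow approaches the ideal 4-fold gain noted above; when one stage dominates, throughput degenerates toward the speed of that stage, which is what motivates the load balancing variants below.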

Fig. 3. Execution times of four stages in the cmsearch pipeline

4.2 PTP Computation Model with Static Load Balancing

Our first load balancing strategy is based on the observation partly illustrated in Fig. 3. In our practice of searching the Rfam [21] bacterial small RNA families against the divided HMP [24] database, we randomly selected 200 relatively time consuming cmsearch jobs and measured the time consumed by each stage. To resolve the pipeline bottleneck caused by stage 3 (the CYK filter), we introduced a parameter that sets the desired number of threads for stage 3. This model is illustrated in Fig. 4(a) and sketched below. Although this statically load balanced version yields much better performance than the original PTP computation model, it is still difficult to exploit the maximum parallelism because we lack information predicting the relative execution time differences among the stages of the pipeline. This became the motivation for developing the dynamic load balancing strategy described in the following subsection.
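Extending the previous sketch, static load balancing can be expressed as a per-stage thread count, e.g., three identical workers sharing stage 3's input and output queues. Again this is only an illustrative sketch (reusing stage_worker from Sect. 4.1), not the C implementation.

```python
def run_ptp_static(stage_fns, input_blocks, threads_per_stage=(1, 1, 3, 1)):
    """PTP with static load balancing: the default gives stage 3 (the CYK filter)
    three worker threads, as in Fig. 4(a), for six threads in total."""
    qs = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    pools = []
    for i, (fn, n) in enumerate(zip(stage_fns, threads_per_stage)):
        pool = [threading.Thread(target=stage_worker, args=(fn, qs[i], qs[i + 1]))
                for _ in range(n)]
        for t in pool:
            t.start()
        pools.append(pool)
    for block in input_blocks:
        qs[0].put(block)
    # Stage-by-stage shutdown: one sentinel per worker in a pool, and the pool is
    # joined before the next stage is signalled, so no surviving block is dropped.
    for i, pool in enumerate(pools):
        for _ in pool:
            qs[i].put(SENTINEL)
        for t in pool:
            t.join()
    return list(qs[-1].queue)
```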

4.3 PTP Computation Model with Dynamic Load Balancing

Further tests with more diverse cmsearch jobs revealed cases in which stage 3 is not the longest stage. This is because the cmsearch execution time depends not only on the CM file being processed but also on the segment of the database being processed. To balance the load on the stages of the PTP computation model as evenly as possible, we need a heuristic prediction scheme that determines the relative weight of each stage upon receiving the input to the pipeline, i.e., a CM file and a database sequence file. Based on the prediction result for each cmsearch job, an appropriate number of threads is assigned to each stage of the pipeline dynamically, as illustrated in Fig. 4(b).
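The paper does not spell out how predicted stage times are turned into thread counts, so the following is only one plausible, hypothetical mapping: greedily hand out a fixed thread budget to whichever stage currently has the highest predicted per-thread load (the budget of 8 echoes the average thread count reported in Sect. 5).

```python
def allocate_threads(predicted_times, total_threads=8):
    """Distribute a thread budget over the pipeline stages in proportion to the
    predicted execution time of each stage; every stage keeps at least one thread."""
    counts = [1] * len(predicted_times)
    for _ in range(total_threads - len(predicted_times)):
        # give the next thread to the stage with the highest remaining per-thread load
        loads = [t / c for t, c in zip(predicted_times, counts)]
        counts[loads.index(max(loads))] += 1
    return counts

# Example: predicted stage times of 5, 9, 60 and 10 seconds and a budget of 8
# threads yield the allocation [1, 1, 5, 1], heavily favouring stage 3.
```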

Fig. 4. Concepts of pipelined task parallel (PTP) computation models (a) with static load balancing (b) with dynamic load balancing

Our prediction scheme is based on linear/polynomial regression. The initial linear regression model is shown below:

$$\begin{aligned} t=\theta _0+\theta _1s+\theta _2n+\theta _3c+\theta _4l+\theta _5k \end{aligned}$$
(1)

where t is the execution time to predict, s is the number of states in the CM, n is the number of nodes in the CM, c is the length of the consensus string (equal to the number of states in the profile HMM), l is the number of residues in the target sequence file, k is the number of sequences in the target sequence file, and {\(\theta _0, \theta _1, \theta _2, \theta _3, \theta _4, \theta _5\)} are coefficients. We collected the results of over 2,000 cmsearch job runs as the training dataset and extracted these dependent and independent variables. The next step is computing the set of coefficients {\(\theta _0, \theta _1, \theta _2, \theta _3, \theta _4, \theta _5\)} that minimizes the error between the actual and predicted execution time of each stage of the pipeline. We used the linear regression model provided in the Python machine learning library scikit-learn to build the regression model and evaluated it with the \(R^2\) (coefficient of determination) score, which ranges from 0 to 1; the closer the \(R^2\) score is to 1, the better the model fits the training data. This initial regression model yields good predictions of the execution time of stage 1 of the pipeline, i.e., the \(R^2\) score is over 0.8 on average, but fails to make good predictions for the other three stages.
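A minimal sketch of how such a per-stage model can be fitted with scikit-learn; the CSV file names are placeholders, and one model would be trained for each of the four pipeline stages.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Each row of X holds the independent variables of (1) for one training job,
# [s, n, c, l, k]; y holds the measured execution time of one pipeline stage.
X = np.loadtxt("stage_features.csv", delimiter=",")   # hypothetical training files
y = np.loadtxt("stage1_times.csv", delimiter=",")

model = LinearRegression().fit(X, y)                  # learns theta_1..theta_5 plus theta_0
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("R^2 on the training data:", r2_score(y, model.predict(X)))
```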

Since both the profile HMM and the CM are position-dependent statistical models, the performance of the cmsearch program heavily depends on the scoring model used in the CM file and on the residue distribution of the database sequence file. To improve the accuracy of our prediction scheme, we modify the initial regression model (1) as follows:

$$\begin{aligned} t=\theta _0+\theta _1s+\theta _2n+\theta _3c+\theta _4l+\theta _5k+\theta _6\,\mathit{nwin}+\theta _7\,\mathit{lwin} \end{aligned}$$
(2)

where nwin and lwin are the number of windows and the number of residues that pass through the local scanning SSV filter, and {\(\theta _6, \theta _7\)} are two new coefficients. With this modification, the average \(R^2\) scores of all stages increase, but the model is still under-fitting [23] in terms of the \(R^2\) scores for stage 2, stage 3 and stage 4. To resolve the under-fitting problem, we finally convert our linear regression model into a polynomial regression model in which polynomial terms of the independent variables used in (2) are added. For example, if we set the degree of the polynomial regression to 2, the regression model (2) becomes:

$$\begin{aligned} t=\theta _0+\theta _1s^2+\theta _2n+\theta _3c+\theta _4l+\theta _5k+\ldots +\theta _{34}\mathit{lwin}^2+\theta _{35}\,\mathit{lwin} \end{aligned}$$
(3)

As we increase the polynomial degree, the \(R^2\) score on the training dataset increases. However, fitting the training dataset better does not mean the model generalizes well. A regression model falls into over-fitting when it fits the training dataset well but fails to fit the testing dataset. We developed and tested polynomial regression models of degrees 3 to 5 and found that the higher degrees yield worse cross validation scores. Our testing results showed that the modified linear regression model (2) and the degree-2 polynomial model (3) yield comparably the best performance.
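The degree comparison described above can be run with scikit-learn as sketched below, using the same X and y as in the previous sketch (now extended with the nwin and lwin features of (2)); the 5-fold cross-validation is an assumption, since the paper does not state the fold count.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Compare polynomial degrees on held-out data rather than on the training fit alone.
for degree in (1, 2, 3, 4, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"degree {degree}: mean cross-validation R^2 = {cv_r2.mean():.3f}")
```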

5 Experimental Results

To measure performance, our proposed PTP computation models were implemented in the Infernal cmsearch code, which is written in C, with both Pthreads and OpenMP. Our experiments were conducted on a PowerEdge T620 server with two 8-core Intel Xeon processors (2.60 GHz) and 32 GB of memory. The data used are the HMP (Human Microbiome Project) [24] gastrointestinal_tract database (Dec. 2012 version, 61,238 sequences, 2.49 GB) and the Rfam [21] bacterial small RNA families (1,105 CM files from Rfam version 12.0). The HMP database is divided into 20 approximately equal sized pieces (\(\sim \)125 MB each) and we randomly selected 200 combinations, each consisting of a CM file and a piece of the HMP database, for our experiments. To develop the regression models described in Sect. 4, we tested 2,000 randomly selected combinations (cmsearch jobs). Since the PTP computation model is pipelined parallel, we tested different block sizes and determined that a block of \(\sim \)10% (\(\sim \)12 MB) of the given input database file (\(\sim \)125 MB) yields the maximum performance.

Figure 5 shows the performance of the pipelined task parallel (PTP) computation model described in Sect. 4.1 and of the PTP computation model with static load balancing described in Sect. 4.2. We use the original Infernal cmsearch serial pipeline as the baseline for comparison since the PTP computation model accelerates the operation within the pipeline. As shown in the figure, the proposed PTP computation models yield a considerable performance gain. On average, the PTP computation model in which each of the four pipeline stages is assigned to one thread (4 threads in total) yields a 1.56\(\times \) speedup and the PTP model with static load balancing in which stage 3 is assigned 3 threads (6 threads in total) yields a 2.04\(\times \) speedup over the baseline serial pipeline.

Fig. 5. Performances of the pipelined task parallel (PTP) computation models

Fig. 6. Performance comparison of PTP with static/dynamic load balancing

Figure 6 shows the performance difference between the PTP with static load balancing and the PTP with dynamic load balancing described in Sect. 4.3. As shown in the figure, the dynamic load balancing method generally achieves higher performance than the static approach, and the modified linear regression model (2) and the degree-2 polynomial regression model yield similar results, i.e., 3.41\(\times \) and 3.20\(\times \) speedups, respectively. The average number of threads used in the PTP-DL model is 8. The abnormalities visible in the figure are due to the accuracy of the current prediction scheme used in the dynamic load balancing. As described in [10], predicting the execution time of a cmsearch job is a highly difficult and complex task, and we are working toward a more accurate prediction scheme.

Fig. 7. Performance from doubling the threads in each stage of PTP-DL

Fig. 8. Performance with multiple data parallel paths with PTP-DL

Lastly, we checked the scalability and flexibility (compatibility) of the PTP computation model, specifically the PTP with dynamic load balancing (hereafter PTP-DL) using the modified linear regression model (2). To check scalability, we compared the PTP-DL model to a version with double the number of threads in each stage. Figure 7 shows the result of this experiment, i.e., a 4.90\(\times \) speedup. Flexibility was checked by integrating our computation model into the original data parallel approach that Infernal cmsearch provides as an option, i.e., each data parallel path is replaced with our PTP-DL. We implemented versions with two and three data paths, and Fig. 8 shows the resulting performance, i.e., 5.32\(\times \) and 6.29\(\times \) speedups, respectively. As shown in Figs. 7 and 8, integrating the proposed computation model with doubled data parallel paths yields a higher performance gain than doubling the number of threads in each stage of a single PTP-DL path.

6 Conclusion

We proposed high performance computation models for accelerating the Infernal cmsearch module, an RNA sequence searching tool based on the secondary structure of RNA. The proposed approach is based on a pipelined task parallel strategy in which the serial steps of the original tool are parallelized in a pipelined parallel, i.e., systolic, manner. To exploit maximum parallelism, we included both static and dynamic load balancing strategies to relieve heavily loaded stages of the pipeline. To develop the dynamic load balancing strategy, we used heuristic regression models to predict the execution time of each stage of the pipeline.

In our practice, we implemented and tested several versions of the proposed high performance computation model with the Rfam bacterial small RNA CM files and the HMP gastrointestinal_tract database as input. Experimental results showed that 1.56\(\times \) and 2.04\(\times \) average speedups were achieved with simple pipelining and with pipelining plus static load balancing, respectively. With dynamic load balancing, a 3.41\(\times \) average speedup was achieved. The proposed computation models are scalable and flexible enough to be integrated with straightforward data parallel approaches, i.e., a hybrid of data parallel and our proposed pipelined task parallel approaches. With the hybrid model using 2 and 3 data paths, average speedups of 5.32\(\times \) and 6.29\(\times \), respectively, were achieved.