
1 Introduction

More than a decade ago, bioinformatics analysis experienced a major revolution with the introduction of NGS technologies, able to produce millions of DNA fragments at very high speed [11, 12]. Higher volumes of experimental data have a great impact on multiple areas, where an increase in the number of observations improves the statistical power and hence the accuracy of results. Variability analysis in population genomics is an example where researchers take advantage of large-scale data volumes by incorporating more individuals and increasing DNA coverage in their experiments [16]. Originally driven by theoretical approaches, population genomics needs to test its models and validate its hypotheses using large amounts of sequencing data, which require high performance technologies for their processing. In this context, population genomics has become a data-driven discipline. However, there is a wide gap between data production and current processing capabilities. This divergence results in the need for new methods and solutions in the field of high performance computing.

A typical case of variant analysis involves the identification, cataloging, and mapping of variations at a single position (locus) in a DNA sequence among individuals, also known as single nucleotide polymorphisms or SNPs. The number of tools and their arrangement within analytical workflows depend on each specific study [13], but we can identify three common stages, illustrated in Fig. 1. The first stage prepares the raw input data (preprocess), followed by a main processing stage such as SNP calling (process), and a final stage where the results are arranged for review.

Fig. 1. An example of a generic workflow used in variant analysis.

Bioinformatics tools like Samtools [8], GATK [1] or ANGSD [7] have been designed mainly for detecting SNPs while keeping low false discovery rates, but not for the analysis of variability. In this work we present a coarse-grain parallelization of the software Genotype/Haplotype caller (GH caller) [13]. This program is designed for the analysis of variability and implements the Roesti et al. [14] algorithm to accurately account for the number of effective positions and the frequency of the variants, making it highly suitable for population genomics studies. We show that our parallel implementation can provide a performance advantage when using a data-intensive computing system (cluster).

The remainder of the paper is organized as follows. Section 2 reviews the original serial application, GH caller. Our proposal and MPI implementation are described in Sect. 3, and their performance is analyzed in Sect. 4. Conclusions and future work are presented in Sect. 5.

2 GH Caller

GH caller implements the Roesti algorithm [14], which is based on the algorithms proposed by Hohenlohe et al. [4] and Lynch [10]. The main difference with other SNP callers is that GH caller counts the number of chromosomes that are effectively read per site, which makes it unbiased with respect to the frequency of the variants, both for neutrality tests and for the calculation of nucleotide variability.

As input, GH caller takes pileup or mpileup files, obtained from bam files [9] produced in previous steps of a variant analysis workflow. The pileup/mpileup format is a text-based format that summarizes the base calls of reads aligned to a reference sequence. It describes the base-pair information at each chromosomal position, with one line containing the information of all the individuals at that position (locus). The output of GH caller is a fasta file with two sequences per individual, each with a length equal to the total length of the input sequence in the mpileup file.
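For illustration, a single mpileup line for two individuals could look as follows, with the chromosome, 1-based position and reference base followed, for each individual, by the read depth, read bases and base qualities (the values shown here are invented for this example):

```
chr1   10482   A   5   ..,.,   IIHFC   3   .,C   HH<
```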

Table 1. Execution time obtained by running the original application using different data sets.

Table 1 shows an example of the input and output data sizes, and the execution time of GH caller using these data sets. The first column gives the number of individuals (bam files) used to create the mpileup input file; the second column indicates the total size of the resulting mpileup file. These files were obtained from 50 different bam files, and represent only one chromosome (chromosome 1) of the pig genome. The third column shows the size of the output fasta file, and the last column the elapsed execution time of the original application. These figures show that the execution time increases with the problem size. For the case of 50 individuals and one chromosome (an input file of 283 GBytes), the elapsed time exceeds 60,000 s. It is not hard to foresee that as soon as NGS costs drop further, researchers will be willing to analyze full genomes, which will easily require several days to process. This evidence makes the problem worth analyzing in order to reduce execution times and improve resource usage.

Fig. 2. Independent rows from the input mpileup file can be processed to generate known positions of the output fasta file.

To analyze the original (serial) application, Fig. 2 shows the schema of how data is processed through the main flow of execution. The main steps are the following: first, the application reads a full line from the mpileup file, obtaining all the information needed to apply the SNP algorithm at a given position (locus). Next, for each locus, GH caller applies the SNP calling algorithm and generates the corresponding output. Finally, all results are stored in memory and, at the end of the execution, they are written to disk. Because all information is self-contained in each locus and there are no data dependencies when applying the SNP calling algorithm, the problem is embarrassingly parallel, and hence it can be divided into components that can be executed concurrently. A minimal sketch of this serial flow is shown below; the next section presents the strategy applied to parallelize the application.
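In the following sketch of the serial flow, the types and helper functions are our illustration, not the actual GH caller code:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helpers standing in for GH caller internals.
struct Locus { /* per-individual base calls and qualities at one position */ };
Locus parse_line(const std::string& line);      // split one mpileup row into fields
std::string call_snp(const Locus& locus);       // SNP calling algorithm for one locus

int main(int argc, char* argv[]) {
    std::ifstream mpileup(argv[1]);             // input: one locus per line
    std::vector<std::string> results;           // all results are kept in memory
    std::string line;
    while (std::getline(mpileup, line)) {       // 1. read the full line of one locus
        Locus locus = parse_line(line);         // 2. parse the per-individual information
        results.push_back(call_snp(locus));     // 3. apply the SNP calling algorithm
    }
    std::ofstream fasta(argv[2]);               // 4. write everything to disk at the end
    for (const std::string& r : results) fasta << r;
    return 0;
}
```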

3 Parallelization

The parallel implementation takes advantage of the inherent data parallelism of the Roesti algorithm [14]. As shown in the previous section, the algorithm is applied independently at each site (locus), so the problem is embarrassingly parallel. Thanks to this property, we can split the input data into groups of n loci and distribute them in chunks among the available processing units, where all the computations are performed in parallel. The main parallelization phases are the following:

  1. All processes initialize the parallel environment;

  2. the main process (master) indexes the input file and maps the input data to the other processes (workers);

  3. the parallel computations start asynchronously on each worker. Locally, each process reads and parses a subset of the data (‘Read’ and ‘Parse’), applies the base calling algorithm to these data (‘Process’), and finally writes its partial results to the final fasta output file (‘Write’);

  4. if selected via a command-line option, a synchronization point (barrier) is used to allow the master process to append a reference/outgroup sequence.

Phases 1 and 3 are independent and only need synchronization, hence they are performed in parallel, while phases 2 and 4 are serial and must be performed in order. The master-worker model implemented is illustrated in Fig. 3 and sketched in the code below.

Fig. 3. The master starts by indexing the input file and generating tasks. For each task, a worker reads and parses its data, applies the SNP calling algorithm and writes its output. Finally, if set, the master appends a reference sequence to the output file.
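A minimal sketch of this master-worker scheme follows. It assumes at least one worker; the chunk layout and helper functions are our illustration, and the actual indexing, parsing and calling code is elided:

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

// Hypothetical chunk descriptor: a byte range covering n loci (lines).
struct Chunk { long offset; long length; };          // transferred as 2 MPI_LONGs

std::vector<Chunk> index_input(const char* path);    // hypothetical: phase 2
void process_chunk(const Chunk& c);                  // hypothetical: Read, Parse, Process, Write
void append_outgroup();                              // hypothetical: optional final step

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);                          // phase 1: initialization
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);              // assumes np >= 2 (master + workers)

    if (rank == 0) {                                 // master
        std::vector<Chunk> chunks = index_input(argv[1]);
        std::size_t next = 0;
        MPI_Status st;
        int ready;
        // Dynamic dispatch: one chunk at a time to whichever worker asks next.
        while (next < chunks.size()) {
            MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&chunks[next++], 2, MPI_LONG, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
        }
        Chunk done = {-1, -1};                       // sentinel: no more work
        for (int w = 1; w < np; ++w) {
            MPI_Recv(&ready, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&done, 2, MPI_LONG, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
        }
    } else {                                         // workers: phase 3
        while (true) {
            int ready = 1;
            MPI_Send(&ready, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            Chunk c;
            MPI_Recv(&c, 2, MPI_LONG, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (c.offset < 0) break;                 // no more work
            process_chunk(c);                        // Read, Parse, Process, Write
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);                     // phase 4: only if the option is set
    if (rank == 0) append_outgroup();
    MPI_Finalize();
    return 0;
}
```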

Domain Decomposition

Our application uses the standard Message Passing Interface (MPI) [3] and distributes the data among the available processors by mapping subsets of data (hereinafter “chunks”) in a uniform way, giving one to each worker process at a time. The index and map operations are carried out by the master process. All workers open the input file and read their assigned part of the data. Finally, they process their data using the algorithm described in the previous section (Fig. 4).

Fig. 4. The master performs a domain decomposition, mapping subsets of data to each worker.
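Because each locus occupies exactly one line of the mpileup file, chunk boundaries must fall on line breaks. The following sketch (ours, under that assumption, not the actual GH caller code) shows how an interior byte offset can be aligned to the start of the next line, so that no locus is split between two workers:

```cpp
#include <cstdio>

// Hypothetical helper: advance an interior byte offset to the first byte
// after the next newline, so that a chunk boundary never splits the single
// line that describes one locus. Offset 0 already starts at a line.
long align_to_next_line(std::FILE* f, long offset) {
    std::fseek(f, offset, SEEK_SET);
    int c;
    while ((c = std::fgetc(f)) != EOF && c != '\n') {
        // skip the remainder of a partially covered line
    }
    return std::ftell(f);
}
```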

This master-worker model provides two main advantages. First, it allows the implicit parallelization of input/output operations across all workers. Second, as not all chunks need the same time to be processed, the master process can apply a load balancing strategy that distributes chunks dynamically across the workers. Furthermore, a chunk does not need to be replicated in the local memory of any other process, and the memory it uses is released as soon as its data is written to disk: this allows GH caller to work with data sets without size limits while, at the same time, exploiting the whole memory available to the set of allocated processors.

Pre-calculating Complex Operations

As described in Sect. 2, to complete the SNP calling process the same formulas must be applied to all positions (loci). Since the scope of these operations is well known and limited to a set of numbers, we created an array that holds a set of pre-computed results and used it as a Look-Up Table (LUT). This approach transforms a set of slow, complex calculations into faster memory accesses. As a result, we removed a fraction of the high CPU cost accumulated over all the SNP calling iterations.
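The exact expressions tabulated in GH caller are not detailed here; as one plausible illustration of the technique, the per-base error probability derived from a Phred-encoded quality character can be pre-computed once for all 256 possible characters:

```cpp
#include <cmath>

// Illustrative LUT (our example): Phred+33 quality characters span a small
// ASCII range, so the per-base error probability 10^(-Q/10) can be
// tabulated once instead of being recomputed with pow() for every base of
// every locus.
static double phred_to_perr[256];

void init_lut() {
    for (int c = 0; c < 256; ++c) {
        int q = c - 33;                              // Phred+33 encoding
        phred_to_perr[c] = (q > 0) ? std::pow(10.0, -q / 10.0) : 1.0;
    }
}

// The hot loop then replaces a pow() call with one memory access:
//   double perr = phred_to_perr[(unsigned char)quality_char];
```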

4 Experimentation and Results

The main objective of the experiments proposed here is to evaluate the benefits of our parallel approach by comparing the serial and parallel versions, in terms of effective usage of variable computational resources, on the same data.

Experimental Platform

The experimental platform used is a cluster composed of 8 nodes. Each node has 12 cores in 2 Intel Xeon X5660 processors running at 2.8 GHz, and 96 GBytes of RAM. The total number of available cores on this platform is 96, with 768 GBytes of RAM. The nodes are connected via 10 Gbit Ethernet links and run Scientific Linux version 6.3.

Our application was evaluated using the Lustre parallel file system, version 2.5.3, configured with 1 Metadata Server (MDS) and 2 Object Storage Servers (OSS), connected to the same 10 Gbit Ethernet network. These Lustre servers manage up to 8 Object Storage Targets (OST, which store the data) with a size of 10 TBytes each. The OSTs are served by two storage arrays (SAN) via 8 Gbit/s Fibre Channel links. A total of 80 TBytes of space is available to the entire cluster, with a default stripe size of 1 MByte. We used mpicxx from OpenMPI version 1.8.8 and gcc 4.9.1 to compile all the codes, with -O3 optimization enabled.

Experimental Setup

The performance of GH caller is evaluated using execution time, speedup, and efficiency. These metrics are important because reducing execution time is the main motivation for parallelizing our application, speedup tells us how our solution performs as the number of resources scales up, and efficiency indicates how well the resources are being used by our implementation.

First, we obtain the execution time of the original (serial) GH caller using only one core (\(T_{1}\)). This value is used as a baseline to measure the gains with respect to the execution time of our new parallel version (\(T_{N}\)). Although there are different definitions of speedup [15], we define the relative speedup obtained by the parallel program Q, when solving an instance I of size i, using NP processes as follows:

$$\begin{aligned} RelativeSpeedup(I,Q) = \frac{T_{(I,1)}}{T_{(I,NP)}} \end{aligned}$$
(1)

where \(T_{(I,1)}\) is the time to solve I using program Q and 1 process. Because our application uses a master-worker scheme, this value corresponds to \(T_{(I,2)}\) (2 processes using 2 cores). \(T_{(I,NP)}\) is the time to solve I using program Q and NP processes.
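For reference, the efficiency reported below follows the standard definition, i.e., the relative speedup divided by the number of processes used:

$$\begin{aligned} Efficiency(I,Q) = \frac{RelativeSpeedup(I,Q)}{NP} \end{aligned}$$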

We performed each experiment ten times and kept the average execution time. In all the experiments we used the case of 10 individuals, all from chromosome 1 of the pig genome. The sizes of the input and output files used in this example are shown in the first row of Table 1.

Execution Time and Scalability

Table 2 shows all the results obtained from the serial and parallel executions, scaling the parallel implementation from 2 up to 96 cores, and reporting the execution times, relative speedup, efficiency, and gains with respect to the original version. As can be observed, execution times have been dramatically reduced thanks to the parallelization, while maintaining high efficiency figures (above \(80\%\)) up to 64 processes, where the benefits of adding more resources begin to decrease.

Table 2. Results obtained running GH caller scaling up to 96 cores.
Fig. 5. Execution time for an increasing number of workers on the cluster, up to 96 processes.

To clearly show the gains obtained with respect to the original version, Fig. 5 plots and compares the execution times of the different implementations of GH caller. The first column represents the original serial implementation, whereas the next columns are results from parallel executions. As can be observed, using only 2 processes (a master and a worker), the parallel implementation reduces total execution time by 6x, due to improvements made in the SNP algorithm's internal calculations, the parallelization of input/output tasks, and the use of buffers to reduce the total number of input/output operations. Furthermore, comparing the execution time using 64 cores, our results show a significant reduction with respect to the original version, from 13,000 s to nearly 50 s. From these results, we observe that the execution time can be reduced by a factor of up to 260x by adding computational resources, using 64 cores and MPI processes. Nevertheless, these gains start to decrease when using all the available computational resources, a total of 96 cores, revealing scalability issues.

Fig. 6. Relative speedup and efficiency for an increasing number of workers on the cluster. The MPI configuration starts with 2 MPI processes: one acting as master and one worker performing the heavier computational tasks.

To show these issues plainly, Fig. 6 plots the relative speedup and the efficiency. The dashed lines in these figures represent linear speedup and ideal efficiency, respectively. As described above, there are no data dependencies when processing each chunk, so we can expect a linear speedup when scaling up the number of resources. However, we observe scalability issues when using more than 64 processes, and these problems can be appreciated more clearly by looking at the efficiency values obtained. In fact, the relative speedup obtained with 96 processes is limited to 39, which corresponds to a low efficiency of only 42%. This low efficiency is discussed in the next section.

Fig. 7. Total CPU time used in each phase. Ideally, it should be equal in all executions independently of the number of processes.

Bottleneck Analysis

The goal of this section is to distinguish the overhead introduced by each phase and identify its main causes. First, like other analysis tools [6], we intercept the MPI calls to instrument the application. We use the MPI Parallel Environment (MPE) library [17] in our parallel application to log custom events and perform further analysis. Since MPE logging is accompanied by very precise time stamps, we used the output log traces created by MPE in clog2 format to profile all the executions and extract the execution time spent in each of the application phases described in Sect. 3.
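As a minimal sketch of this kind of instrumentation (the state name, color, and trace file name are our illustration), a custom MPE state can be logged around each application phase as follows:

```cpp
#include <mpi.h>
#include <mpe.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    MPE_Init_log();

    // Define a begin/end event pair for one application phase ('Write').
    int write_begin = MPE_Log_get_event_number();
    int write_end   = MPE_Log_get_event_number();
    MPE_Describe_state(write_begin, write_end, "write", "red");

    MPE_Log_event(write_begin, 0, "start");
    /* ... write phase of the application ... */
    MPE_Log_event(write_end, 0, "end");

    MPE_Finish_log("ghcaller_trace");   // emits a clog2 trace file
    MPI_Finalize();
    return 0;
}
```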

Figure 7 shows the total CPU time used by all processes in each phase: reading the buffer from disk, parsing the data, applying the SNP calling algorithm (process), and finally writing the results to disk. It clearly shows the impact of contention on the I/O system when we scale up the number of MPI processes. Ideally, with no overhead, the sum of the execution times used by all the processes in each phase should remain constant; however, we observe an overhead in both the read and write phases, which is particularly relevant in the latter. Indeed, when a limited number of processes (\(np<64\)) perform read and/or write operations, the accumulated time spent in all phases remains almost constant. Nevertheless, when we scale up the number of processes, the impact of I/O, especially of write operations, increases execution time more than 2x over the ideal when using 96 processes.

We can conclude that scaling up the number of MPI processes (\(np>64\)) results in I/O contention during the write phase. To mitigate this situation, which leads to poor scalability, future work should be oriented towards improving the write phase so that it is optimal with respect to the characteristics of the underlying distributed file system.
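As an illustration of the direction such an improvement could take (not part of the current implementation), MPI-IO allows all processes to write their portions of the output file in a single collective operation, which implementations like ROMIO can reorganize to match the file system characteristics:

```cpp
#include <mpi.h>
#include <string>

// Hypothetical sketch: every process writes its partial fasta output at a
// known, non-overlapping byte offset with one collective call, letting the
// MPI-IO layer (e.g. ROMIO) aggregate and reorder the accesses.
void collective_write(const std::string& part, MPI_Offset my_offset) {
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.fasta",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, my_offset, part.data(),
                          static_cast<int>(part.size()), MPI_CHAR,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```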

5 Conclusions and Future Work

In this work we have described the parallelization of GH caller using a master-worker strategy with the MPI library. We have shown that our parallel implementation can provide a performance advantage on a data-intensive computing system (cluster), reaching a performance gain factor of 260x using 64 processes and additional optimizations with respect to the initial non-parallelized version. Although the execution time can be considerably reduced by adding more computational resources, we measured a poor efficiency when scaling the number of MPI processes above a certain limit. The main reason is that our implementation is not optimized for parallel I/O and cannot take advantage of existing advanced techniques in this field, leading to poor performance when many (\(np>64\)) MPI processes write to the same output file [2]. Future work could explore applying techniques already employed by specialized input/output libraries in high performance computing, such as ROMIO (an MPI-IO implementation) [5].

Source code and compilation instructions are freely available at https://bioinformatics.cragenomica.es/projects/ghcaller.