
Abstract 


Unlabelled

The most important features of error correction tools for sequencing data are accuracy, memory efficiency and fast runtime. The previous version of BLESS was highly memory-efficient and accurate, but it was too slow to handle reads from large genomes. We have developed a new version of BLESS to improve runtime and accuracy while maintaining small memory usage. The new version, called BLESS 2, has an error correction algorithm that is more accurate than that of BLESS, and the algorithm has been parallelized using hybrid MPI and OpenMP programming. BLESS 2 was compared with five top-performing tools, and it was found to be the fastest when executed on two computing nodes using MPI, with each node containing twelve cores. BLESS 2 also showed at least 11% higher gain while retaining the memory efficiency of the previous version for large genomes.

Availability and implementation

Freely available at https://sourceforge.net/projects/bless-ec

Contact

dchen@illinois.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Bioinformatics. 2016 Aug 1; 32(15): 2369–2371.
Published online 2016 Mar 24. https://doi.org/10.1093/bioinformatics/btw146
PMCID: PMC6280799
PMID: 27153708

BLESS 2: accurate, memory-efficient and fast error correction method



1 Introduction

Correcting errors in sequencing reads is a time-consuming and memory-intensive process. The occurrences of patterns (k-mers in many tools) in reads must be counted, and patterns that occur rarely must be replaced with ones that occur frequently. Storing the patterns requires a lot of memory, and searching for alternative patterns takes a long time for large genomes. Therefore, memory efficiency and fast runtime are as important as accuracy in error correction methods.
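The counting step described above can be sketched in a few lines (illustrative only; BLESS and the other tools discussed here use far more memory-efficient structures than a plain hash table):

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# "ACGT" appears twice in the first read and once in the second.
counts = count_kmers(["ACGTACGT", "ACGTACGA"], k=4)
print(counts["ACGT"])  # 3
```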

To provide a memory-efficient error correction method, BLESS, which uses a Bloom filter as the main data structure, was developed (Heo et al., 2014). While BLESS could generate accurate results with a much smaller amount of memory than previous tools, it was too slow to be applied to reads from large genomes.
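A Bloom filter stores set membership in a fixed-size bit array, which is what keeps BLESS's memory footprint small: it answers "is this k-mer in the set?" with no false negatives and a controllable false-positive rate. A minimal sketch follows; the hash scheme and sizes here are illustrative assumptions, not BLESS's actual implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash functions."""
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

solid = BloomFilter(num_bits=8192, num_hashes=3)
solid.add("ACGTACGTACGTACGTACGTACG")
print("ACGTACGTACGTACGTACGTACG" in solid)  # True
```

An item that was added is always found; an item that was not added is usually, but not always, reported absent, which is the trade-off that buys the small memory footprint.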

Recently, some new error correction methods that can correct errors in a large dataset in a short period of time have been developed (Li, 2015; Song et al., 2014). However, to the best of our knowledge, none of the present tools satisfy all the three constraints (i.e. memory efficiency, runtime and accuracy).

To address the three requirements, we have developed a new version of BLESS. In BLESS 2, the accuracy of the error correction algorithm has been further improved over that of BLESS by adding new algorithmic steps. BLESS 2 corrects errors even in solid k-mers (k-mers that exist multiple times in reads) using the quality score distribution of the input reads; solid k-mers were originally treated as error-free. BLESS 2 also introduces a new algorithm for trimming reads where errors cannot be corrected or corrections are ambiguous. In addition to these accuracy improvements, the overall execution has been parallelized using hybrid MPI and OpenMP programming, which makes BLESS 2 the fastest tool when executed on multiple computing nodes. All these improvements were made without hurting the memory efficiency of the predecessor.

2 Methods

BLESS 2 is parallelized using hybrid MPI and OpenMP programming. Therefore, the process can not only be parallelized on a server with multiple CPU cores and shared memory, but it can also be accelerated further by running it on multiple servers.

The overall BLESS 2 architecture is shown in Figure 1. The grey boxes in the figure represent computing nodes in a cluster. First, Node 1 builds a quality score histogram that is used in the error correction step. Then, all nodes start to fetch input reads to count the occurrences of k-mers. To accelerate the k-mer counting step, we applied MPI to KMC (Deorowicz et al., 2015), one of the fastest and most memory-efficient k-mer counters, and integrated the modified KMC into BLESS 2. In the original KMC, k-mers are sent to one of 512 bins, and the k-mers in each bin are counted separately. In BLESS 2, all of the N nodes invoke the modified KMC, and each node counts the k-mers in 512/N bins.

Fig. 1. Overview of the BLESS 2 structure. Rectangles with dotted lines are processes that are parallelized using OpenMP
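The division of the 512 KMC bins across N nodes can be pictured with a simple contiguous partition. The assignment scheme below is an assumption for illustration; the modified KMC's actual bin-to-node mapping may differ:

```python
NUM_BINS = 512  # KMC distributes k-mers into 512 bins

def bins_for_node(rank, num_nodes):
    """Contiguous slice of bins counted by the node with this rank."""
    per_node = NUM_BINS // num_nodes
    start = rank * per_node
    # The last node also takes any leftover bins.
    end = NUM_BINS if rank == num_nodes - 1 else start + per_node
    return range(start, end)

# With 4 nodes, each node counts the k-mers of 512/4 = 128 bins.
print([len(bins_for_node(r, 4)) for r in range(4)])  # [128, 128, 128, 128]
```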

After KMC finishes, Node 1 collects the outputs of the N nodes and constructs a k-mer occurrence histogram. This histogram is used to determine the threshold for solid k-mers. Each node then selects the k-mers in its private bins whose occurrence counts exceed the threshold, and programs them into a local Bloom filter.
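One common way to derive such a threshold from a k-mer occurrence histogram is to take the first valley between the error peak (low-multiplicity k-mers) and the genomic coverage peak. The heuristic below is illustrative only; BLESS 2's exact rule is described in the paper and supplement:

```python
def solidity_threshold(hist):
    """hist[m] = number of distinct k-mers occurring exactly m times.
    Return the multiplicity at the first valley of the histogram;
    k-mers occurring at least this often are treated as solid."""
    for m in range(2, len(hist) - 1):
        if hist[m] <= hist[m - 1] and hist[m] < hist[m + 1]:
            return m
    return 2  # fallback if no valley is found

# Tall error peak at multiplicity 1, valley at 4, coverage peak at 7.
hist = [0, 900, 300, 80, 40, 60, 120, 200, 150, 70]
print(solidity_threshold(hist))  # 4
```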

Each node’s Bloom filter thus contains the solid k-mers private to that node. The Bloom filter data in each node is broadcast to all the other nodes, and each node reduces all the received data, along with its own Bloom filter data, into a single Bloom filter using a bit-wise OR operation; the result represents all the solid k-mers in the entire input read set. Each node then corrects R/N reads, where R is the total number of input reads, using its local copy of the Bloom filter. Corrected reads from all nodes are concurrently written, using MPI, to a final output file.
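The reduction step amounts to a bitwise OR over the nodes' bit arrays (in MPI terms, this would typically be an all-reduce with a bitwise-OR operator). The sketch below simulates it with plain byte arrays, without any actual message passing:

```python
def merge_bloom_filters(local_filters):
    """OR together per-node Bloom filter bit arrays into one global
    filter; a bit is set if any node set it."""
    merged = bytearray(len(local_filters[0]))
    for bits in local_filters:
        for i, byte in enumerate(bits):
            merged[i] |= byte
    return merged

node_a = bytearray([0b00000001, 0b00010000])
node_b = bytearray([0b00000100, 0b00010001])
print(merge_bloom_filters([node_a, node_b]))  # bytearray(b'\x05\x11')
```

Because Bloom filters support union by OR, the merged filter answers membership queries for the whole read set exactly as if it had been built on one node.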

The accuracy of BLESS 2 has been significantly improved over that of its predecessor. Many solid k-mers can still contain errors, and these were not corrected in BLESS because all solid k-mers were considered error-free. The new algorithm checks whether a solid k-mer should instead be considered erroneous when it contains a base with a quality score below the fifth percentile. In addition, BLESS 2 can remove errors even when it cannot find the right way to correct them: in such a case, the new algorithm examines the locations of weak k-mers in a read and determines the minimum number of bases that need to be trimmed to remove as many weak k-mers as possible.
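To make the trimming idea concrete, assume for this sketch that bases are trimmed from the 3' end of the read (BLESS 2's actual procedure is described in the supplement). Then the minimum trim is fixed by the leftmost weak k-mer, since every k-mer starting at or after that position must be eliminated:

```python
def bases_to_trim(read_len, k, weak_starts):
    """Minimum number of 3'-end bases to trim so that no weak k-mer
    (given by its 0-based start position) remains in the read."""
    if not weak_starts:
        return 0
    # The kept prefix may extend only to the last base before the
    # leftmost weak k-mer is complete.
    keep = min(weak_starts) + k - 1
    return read_len - keep

# 100 bp read, k = 23, weak k-mers starting at positions 70 and 75:
print(bases_to_trim(100, 23, [70, 75]))  # 8
```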

3 Results and discussion

In order to evaluate the performance of BLESS 2, we compared it with five state-of-the-art error correction tools. All the experiments were done on servers, each with two six-core Xeon X5650 CPUs and 24 GB of memory.

ERR194147, a read set from NA12878, was used as the evaluation input. D1 consists of the reads from chr1-3, and D2 is the entire read set. D3 is the C. elegans read set SRR065390, the same dataset that was used in the BFC paper (Li, 2015). How the inputs were prepared, how the error correction tools were run, and how the results were evaluated can be found in the supplementary document.

The input reads were corrected using BLESS, BLESS 2, BFC-KMC (Li, 2015), Lighter (Song et al., 2014), Musket (Liu et al., 2013), QuorUM (Marçais et al., 2013) and SGA (Simpson and Durbin, 2011). The results are summarized in Table 1. We prepared D1 because Musket, QuorUM and SGA could not handle D2 with 24 GB of memory on our server, and BLESS could not finish correcting errors in D2 within 70 h, the maximum allocated run time in our cluster. For D1, BLESS 2 generated the most accurate result; its gain was higher than those of the other tools by at least 11%, and by 20% on average.

Table 1. Error correction results

Data  Software            Gain   Memory (GB)  Runtime (min)
D1    BLESS 2 (1 node)    0.565  3.7          65
      BLESS 2 (2 nodes)   0.565  3.7          41
      BLESS 2 (3 nodes)   0.565  3.7          33
      BLESS 2 (4 nodes)   0.565  3.7          30
      BLESS               0.510  1.1          2,893
      BFC-KMC             0.505  11.2         60
      Lighter             0.439  3.2          54
      Musket              0.454  13.2         133
      QuorUM              0.477  10.5         120
      SGA                 0.448  12.0         396
D2    BLESS 2 (1 node)    0.563  5.6          320
      BLESS 2 (2 nodes)   0.563  5.6          203
      BLESS 2 (3 nodes)   0.563  5.6          160
      BLESS 2 (4 nodes)   0.563  5.6          139
      BFC-KMC             0.496  20.5         274
      Lighter             0.434  13.9         231
D3    BLESS 2 (1 node)    0.833  3.7          7
      BLESS               0.801  0.1          331
      BFC-KMC             0.800  9.8          9
      Lighter             0.776  1.4          4
      Musket              0.751  2.9          24
      QuorUM              0.821  4.1          12
      SGA                 0.751  1.7          55

TP: erroneous bases that are correctly modified; FP: all bases that are incorrectly modified; FN: erroneous bases that are not modified; Gain = (TP - FP)/(TP + FN). Twelve threads were used per node for all the tools.
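The gain metric from the footnote can be computed directly; the counts below are hypothetical, chosen only to illustrate the formula:

```python
def gain(tp, fp, fn):
    """Error-correction gain: (TP - FP) / (TP + FN)."""
    return (tp - fp) / (tp + fn)

# 900 true corrections, 100 false corrections, 100 missed errors.
print(gain(tp=900, fp=100, fn=100))  # 0.8
```

A gain of 1.0 would mean every error was fixed with no collateral damage; false positives subtract from the score, which is why over-aggressive correctors can score below more conservative ones.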

The accuracies of BLESS 2, BFC-KMC and Lighter for D2 were similar to their accuracies for D1, which means the D1 results can represent the accuracy of an error correction tool on the entire human genome. For D3, BLESS 2 again showed the best accuracy among the compared tools. The percentage of trimmed bases rose from 0.9% for D1 and D2 to 1.6% for D3. Since trimming increases the chance that a read aligns to multiple locations, the aligned locations of BLESS 2’s output reads were compared with those of the uncorrected reads; 99.8% of D3 reads were aligned to the same location after error correction. This is discussed in more detail in Section 5.1 of the supplementary document.

The effect of error correction on DNA assembly was also assessed. D3 and its error correction results were assembled using Gossamer (Conway et al., 2012), and the assembly results were compared using QUAST (Gurevich et al., 2013). All the error correction tools helped improve the NG50 length, and the improvement was largest with BLESS 2. Details are presented in Section 5.2 of the supplementary document.

While BLESS 2 consumed the smallest amount of memory for D2, Lighter used the least memory for D1. This is because KMC, which is invoked in BLESS 2, requires a constant 4 GB of memory irrespective of genome size. For D2, the size of the Bloom filter in BLESS 2 was larger than 4 GB, so KMC was no longer the memory bottleneck.

The runtime of BLESS 2 on one computing node was comparable to that of the other methods, and BLESS 2 became the fastest tool when more nodes were available; with four nodes it ran 2.3 times faster than with one. The current version of BLESS 2 reads the input read files three times (for analyzing quality scores, counting k-mers using KMC, and correcting errors), and reading compressed input files consumes a significant amount of time. Since there is no efficient way to read a compressed file in parallel, this part cannot be accelerated even when the number of available nodes increases. In the next version, the KMC source code will be embedded in BLESS 2 so that quality scores can be analyzed while counting k-mers, giving a further speedup.

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

References


