MinimapR: A parallel alignment tool for the analysis of large-scale third-generation sequencing data

https://doi.org/10.1016/j.compbiolchem.2022.107735

Abstract

The development of third-generation sequencing technology has brought significant changes to genomics. Compared with second-generation sequencing methods, third-generation technologies produce reads around 100 times longer, revealing new genomic variations and closing long-standing gaps in the human reference genome. However, the length and high error rate of these reads greatly increase both the data volume and the alignment cost. Traditional data analysis platforms and serial sequence alignment methods cannot deal effectively with large-scale long-read alignment. There is a critical need for a novel data analysis platform that can deliver fast alignment of large-scale sequences. High-performance computing platforms, and efficient, scalable algorithms built on them, have significant potential to impact sequence analysis approaches. This paper presents minimapR, a multi-level parallel long-read alignment tool based on minimap2, a popular third-generation read aligner. MinimapR is built on Ray, a recent high-performance distributed framework. Ray integrates fully with the Python environment and can be easily installed with pip. MinimapR can utilize the power of multiple computing nodes, significantly accelerating alignment without sacrificing sensitivity. MinimapR was tested on 64 nodes and demonstrated a 50-fold increase in speed with 78% parallel efficiency. The source code and user manual of minimapR are freely available at https://github.com/Geehome/minimapR.

Introduction

Continuous advances in sequencing technologies reduce sequencing cost, increase read length, and generate large volumes of genomic data (Rhoads and Au, 2015, Jain and Olsen, 2016, Schuster, 2008, Mardis, 2017, Wetterstrand, 2021). A core challenge in analyzing these data is sequence alignment, which is very time-consuming: the computational cost of alignment algorithms grows with sequence length. However, most existing long-read alignment tools are serial programs with limited performance and poor space-time efficiency, and their memory demand often exceeds the capacity of a single workstation or even a shared-memory multiprocessor. Fast and accurate alignment of large numbers of long reads has therefore become a significant challenge for third-generation sequencing. A simple and common approach is to divide the sequences manually into several files and process the files simultaneously. When the data scale is large, the labour cost is considerable, and manual segmentation is a static allocation with poor flexibility: whenever the number of tasks or the scale of the computation changes, the files must be re-segmented. In addition, manual segmentation can easily cause load imbalance (Abuín et al., 2020).

In this study, a Ray-based (Moritz et al., 2018) variant of minimap2, minimapR, has been developed and compared with both the Spark variant (IMOS) and the original minimap2 implementation (using a sequential approach). The parallel optimization method of minimapR distributes tasks evenly across processors and uses multiple computing nodes to achieve significant acceleration. The method has little impact on the original code, and minimapR can perform any analysis that minimap2 can perform. Our results show that the speed of minimapR increases almost linearly with the number of nodes in a cluster. MinimapR was tested on 128 nodes and demonstrated a 92-fold increase in speed. Moreover, minimapR is 1.7× faster than IMOS (Yousefi et al., 2019). Ray and Spark adopt different memory storage mechanisms, and minimapR has lower memory requirements.
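The speedup figures quoted in this paper can be related to parallel efficiency through the standard definition, efficiency = speedup / number of nodes. The short sketch below applies that definition to the two reported results, assuming each speedup is measured relative to a single node; the function name is illustrative and is not part of minimapR.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Standard definition: achieved speedup divided by the node count."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Figures reported in the text: 50-fold on 64 nodes, 92-fold on 128 nodes.
for speedup, nodes in [(50, 64), (92, 128)]:
    print(f"{nodes} nodes: {parallel_efficiency(speedup, nodes):.0%} efficiency")
```

The 64-node case reproduces the 78% efficiency stated in the abstract; under the same assumption, the 128-node case works out to about 72%.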

Moreover, minimapR can run on computing platforms with different architectures, such as high-performance or distributed cloud computing platforms. Even on a multi-core personal computer, the multi-level parallelism of minimapR can deliver better acceleration than the multithreading of minimap2. Our parallel optimization work can also serve as a reference for parallelizing other sequence alignment tools that use the seed-chain-extend heuristic model (Alser and Rotman, 2020, Li and Homer, 2010, Jain and Koren, 2018, Marx, 2013, The TOP500 Supercomputer Sites, 2020, Wilkinson and Allen, 2005).

We make the following contributions:

  • We design a multi-level parallel long-read alignment tool with excellent performance, robust scalability, and lower memory consumption.

  • To reduce I/O time and memory consumption, each process reads in only its own evenly divided share of the query sequences.

  • To reduce sequence alignment time, we use multi-level parallelism to complete sequence alignment.

  • To avoid contention when multiple processes write results at the same time, each process writes its results to a separate file, and the files are then merged.
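The split, align, and merge pattern in the contributions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the minimapR implementation: a standard-library thread pool stands in for Ray workers, `toy_align` is a hypothetical placeholder for the aligner, and all function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def split_evenly(reads: List[Read], n_parts: int) -> List[List[Read]]:
    """Split reads into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(len(reads), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(reads[start:start + size])
        start += size
    return chunks


def toy_align(chunk: List[Read]) -> List[str]:
    """Hypothetical stand-in for the aligner: emits one text record per read."""
    return [f"{name}\t{len(seq)}" for name, seq in chunk]


def align_in_parallel(
    reads: List[Read],
    n_workers: int,
    align: Callable[[List[Read]], List[str]] = toy_align,
) -> List[str]:
    """Align each chunk in its own worker, keep outputs separate, then merge in chunk order."""
    chunks = split_evenly(reads, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_worker = list(pool.map(align, chunks))  # one output list per worker
    merged: List[str] = []
    for part in per_worker:
        merged.extend(part)
    return merged
```

Because each worker writes only to its own output and the merge happens afterwards, no two workers ever compete for the same output stream. Note that this sketch balances chunks by read count only; since long reads vary widely in length, balancing by total bases may be a better choice in practice.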

The paper is organized as follows. Section 2 introduces related work on minimap2 and the frameworks Ray and Spark. Section 3 describes the parallel implementation of minimap2. Section 4 explains the experiments performed and presents their results. Finally, Section 5 summarizes the results and outlines future research directions.

Section snippets

Minimap2

Minimap2 (Li, 2018, Suzuki and Kasahara, 2018) is a versatile sequence alignment program developed by Heng Li. It uses a seed-chain-extend heuristic model similar to that of minimap (Li, 2015) and improves base-level alignment. Minimap2 also supports spliced read alignment, further extending minimap. The experimental results of minimap2 show that it is capable of alignment speeds more than 30 times faster than other long read or cDNA alignment tools and 3–4 times faster than mainstream

Methods

The minimap2 tool is based on the seed-chain-extend heuristic model. The minimap2-based alignment process can be divided into four steps: index sharing, sequence distribution, sequence alignment, and outputting alignment results. In this section, the parallelization of each step is introduced. The parallel optimization process of minimap2 is shown in Fig. 2. The workflow of optimizing minimap2 in combination with the architecture of Ray is shown in Fig. 3. In this section, we will introduce the

Experimental setup

The experiments were run on a Sugoneasyop parallel system. The configuration of the system's computing nodes is shown in Table 1. Our experiments aligned the datasets to the reference genome with the option -ax map-ont, which uses parameters tuned for Oxford Nanopore reads. Each test was run in triplicate, and the average time was used for comparison.
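As a rough illustration of this setup, the sketch below builds a minimap2 command line with the -ax map-ont option and averages the wall-clock time over three runs. The file names and helper functions are hypothetical, and this is not the benchmarking harness used in the paper.

```python
import statistics
import subprocess
import time
from typing import List


def minimap2_command(reference: str, reads: str) -> List[str]:
    """Build a minimap2 call with the Oxford Nanopore preset (-x map-ont) and SAM output (-a)."""
    return ["minimap2", "-ax", "map-ont", reference, reads]


def mean_runtime(cmd: List[str], repeats: int = 3) -> float:
    """Run cmd `repeats` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Alignments go to stdout; they are discarded here because only timing matters.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Hypothetical usage (requires minimap2 on PATH and real input files):
# print(mean_runtime(minimap2_command("GRCh38.fa", "reads.fastq")))
```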

Input data

Experiments were performed on simulated and real datasets. The reference genome is the human reference genome GRCh38/hg38. The simulated reads

Conclusion

The development of next-generation sequencing technologies has led to reduced sequencing costs, longer read lengths, and larger volumes of data in computational genomics. This has resulted in a critical need for a novel data analysis platform that can deliver fast alignment of large-scale sequences. We chose to parallelize minimap2, a popular third-generation read aligner. MinimapR parallelizes minimap2 using Ray. The results show that minimapR has good scalability. The

Declaration of Competing Interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Acknowledgments

This work was supported by NSFC Grants 62102427 and 61772543.

References (22)

  • A. Rhoads et al.

    PacBio sequencing and its applications

    Genom., Proteom. Bioinforma.

    (2015)
  • J.M. Abuín et al.

    Big data in metagenomics: Apache Spark vs MPI

    PLoS One

    (2020)
  • J.M. Abuín et al.

    SparkBWA: speeding up the alignment of high-throughput DNA sequencing data

    PLoS One

    (2016)
  • Alser, M., Rotman, J., et al. (2020). Technology dictates algorithms: Recent developments in read alignment....
  • Anon. NCBI Sequence Read Archive (SRA). 〈www.ncbi.nlm.nih.gov/sra〉. Accessed...
  • Feng, Z., Qiu, S., et al. (2019, August). Accelerating Long Read Alignment on Three Processors. In Proceedings of the...
  • M. Jain et al.

    Nanopore sequencing and assembly of a human genome with ultra-long reads

    Nat. Biotechnol.

    (2018)
  • M. Jain et al.

    The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community

    Genome Biol.

    (2016)
  • H. Li

    Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

    Bioinformatics

    (2015)
  • H. Li

    Minimap2: pairwise alignment for nucleotide sequences

    Bioinformatics

    (2018)
  • H. Li et al.

    A survey of sequence alignment algorithms for next-generation sequencing

    Brief. Bioinforma.

    (2010)