MinimapR: A parallel alignment tool for the analysis of large-scale third-generation sequencing data
Introduction
Continuous advances in sequencing technologies have reduced sequencing costs, increased read lengths, and generated large volumes of genomic data (Rhoads and Au, 2015, Jain and Olsen, 2016, Schuster, 2008, Mardis, 2017, Wetterstrand, 2021). A core challenge in analyzing these data is sequence alignment, which is very time-consuming: the computational cost of alignment algorithms grows with sequence length. However, most existing long-read alignment tools are serial programs with limited performance and poor space-time efficiency, and their significant memory demand often exceeds the capacity of a single workstation or even a shared-memory multiprocessor. Fast and accurate alignment of large-scale long-read data has therefore become a significant challenge for third-generation sequencing analysis. A simple, common approach is to manually divide the reads into several files and then process those files simultaneously. When the data scale is large, the labour cost is considerable, and because manual segmentation is a static allocation, it is inflexible: whenever the number of tasks or the scale of computation changes, the files must be re-segmented. In addition, manual segmentation can easily cause load imbalance (Abuín et al., 2020).
In this study, we developed a Ray (Moritz et al., 2018) variant of minimap2, minimapR, and compared it with both the Spark variant (IMOS) and the original minimap2 implementation (using a sequential approach). The parallel optimization method of minimapR distributes tasks evenly across processors and uses multiple computing nodes to achieve significant acceleration, while requiring few changes to the original code. MinimapR can perform any analysis that minimap2 can. Our results show that the speed of minimapR increases almost linearly with the number of nodes in a cluster: tested on 128 nodes, minimapR achieved a 92-fold speedup. Moreover, minimapR is 1.7× faster than IMOS (Yousefi et al., 2019). Ray and Spark adopt different memory storage mechanisms, and minimapR has lower memory requirements.
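To put the reported scaling in context, a speedup can be converted into parallel efficiency, that is, speedup divided by node count. The following is an illustrative calculation only and is not part of minimapR; the function name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return the speedup obtained per node (1.0 would be ideal linear scaling)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Figures reported in the text: a 92-fold speedup on 128 nodes.
efficiency = parallel_efficiency(92, 128)
print(f"Parallel efficiency: {efficiency:.1%}")
```

An efficiency below 1.0 is expected in practice, since index sharing, read distribution, and result merging add overhead that does not shrink as nodes are added.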
Moreover, minimapR can run on computing platforms with different architectures, such as high-performance computing clusters or distributed cloud platforms. Even on a multi-core personal computer, the multi-level parallelism of minimapR can achieve better acceleration than the multithreading of minimap2. Our parallel optimization work can also serve as a reference for parallelizing other sequence alignment tools that use the seed-chain-extend heuristic model (Alser and Rotman, 2020, Li and Homer, 2010, Jain and Koren, 2018, Marx, 2013, The TOP500 Supercomputer Sites, 2020, Wilkinson and Allen, 2005).
We make the following contributions:
- We design a multi-level parallel long-read alignment tool with excellent performance, robust scalability, and lower memory consumption.
- To reduce I/O time and memory consumption, each process reads in only its own evenly divided portion of the query sequences.
- To reduce sequence alignment time, we use multi-level parallelism to perform the alignment.
- To avoid contention among processes writing results, each process writes its results to a separate file, and the files are then merged.
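The even division of query sequences and the write-separately-then-merge output strategy listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the minimapR implementation: a standard-library thread pool stands in for Ray workers, `align_chunk` is a hypothetical placeholder that records read lengths instead of calling minimap2, and all function and file names are invented for the example.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_evenly(reads, n_parts):
    """Split reads into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(reads), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(reads[start:start + size])
        start += size
    return chunks


def align_chunk(chunk, out_path):
    """Hypothetical stand-in for alignment: write one line per read to this worker's own file."""
    with open(out_path, "w") as fh:
        for name, seq in chunk:
            fh.write(f"{name}\t{len(seq)}\n")
    return out_path


def run(reads, n_workers, out_dir):
    """Process chunks concurrently, one output file per worker, then merge in chunk order."""
    out_dir = Path(out_dir)
    chunks = split_evenly(reads, n_workers)
    part_paths = [out_dir / f"part_{i}.txt" for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(align_chunk, chunks, part_paths))
    merged = out_dir / "merged.txt"
    with open(merged, "w") as out:
        for path in part_paths:
            out.write(path.read_text())
    return merged


# Toy input: ten reads of increasing length (illustrative data only).
reads = [(f"read{i}", "ACGT" * (i + 1)) for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    merged_lines = run(reads, 3, tmp).read_text().splitlines()
print(len(merged_lines))
```

Because each worker owns its output file, no locking is needed during alignment, and merging in chunk order preserves the original read order.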
The paper is organized as follows: In Section 2, we introduce related work on minimap2 and the frameworks Ray and Spark. In Section 3, we describe the parallel implementation of minimap2. In Section 4, we explain the experiments performed and present their results. Finally, in Section 5, we summarize the results and outline future research directions.
Section snippets
Minimap2
Minimap2 (Li, 2018, Suzuki and Kasahara, 2018) is a versatile sequence alignment program developed by Heng Li. It uses a seed-chain-extend heuristic model similar to that of minimap (Li, 2015) and improves base-level alignment. Minimap2 also supports spliced read alignment, extending the capabilities of minimap. The experimental results of minimap2 show that it aligns more than 30 times faster than other long read or cDNA alignment tools and 3–4 times faster than mainstream
Methods
The minimap2 tool is based on the seed-chain-extend heuristic model. The minimap2-based alignment process can be divided into four steps: index sharing, sequence distribution, sequence alignment, and outputting alignment results. In this section, the parallelization of each step is introduced. The parallel optimization process of minimap2 is shown in Fig. 2. The workflow of optimizing minimap2 in combination with the architecture of Ray is shown in Fig. 3. In this section, we will introduce the
Experimental setup
The experiments were run on a parallel system of Sugoneasyop. The configuration of the system's computing nodes is shown in Table 1. Our experiments aligned the datasets to the reference genome with the option -ax map-ont, which uses parameters tuned for Oxford Nanopore reads. Each test was run three times, and the average time was used for comparison.
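As a rough illustration of this setup, the sketch below assembles a minimap2 command line with the -ax map-ont preset and averages three timings. It is an assumption-laden example, not the authors' benchmarking script: the helper names and file names are hypothetical, the timing values are placeholders rather than measurements, and the command is only built, not executed.

```python
from statistics import mean


def build_minimap2_command(reference, reads):
    """Assemble a minimap2 argument list using the Oxford Nanopore preset.

    minimap2 writes SAM output to stdout, so a caller would redirect it to a file.
    """
    return ["minimap2", "-ax", "map-ont", reference, reads]


def average_runtime(timings):
    """Average the wall-clock times of repeated runs, in seconds."""
    if not timings:
        raise ValueError("at least one timing is required")
    return mean(timings)


# Hypothetical file names; substitute real paths before running the command.
command = build_minimap2_command("GRCh38.fa", "reads.fq")
print(" ".join(command))

# Placeholder timings for three repeats (NOT measured values).
print(average_runtime([10.0, 11.0, 12.0]))
```

The command list could be passed to `subprocess.run` with stdout redirected to a SAM file, timing each of the three repeats with `time.perf_counter`.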
Input data
Experiments were performed on simulated and real datasets. The reference genome is the human reference genome GRCh38/hg38. The simulated reads
Conclusion
The development of next-generation sequencing technologies has led to reduced sequencing costs, longer read lengths, and larger volumes of data in computational genomics. This has created a critical need for a data analysis platform that delivers fast alignment of large-scale sequences. We chose to parallelize minimap2, a popular third-generation read aligner: minimapR optimizes minimap2 in parallel based on Ray. The results show that minimapR has good scalability. The
Declaration of Competing Interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Acknowledgments
This work was supported by NSFC Grants 62102427, 61772543.
References (22)
- et al. PacBio sequencing and its applications. Genom., Proteom. Bioinforma. (2015)
- et al. Big data in metagenomics: apache spark vs MPI. PLoS One (2020)
- SparkBWA: speeding up the alignment of high-throughput DNA sequencing data. PLoS One (2016)
- Alser, M., Rotman, J., et al. (2020). Technology dictates algorithms: Recent developments in read alignment....
- Anon. NCBI Sequence Read Archive (SRA). 〈www.ncbi.nlm.nih.gov/sra〉. Accessed...
- Feng, Z., Qiu, S., et al. (2019, August). Accelerating Long Read Alignment on Three Processors. In Proceedings of the...
- et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. (2018)
- et al. The Oxford nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol. (2016)
- Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics (2015)
- Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics (2018)
- A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinforma.