MinimapR: A parallel alignment tool for the analysis of large-scale third-generation sequencing data

doi:10.1016/j.compbiolchem.2022.107735

Computational Biology and Chemistry

Volume 99, August 2022, 107735

Abstract

The development of third-generation sequencing technology has brought significant changes to genomics. Compared with second-generation sequencing methods, third-generation technologies produce reads around 100 times longer, revealing new genomic variations and closing long-standing gaps in the human reference genome. However, the excessive length and high error rate of these reads greatly increase the volume of data and the cost of alignment. Traditional data analysis platforms and serial sequence alignment methods cannot deal effectively with large-scale long-read alignment. There is a critical need for a novel data analysis platform that can deliver fast alignment of large-scale sequences. High-performance computing platforms, and efficient, scalable algorithms built on them, have significant potential to impact sequence analysis approaches. This paper presents minimapR, a multi-level parallel long-read alignment tool based on minimap2, a popular third-generation read aligner. MinimapR is built on Ray, a new high-performance distributed framework. Ray integrates fully with the Python environment and can be installed easily with pip. MinimapR can harness multiple computing nodes, significantly accelerating alignment without sacrificing sensitivity. The minimapR tool was tested on 64 nodes and demonstrated a 50-fold increase in speed with 78% parallel efficiency. The source code and user manual of minimapR are freely available at https://github.com/Geehome/minimapR.

Introduction

Continuous advances in sequencing technologies reduce sequencing cost, increase read length, and generate large volumes of genomic data (Rhoads and Au, 2015, Jain and Olsen, 2016, Schuster, 2008, Mardis, 2017, Wetterstrand, 2021). A core challenge in analyzing these data is sequence alignment, which is very time-consuming. The computational and time complexity of sequence alignment algorithms grows with sequence length. However, most existing long-read alignment tools are serial programs with limited performance and poor space-time efficiency. In addition, their significant memory demand often exceeds the capacity of a single workstation or even a shared-memory multiprocessor. Fast and accurate alignment of large numbers of long reads has therefore become a significant challenge for third-generation long-read alignment. A simple and common approach is to divide the sequences manually into several files and then process the files simultaneously. When the data scale is large, the labour cost is considerable, and because manual segmentation is a static allocation, it has poor flexibility: whenever the number of tasks or the scale of the computation changes, the files must be re-segmented. In addition, manual segmentation can easily cause load imbalance (Abuín et al., 2020).
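To make the load-imbalance point concrete, the following minimal sketch compares a static, contiguous split by read count with a greedy split by total bases. The read lengths are invented for illustration, and the greedy heuristic is a generic one chosen for this example; it is not the scheme described for minimapR or in the cited work.

```python
import heapq

def split_by_count(lengths, workers):
    """Static split: contiguous blocks with (nearly) equal numbers of reads."""
    size, extra = divmod(len(lengths), workers)
    parts, start = [], 0
    for w in range(workers):
        end = start + size + (1 if w < extra else 0)
        parts.append(lengths[start:end])
        start = end
    return parts

def split_by_bases(lengths, workers):
    """Greedy split: give the next-longest read to the least-loaded worker."""
    heap = [(0, w) for w in range(workers)]  # (total bases, worker id)
    parts = [[] for _ in range(workers)]
    for n in sorted(lengths, reverse=True):
        load, w = heapq.heappop(heap)
        parts[w].append(n)
        heapq.heappush(heap, (load + n, w))
    return parts

def imbalance(parts):
    """Ratio of the heaviest worker's load to the mean load (1.0 is perfect)."""
    loads = [sum(p) for p in parts]
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical read lengths (bases): two very long reads among short ones.
read_lengths = [90_000, 80_000, 5_000, 4_000, 3_000, 3_000, 2_000, 1_000]
static = imbalance(split_by_count(read_lengths, 4))
greedy = imbalance(split_by_bases(read_lengths, 4))
```

With these made-up lengths the static split places both long reads on the same worker, so its imbalance ratio is higher than that of the greedy split.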

In this study, a Ray (Moritz et al., 2018) variant of minimap2 (minimapR) has been developed and compared with both the Spark variant (IMOS) and the original minimap2 implementation (using a sequential approach). The parallel optimization method of minimapR ensures that tasks are evenly distributed across processors and uses multiple computing nodes to achieve significant acceleration. The method requires few changes to the original code, and minimapR can perform any analysis that can be performed using minimap2. Our results show that the speed of minimapR increases almost linearly with the number of nodes in a cluster. The minimapR tool was tested on 128 nodes and demonstrated a 92-fold increase in speed. Moreover, minimapR is 1.7× faster than IMOS (Yousefi et al., 2019). Ray and Spark adopt different memory storage mechanisms, and minimapR has lower memory requirements.
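The relation between the quoted speedups and parallel efficiency can be checked with a few lines of arithmetic. The sketch below uses the standard definition of efficiency as speedup divided by node count; the value for 128 nodes is simply this ratio applied to the figures above and is not a number reported in the text.

```python
def parallel_efficiency(speedup, nodes):
    """Parallel efficiency E = S / p for a speedup S obtained on p nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes

# Figures quoted in the text: 50-fold on 64 nodes, 92-fold on 128 nodes.
eff_64 = parallel_efficiency(50, 64)    # 0.78125, matching the stated 78%
eff_128 = parallel_efficiency(92, 128)  # 0.71875, derived here, not reported
```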

Moreover, minimapR can run on computing platforms with different architectures, such as high-performance clusters or distributed cloud computing platforms. Even on a multi-core personal computer, the multi-level parallelism of minimapR can deliver better acceleration than the multithreading of minimap2. Our parallel optimization work can also serve as a reference for parallelizing other sequence alignment tools that use the seed-chain-extend heuristic model (Alser and Rotman, 2020, Li and Homer, 2010, Jain and Koren, 2018, Marx, 2013, The TOP500 Supercomputer Sites, 2020, Wilkinson and Allen, 2005).

We make the following contributions:

• We design a multi-level parallel long-read alignment tool with excellent performance, robust scalability, and lower memory consumption.
• To reduce I/O time and memory consumption, each process reads in only its own evenly divided share of the query sequences.
• To reduce sequence alignment time, we use multi-level parallelism to complete sequence alignment.
• To prevent multiple processes from competing to write results, each process writes its results to a separate file, and the files are then merged.
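The second and fourth contributions can be sketched as follows. This is a minimal illustration under stated assumptions, not the minimapR implementation: it uses Python's standard `ThreadPoolExecutor` as a stand-in for Ray workers, keeps the reads in memory rather than having each worker read its share from disk, and `fake_align` is a hypothetical placeholder for the real aligner call.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n_items, n_workers):
    """Return (start, end) index pairs that divide n_items as evenly as possible."""
    size, extra = divmod(n_items, n_workers)
    bounds, start = [], 0
    for w in range(n_workers):
        end = start + size + (1 if w < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def fake_align(read):
    """Hypothetical placeholder for the aligner: one output line per read."""
    name, seq = read
    return f"{name}\t{len(seq)}\n"

def worker(reads, bounds, out_path):
    """Process only this worker's slice and write to its own part file."""
    start, end = bounds
    with open(out_path, "w") as out:
        for read in reads[start:end]:
            out.write(fake_align(read))
    return out_path

def run(reads, n_workers, merged_path):
    """Align in parallel, one part file per worker, then merge in worker order."""
    tmpdir = tempfile.mkdtemp()
    jobs = [(b, os.path.join(tmpdir, f"part{i}.txt"))
            for i, b in enumerate(chunk_bounds(len(reads), n_workers))]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda j: worker(reads, *j), jobs))
    with open(merged_path, "w") as merged:
        for part in parts:  # pool.map preserves job order
            with open(part) as f:
                merged.write(f.read())
            os.remove(part)
    os.rmdir(tmpdir)
    return merged_path
```

Because every worker owns its part file, no locking is needed while aligning; the only serial step is the final concatenation, which also restores the original read order.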

The paper is organized as follows. Section 2 introduces related work on minimap2 and the frameworks Ray and Spark. Section 3 describes the parallel implementation of minimap2. Section 4 explains the experiments performed and presents their results. Finally, Section 5 summarizes the results and outlines future research directions.

Section snippets

Minimap2

Minimap2 (Li, 2018, Suzuki and Kasahara, 2018) is a versatile sequence alignment program developed by Dr. Li Heng. It uses a seed-chain-extend heuristic model similar to that of minimap (Li, 2015) and improves base-level alignment. Minimap2 also extends minimap by supporting spliced read alignment. The experimental results of minimap2 show that it is capable of alignment speeds more than 30 times faster than other long read or cDNA alignment tools and 3–4 times faster than mainstream

Methods

The minimap2 tool is based on the seed-chain-extend heuristic model. The minimap2-based alignment process can be divided into four steps: index sharing, sequence distribution, sequence alignment, and outputting alignment results. In this section, the parallelization of each step is introduced. The parallel optimization process of minimap2 is shown in Fig. 2. The workflow of optimizing minimap2 in combination with the architecture of Ray is shown in Fig. 3. In this section, we will introduce the

Experimental setup

The experiments were run on a Sugoneasyop parallel system. The configuration of the system's computing nodes is shown in Table 1. Our experiments aligned the datasets to the reference genome with the option -ax map-ont, which uses parameters tuned for Oxford Nanopore reads. Each test was repeated three times, and the average time was used for comparison.
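For reference, a minimap2 invocation with this preset could be assembled as in the sketch below. The file names and the helper function are hypothetical; only the `-ax map-ont` option comes from the text, while `-t` (thread count) and `-o` (output file) are standard minimap2 options added here for illustration.

```python
import shlex

def minimap2_command(reference, reads, threads=3, output="aln.sam"):
    """Build the argument list for a minimap2 run with the map-ont preset."""
    return ["minimap2", "-ax", "map-ont",
            "-t", str(threads),  # number of threads
            "-o", output,        # write SAM output to this file
            reference, reads]

# Hypothetical input files; substitute real paths before running.
cmd = minimap2_command("GRCh38.fa", "reads.fq", threads=8)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where minimap2 is installed.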

Input data

Experiments were performed on simulated and real datasets. The reference genome is the human reference genome GRCh38/hg38. The simulated reads

Conclusion

The development of next-generation sequencing technologies has led to reduced sequencing costs, longer read lengths, and larger volumes of data in computational genomics. This has created a critical need for a novel data analysis platform that delivers fast alignment of large-scale sequences. We chose to parallelize minimap2, a popular third-generation read aligner. MinimapR optimizes minimap2 in parallel based on Ray. The results show that minimapR has good scalability. The

Declaration of Competing Interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Acknowledgments

This work was supported by NSFC Grants 62102427, 61772543.

References (22)

A. Rhoads et al.
PacBio sequencing and its applications
Genom., Proteom. Bioinforma.
(2015)
J.M. Abuín et al.
Big data in metagenomics: apache spark vs MPI
PLoS One
(2020)
J.M. Abuín et al.
SparkBWA: speeding up the alignment of high-throughput DNA sequencing data
PLoS One
(2016)
Alser, M., Rotman, J., et al. (2020). Technology dictates algorithms: Recent developments in read alignment....
Anon. NCBI Sequence Read Archive (SRA). 〈www.ncbi.nlm.nih.gov/sra〉. Accessed...
Feng, Z., Qiu, S., et al. (2019, August). Accelerating Long Read Alignment on Three Processors. In Proceedings of the...
M. Jain et al.
Nanopore sequencing and assembly of a human genome with ultra-long reads
Nat. Biotechnol.
(2018)
M. Jain et al.
The Oxford nanopore MinION: delivery of nanopore sequencing to the genomics community
Genome Biol.
(2016)
H. Li
Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences
Bioinformatics
(2015)
H. Li
Minimap2: pairwise alignment for nucleotide sequences
Bioinformatics
(2018)
H. Li et al.
A survey of sequence alignment algorithms for next-generation sequencing
Brief. Bioinforma.
(2010)
