Suffix Array Construction on Multi-GPU Systems

Published: 17 June 2019

Abstract

Suffix arrays are fundamental data structures with a wide range of applications, including bioinformatics, data compression, and information retrieval. Consequently, various algorithms for (parallel) suffix array construction on both CPUs and GPUs have been proposed over the years. Although they provide significant speedups over their CPU-based counterparts, existing GPU implementations share a common disadvantage: input text sizes are limited by the scarce memory of a single GPU. In this paper, we overcome these memory limitations by exploiting multi-GPU nodes featuring fast NVLink interconnects. To achieve high performance for this communication-intensive task, we design a parallel inter-GPU (re-)merging scheme. To handle segments spanning multiple GPUs, we propose an efficient strategy for the merging phase, facilitated by a fast partitioning search. On 8 GPUs, our implementation achieves speedups between 133 and 354 over the sequential CPU-based libdivsufsort and between 30 and 68 over its multi-threaded shared-memory version using 80 threads on 40 CPU cores, for large datasets ranging from 697M to 3159M characters in size. For medium-sized datasets ranging between 104M and 236M characters, our approach yields maximum (minimum) speedups of 11.7 (4.5) and 6.45 (4.5) over the existing single-GPU implementations CUDPP and NVBIO, respectively. We are able to construct the suffix array of a full human genome on a single DGX-1 server within 3.44 seconds, which is faster than the 4.8 seconds previously reported using 1600 cores on 100 nodes of a CPU-based HPC cluster. Our implementation is publicly available at https://gitlab.rlp.net/pararch/multi-gpu-suffix-array/.
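For readers unfamiliar with the two building blocks the abstract names, the following is a minimal single-threaded sketch, not the paper's multi-GPU implementation. It assumes a textbook prefix-doubling construction for the suffix array and a merge-path style binary search as a stand-in for the "partitioning search"; the abstract does not specify either algorithm in detail. The function names `suffix_array` and `merge_partition` are hypothetical and chosen only for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Assumption: plain prefix doubling on the CPU, used here only to show
    what a suffix array is. It is not the paper's algorithm.
    """
    n = len(text)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in text]  # initial rank: the first character
    k = 1
    while True:
        # Compare suffixes by their first 2k characters using the ranks
        # of the two halves; -1 marks "past the end of the text".
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        k *= 2


def merge_partition(a: list, b: list, d: int) -> int:
    """Return i such that a[:i] and b[:d - i] together form the first d
    elements of the stable merge of the sorted lists a and b.

    Assumption: a merge-path style binary search, shown as one plausible
    form of a partitioning search for splitting merge work evenly.
    """
    lo = max(0, d - len(b))
    hi = min(d, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = d - i
        if a[i] <= b[j - 1]:  # a[i] precedes b[j-1]: take more from a
            lo = i + 1
        else:
            hi = i
    return lo


if __name__ == "__main__":
    print(suffix_array("banana"))
    print(merge_partition([1, 3, 5, 7], [2, 4, 6, 8], 4))
```

Each worker that is handed a split point from such a search can merge its own output range independently, which is the usual reason to partition before merging. In a suffix-array setting the comparison would be between suffixes (or their ranks) rather than plain integers.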


Published In

HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
June 2019
278 pages
ISBN:9781450366700
DOI:10.1145/3307681

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CUDA
  2. NVLink
  3. bioinformatics
  4. multi-GPU
  5. parallel
  6. suffix array

Qualifiers

  • Research-article

Acceptance Rates

HPDC '19 Paper Acceptance Rate 22 of 106 submissions, 21%;
Overall Acceptance Rate 166 of 966 submissions, 17%
