research-article

Suffix Array Construction on Multi-GPU Systems

Authors: Florian Büren, Daniel Jünger, Christian Hundt, Bertil Schmidt

HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing

Pages 183 - 194

https://doi.org/10.1145/3307681.3325961

Published: 17 June 2019

Abstract

Suffix arrays are fundamental data structures in a wide range of applications, including bioinformatics, data compression, and information retrieval. Consequently, various algorithms for (parallel) suffix array construction on both CPUs and GPUs have been proposed over the years. Although existing GPU implementations provide significant speedups over their CPU-based counterparts, they share a common disadvantage: input text sizes are limited by the scarce memory of a single GPU. In this paper, we overcome this memory limitation by exploiting multi-GPU nodes featuring fast NVLink interconnects. To achieve high performance for this communication-intensive task, we design a parallel inter-GPU (re-)merging scheme. To handle segments spanning multiple GPUs, we propose an efficient strategy for the merging phase facilitated by a fast partitioning search. On 8 GPUs, our implementation achieves speedups between 133 and 354 over the sequential CPU-based libdivsufsort and between 30 and 68 over its multi-threaded shared-memory version using 80 threads on 40 CPU cores, for large datasets ranging from 697M to 3159M characters in size. For medium-sized datasets ranging between 104M and 236M characters, our approach yields maximum (minimum) speedups of 11.7 (4.5) and 6.45 (4.5) over existing single-GPU implementations (CUDPP, NVBIO). We are able to construct the suffix array of a full human genome on a single DGX-1 server in 3.44 seconds, which is faster than the 4.8 seconds previously reported using 1600 cores on 100 nodes of a CPU-based HPC cluster. Our implementation is publicly available at https://gitlab.rlp.net/pararch/multi-gpu-suffix-array/.
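To make the two central ideas of the abstract concrete, the following is a minimal CPU-only sketch, not the authors' multi-GPU implementation. It shows (1) what a suffix array is, using a naive sort that is only suitable for tiny inputs, and (2) how a binary "partitioning search" can split the merge of two sorted sequences into independent pieces, which is the generic idea behind distributing a merge across several workers. The function names `suffix_array`, `co_rank`, and `partitioned_merge` are hypothetical and chosen for illustration; the abstract does not specify how the paper's search works, so this is a generic merge-path-style partition under that assumption.

```python
from heapq import merge


def suffix_array(text):
    """Naive reference construction (illustration only): sort the start
    positions of all suffixes by the suffix that begins there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def co_rank(k, a, b):
    """Partitioning search: return i such that the first k elements of the
    stable merge of sorted lists a and b are exactly a[:i] and b[:k - i].
    Ties are resolved by taking elements of `a` first."""
    lo = max(0, k - len(b))   # at least this many must come from a
    hi = min(k, len(a))       # at most this many can come from a
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i             # number of elements taken from b (j >= 1 here)
        if a[i] <= b[j - 1]:
            lo = i + 1        # a[i] precedes b[j-1], so take more from a
        else:
            hi = i
    return lo


def partitioned_merge(a, b, parts):
    """Merge sorted lists a and b in `parts` independent pieces.
    Each piece could be handed to a separate worker (assumption: in a
    multi-GPU setting a worker would be one device); here the pieces
    are processed sequentially."""
    total = len(a) + len(b)
    bounds = [total * p // parts for p in range(parts + 1)]
    cuts = [co_rank(k, a, b) for k in bounds]
    out = []
    for p in range(parts):
        i0, i1 = cuts[p], cuts[p + 1]
        j0, j1 = bounds[p] - i0, bounds[p + 1] - i1
        out.extend(merge(a[i0:i1], b[j0:j1]))
    return out
```

For example, `suffix_array("banana")` returns `[5, 3, 1, 0, 4, 2]`, and `partitioned_merge(a, b, parts)` returns the same list as `sorted(a + b)` for any sorted inputs and any `parts >= 1`. Because each piece depends only on its own two cut points, the pieces have near-equal output sizes and need no coordination once the cuts are known.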

Published In

HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing

June 2019

278 pages

ISBN: 9781450366700

DOI: 10.1145/3307681

General Chair: Jon Weissman (University of Minnesota, USA)

Program Chairs: Ali R. Butt (Virginia Tech, USA), Evgenia Smirni (College of William and Mary, USA)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

University of Arizona
SIGHPC: ACM Special Interest Group on High Performance Computing
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

HPDC '19: The 28th International Symposium on High-Performance Parallel and Distributed Computing

June 22 - 29, 2019

Phoenix, AZ, USA

Acceptance Rates

HPDC '19 paper acceptance rate: 22 of 106 submissions (21%)

Overall acceptance rate: 166 of 966 submissions (17%)

Cited By

A. Weißenberger and B. Schmidt. 2024. Massively Parallel Inverse Block-sorting Transforms for bzip2 Decompression on GPUs. In Proceedings of the 53rd International Conference on Parallel Processing. 856--865. Online publication date: 12 August 2024. https://dl.acm.org/doi/10.1145/3673038.3673067

Z. Lu, L. Guo, J. Chen, and R. Wang. 2023. Reference-based genome compression using the longest matched substrings with parallelization consideration. BMC Bioinformatics, Vol. 24, 1. Online publication date: 30 September 2023. https://doi.org/10.1186/s12859-023-05500-z

R. Kobus, J. Nelgen, V. Henkys, and B. Schmidt. 2023. Faster Segmented Sort on GPUs. In Euro-Par 2023: Parallel Processing. 664--678. Online publication date: 24 August 2023. https://doi.org/10.1007/978-3-031-39698-4_45

S. Goswami. 2023. Memory-Efficient All-Pair Suffix-Prefix Overlaps on GPU. In Computational Science – ICCS 2023. 624--638. Online publication date: 3 July 2023. https://dl.acm.org/doi/10.1007/978-3-031-35995-8_44
