Scalable GPU graph traversal

Authors:

Michael Garland,

Andrew Grimshaw

PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

Pages 117 - 128

https://doi.org/10.1145/2145816.2145832

Published: 25 February 2012

Abstract

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.

We present a BFS parallelization built on fine-grained task management, constructed from efficient prefix sum, that achieves asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single-GPU and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
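To make the role of prefix sum concrete, the following is a minimal sequential sketch in Python of a level-synchronous BFS in which an exclusive prefix sum over frontier degrees assigns each vertex a private slice of the gathered neighbor list. It is an illustration under stated assumptions, not the authors' GPU kernels: the function name `bfs_prefix_sum`, the CSR array names `row_offsets` and `col_indices`, and the serial filtering step are all hypothetical choices made for this example.

```python
from itertools import accumulate


def bfs_prefix_sum(row_offsets, col_indices, source):
    """Level-synchronous BFS over a CSR graph (illustrative sketch).

    Returns the BFS depth of every vertex, or -1 if it is unreachable.
    """
    n = len(row_offsets) - 1
    depth = [-1] * n
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Each frontier vertex reports how many neighbors it will expand.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # Exclusive prefix sum: offsets[i] is where vertex i writes its
        # neighbors, and offsets[-1] is the total size of the output.
        offsets = [0] + list(accumulate(degrees))
        gathered = [0] * offsets[-1]
        # On a GPU these writes could proceed in parallel, because the
        # slices [offsets[i], offsets[i + 1]) never overlap.
        for i, v in enumerate(frontier):
            start = row_offsets[v]
            for j in range(degrees[i]):
                gathered[offsets[i] + j] = col_indices[start + j]
        # Filter out already-visited vertices (done serially here).
        level += 1
        next_frontier = []
        for u in gathered:
            if depth[u] == -1:
                depth[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return depth


# Example graph: edges 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
row_offsets = [0, 2, 3, 4, 4, 4]
col_indices = [1, 2, 3, 3]
print(bfs_prefix_sum(row_offsets, col_indices, 0))  # [0, 1, 1, 2, -1]
```

Each vertex enters the frontier at most once and each edge is gathered at most once, so the total work in this sketch is linear in |V|+|E|, which is the bound the abstract refers to.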

Published In

PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming

February 2012

352 pages

ISBN:9781450311601

DOI:10.1145/2145816

General Chair: J. Ramanujam, Louisiana State University, USA
Program Chair: P. Sadayappan, The Ohio State University, USA

ACM SIGPLAN Notices Volume 47, Issue 8
PPOPP '12
August 2012
334 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2370036

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PPoPP '12

PPoPP '12: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 25 - 29, 2012

New Orleans, Louisiana, USA

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Cited By

Yanhaona M, Grimshaw A, Mickey S (2024). HighP5: Programming using Partitioned Parallel Processing Spaces. Journal of the Brazilian Computer Society 30:1 (653-687). Online publication date: 17-Dec-2024.
https://doi.org/10.5753/jbcs.2024.4345
Swann R, Osama M, Sangaiah K, Mahmud J, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). Seer: Predictive Runtime Kernel Selection for Irregular Problems. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (133-142). Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444812
Mustafa D, Alkhasawneh R, Obeidat F, Shatnawi A (2024). MIMD Programs Execution Support on SIMD Machines: A Holistic Survey. IEEE Access 12 (34354-34377). Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3372990
Mahajan M, Nagi R (2024). GPU-accelerated transportation simplex algorithm. Journal of Parallel and Distributed Computing 184:C. Online publication date: 1-Feb-2024.
https://dl.acm.org/doi/10.1016/j.jpdc.2023.104790
Fallin A, Gonzalez A, Seo J, Burtscher M, Mohror K, Arnold D, Badia R (2023). A High-Performance MST Implementation for GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-13). Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607093
Osama M, Porumbescu S, Owens J, Dehnavi M, Kulkarni M, Krishnamoorthy S (2023). A Programming Model for GPU Load Balancing. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (79-91). Online publication date: 25-Feb-2023.
https://dl.acm.org/doi/10.1145/3572848.3577434
Subhash V, Pandey K, Natarajan V (2023). A GPU Parallel Algorithm for Computing Morse-Smale Complexes. IEEE Transactions on Visualization and Computer Graphics 29:9 (3873-3887). Online publication date: 1-Sep-2023.
https://doi.org/10.1109/TVCG.2022.3174769
Hu Y, Zhang F, Xia Y, Yao Z, Zeng L, Ding H, Wei Z, Zhang X, Zhai J, Du X, Ma S (2023). Enabling Efficient Random Access to Hierarchically Compressed Text Data on Diverse GPU Platforms. IEEE Transactions on Parallel and Distributed Systems 34:10 (2699-2717). Online publication date: Oct-2023.
https://doi.org/10.1109/TPDS.2023.3294341
Zeng L, Zou L, Özsu M (2023). SGSI – A Scalable GPU-Friendly Subgraph Isomorphism Algorithm. IEEE Transactions on Knowledge and Data Engineering 35:11 (11899-11916). Online publication date: 1-Nov-2023.
https://doi.org/10.1109/TKDE.2022.3230744
Zheng Z, Shi X, Jin H (2023). Parallel Overlapping Community Detection Algorithm on GPU. IEEE Transactions on Big Data 9:2 (677-687). Online publication date: 1-Apr-2023.
https://doi.org/10.1109/TBDATA.2022.3180360
