ABSTRACT
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter.
We present a BFS parallelization built on fine-grained task management using efficient prefix sum, which achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
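The role of prefix sum in keeping the traversal at O(|V|+|E|) work can be illustrated with a small sequential sketch. It is not the authors' GPU implementation. It assumes a graph in compressed sparse row (CSR) form, and the names `bfs_prefix_sum`, `row_offsets`, and `column_indices` are hypothetical choices made for this illustration. At each level, an exclusive prefix sum over the degrees of the frontier vertices gives every vertex a private range of the output array, so neighbor lists can be written without contention, and each vertex and edge is handled once.

```python
from itertools import accumulate


def bfs_prefix_sum(row_offsets, column_indices, source):
    """Level-synchronous BFS over a CSR graph (illustrative sketch only).

    Returns a list with the BFS depth of every vertex, or -1 if the
    vertex is unreachable from `source`.
    """
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    level = 0

    while frontier:
        # Out-degree of each vertex in the current frontier.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]

        # Exclusive prefix sum: offsets[i] is where vertex i starts writing,
        # and offsets[-1] is the total number of edges to gather.
        offsets = [0] + list(accumulate(degrees))
        edge_frontier = [0] * offsets[-1]

        # Gather neighbors. Every iteration writes to a disjoint range, so
        # on a parallel machine each one could run as an independent task.
        for i, v in enumerate(frontier):
            start = row_offsets[v]
            for j in range(degrees[i]):
                edge_frontier[offsets[i] + j] = column_indices[start + j]

        # Keep only vertices not yet visited; they form the next frontier.
        level += 1
        next_frontier = []
        for u in edge_frontier:
            if depth[u] == -1:
                depth[u] = level
                next_frontier.append(u)
        frontier = next_frontier

    return depth


if __name__ == "__main__":
    # Edges: 0->1, 0->2, 1->3, 2->3, 3->4; vertex 5 is isolated.
    row_offsets = [0, 2, 3, 4, 5, 5, 5]
    column_indices = [1, 2, 3, 3, 4]
    print(bfs_prefix_sum(row_offsets, column_indices, 0))
```

Because the frontier at each level contains only newly discovered vertices, the total work across all levels is proportional to the number of vertices plus the number of edges, in contrast to approaches that inspect every vertex or edge at every level.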