Abstract
Graphics processing units (GPUs) have become popular high-performance computing platforms for a wide range of applications, and processing graph structures on modern GPUs has attracted increasing interest in recent years. This article reviews research on adapting the massively parallel architecture of GPUs to accelerate fundamental graph operations. Despite the merits of GPUs, factors such as their unique architecture, limited programming models, and the irregular structure of graphs prevent GPU implementations from achieving high performance. This survey therefore also discusses the challenges and the optimization techniques that recent studies use to fully utilize GPU capability. Finally, it categorizes the existing research according to the specific issues each work attempts to solve.
Cite this article
Tran, HN., Cambria, E. A survey of graph processing on graphics processing units. J Supercomput 74, 2086–2115 (2018). https://doi.org/10.1007/s11227-017-2225-1
DOI: https://doi.org/10.1007/s11227-017-2225-1