A survey of graph processing on graphics processing units

The Journal of Supercomputing

Abstract

Graphics processing units (GPUs) have become popular high-performance computing platforms for a wide range of applications, and processing graph structures on modern GPUs has attracted increasing interest in recent years. This article reviews research on adapting the massively parallel architecture of GPUs to accelerate fundamental graph operations. Despite the merits of GPUs, factors such as their unique architecture, limited programming models, and the irregular structure of graphs prevent GPU implementations from achieving high performance. This survey therefore also discusses the challenges involved and the optimization techniques that recent studies use to fully utilize GPU capabilities. Finally, it categorizes the existing research according to the specific issues each work attempts to solve.

Author information

Corresponding author

Correspondence to Ha-Nguyen Tran.

Cite this article

Tran, HN., Cambria, E. A survey of graph processing on graphics processing units. J Supercomput 74, 2086–2115 (2018). https://doi.org/10.1007/s11227-017-2225-1

  • DOI: https://doi.org/10.1007/s11227-017-2225-1
