FSGraph: fast and scalable implementation of graph traversal on GPUs

  • Regular Paper
CCF Transactions on High Performance Computing

Abstract

Graphs are one of the best ways to express and process association relationships, and they are widely used in applications such as social networks, fraud detection, and the Internet of Things. Breadth-First Search (BFS) is a typical graph traversal algorithm, but its performance on a GPU is unsatisfactory because of strong data dependency, intensive irregular memory access, and low computation intensity. On multiple GPUs, the situation is even worse because of unbalanced data partitioning and high communication-to-computation ratios. In this paper, we present FSGraph, a fast and scalable BFS implementation on GPUs. FSGraph uses three optimization techniques: a GPU-friendly Compressed Sparse Row (CSR) structure, bidirectional one-dimensional (1D) partitioning, and UM-aware communication. We evaluate our work with extensive experiments on four T4 and four V100 GPUs. The average BFS performance on four T4 GPUs is 132.67 Giga-Traversed Edges per Second (GTEPS), up to a 1.44\(\times\) improvement over a single T4. On four V100 GPUs, BFS achieves 392.35 GTEPS and outperforms an existing CPU-based cluster with 1024 nodes on the November 2022 Graph500 list.
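
For readers unfamiliar with the terms used in the abstract, the following is a minimal CPU-side sketch of a textbook CSR layout, a level-synchronous BFS over it, and the GTEPS metric. It is not the GPU-friendly CSR variant or the partitioning scheme of FSGraph, whose details are not given here; the function names `build_csr`, `bfs_levels`, and `gteps` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def build_csr(num_vertices: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build a plain CSR (row_offsets, col_indices) for an undirected graph."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # row_offsets[v] .. row_offsets[v + 1] delimits the neighbors of v.
    row_offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_offsets[v + 1] = row_offsets[v] + degree[v]

    col_indices = [0] * row_offsets[-1]
    cursor = row_offsets[:-1].copy()  # next free slot per vertex
    for u, v in edges:
        col_indices[cursor[u]] = v
        cursor[u] += 1
        col_indices[cursor[v]] = u
        cursor[v] += 1
    return row_offsets, col_indices


def bfs_levels(row_offsets: List[int], col_indices: List[int], source: int) -> List[int]:
    """Level-synchronous top-down BFS; unreachable vertices keep level -1."""
    num_vertices = len(row_offsets) - 1
    level = [-1] * num_vertices
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            # Scan the contiguous neighbor range of u in the CSR arrays.
            for i in range(row_offsets[u], row_offsets[u + 1]):
                w = col_indices[i]
                if level[w] == -1:
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier
    return level


def gteps(traversed_edges: float, seconds: float) -> float:
    """Giga-Traversed Edges per Second: edges / time / 1e9."""
    return traversed_edges / seconds / 1e9


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    offsets, cols = build_csr(6, edges)  # vertex 5 is isolated
    print(bfs_levels(offsets, cols, source=0))
```

The neighbor scan over a contiguous index range is what makes CSR compact, and the data-dependent reads of `level[w]` illustrate the irregular memory access the abstract refers to.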

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

This work was supported by the National Key Research and Development Program (Grant No. 2022YFB4501404), the Beijing Natural Science Foundation (4232036), and the CAS Project for Youth Innovation Promotion Association.

Author information

Corresponding author

Correspondence to Huawei Cao.

Ethics declarations

Conflict of interest

No potential conflict of interest was reported by the authors.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Cao, H., Liang, Y. et al. FSGraph: fast and scalable implementation of graph traversal on GPUs. CCF Trans. HPC 5, 277–291 (2023). https://doi.org/10.1007/s42514-023-00155-x

  • DOI: https://doi.org/10.1007/s42514-023-00155-x