Abstract
Graphs are one of the most effective ways to express and process association relationships, and they are widely used in applications such as social networks, fraud detection, and the Internet of Things. Breadth-First Search (BFS) is a typical graph traversal algorithm, but its performance on a GPU is unsatisfactory because of strong data dependencies, intensive irregular memory accesses, and low computational intensity. On multiple GPUs, the situation is even worse because of unbalanced data partitioning and high communication-to-computation ratios. In this paper, we present FSGraph, a fast and scalable BFS implementation on GPUs. FSGraph introduces three optimization techniques: a GPU-friendly Compressed Sparse Row (CSR) structure, a bidirectional one-dimensional (1D) partition, and UM-aware communication. We evaluate FSGraph with extensive experiments on four T4 and four V100 GPUs. On four T4 GPUs, BFS averages 132.67 Giga-Traversed Edges per Second (GTEPS), up to a 1.44\(\times\) improvement over a single T4. On four V100 GPUs, BFS achieves 392.35 GTEPS and outperforms an existing 1024-node CPU-based cluster on the November 2022 Graph500 list.
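To make the terms CSR and 1D partition concrete, the following is a minimal CPU-side sketch in Python, not FSGraph's implementation. It assumes a textbook CSR layout, a plain top-down level-synchronous BFS, and a contiguous block assignment of vertices to partitions. The names `build_csr`, `owner`, and `bfs_1d` are hypothetical, and the abstract's GPU-friendly CSR variant, bidirectional partition, and UM-aware communication are not modeled here.

```python
from typing import List, Tuple


def build_csr(num_vertices: int, edges: List[Tuple[int, int]]):
    """Build textbook CSR arrays (row_offsets, col_indices) for an undirected graph."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # row_offsets[v] .. row_offsets[v + 1] delimits the neighbor list of v.
    row_offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_offsets[v + 1] = row_offsets[v] + degree[v]
    col_indices = [0] * row_offsets[-1]
    cursor = row_offsets[:-1].copy()  # next free slot in each neighbor list
    for u, v in edges:
        col_indices[cursor[u]] = v
        cursor[u] += 1
        col_indices[cursor[v]] = u
        cursor[v] += 1
    return row_offsets, col_indices


def owner(v: int, num_vertices: int, num_parts: int) -> int:
    """Assumed 1D partition: each part owns one contiguous block of vertex IDs."""
    chunk = (num_vertices + num_parts - 1) // num_parts
    return v // chunk


def bfs_1d(row_offsets: List[int], col_indices: List[int],
           source: int, num_parts: int) -> List[int]:
    """Level-synchronous top-down BFS; each part stands in for one device."""
    n = len(row_offsets) - 1
    level = [-1] * n  # -1 marks an unvisited vertex
    level[source] = 0
    frontiers = [[] for _ in range(num_parts)]
    frontiers[owner(source, n, num_parts)].append(source)
    depth = 0
    while any(frontiers):
        # Expansion: every part scans the edges of its own frontier vertices
        # and routes each discovered neighbor to the part that owns it.
        outbox = [[] for _ in range(num_parts)]
        for part in range(num_parts):
            for u in frontiers[part]:
                for e in range(row_offsets[u], row_offsets[u + 1]):
                    v = col_indices[e]
                    outbox[owner(v, n, num_parts)].append(v)
        depth += 1
        # Exchange: every part filters the vertices it received and keeps
        # the unvisited ones as its next frontier.
        frontiers = [[] for _ in range(num_parts)]
        for part in range(num_parts):
            for v in outbox[part]:
                if level[v] == -1:
                    level[v] = depth
                    frontiers[part].append(v)
    return level


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    row_offsets, col_indices = build_csr(6, edges)
    print(bfs_1d(row_offsets, col_indices, source=0, num_parts=2))
```

In this sketch the `outbox` lists play the role of inter-device messages: their total size relative to the number of edges scanned gives a rough feel for the communication-to-computation ratio mentioned above, and a skewed degree distribution makes some parts scan far more edges than others.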
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
This work was supported by the National Key Research and Development Program (Grant No. 2022YFB4501404), the Beijing Natural Science Foundation (Grant No. 4232036), and the CAS Project for Youth Innovation Promotion Association.
Ethics declarations
Conflict of interest
No potential conflict of interest was reported by the authors.
About this article
Cite this article
Zhang, Y., Cao, H., Liang, Y. et al. FSGraph: fast and scalable implementation of graph traversal on GPUs. CCF Trans. HPC 5, 277–291 (2023). https://doi.org/10.1007/s42514-023-00155-x
DOI: https://doi.org/10.1007/s42514-023-00155-x