Abstract
The Graphcore Intelligence Processing Unit (IPU) is a newly developed processor type whose architecture does not rely on traditional caching hierarchies. Developed to meet the growing demand of data-centric applications such as machine learning, the IPU pairs each of its numerous cores with a dedicated portion of SRAM, resulting in high memory bandwidth at the price of capacity. The proximity of processor cores and memory makes the IPU a promising platform for experimenting with graph algorithms, since it is their unpredictable, irregular memory accesses that cause performance losses on traditional processors that rely on pre-caching.
This paper tests the IPU’s suitability for algorithms with hard-to-predict memory accesses by implementing a breadth-first search (BFS) that complies with the Graph500 specifications. Precisely because of its apparent simplicity, BFS is an established benchmark: it is a subroutine of a variety of more complex graph algorithms, and it allows comparison across a wide range of architectures.
We benchmark our IPU code on a wide range of instances and compare its performance to state-of-the-art CPU and GPU codes. The results indicate that the IPU delivers speedups of up to \(4{\times }\) over the fastest competing result on an NVIDIA V100 GPU, with typical speedups of about \(1.5{\times }\) on most test instances.
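For readers unfamiliar with the benchmark kernel, the following is a minimal sequential sketch of a level-synchronous BFS that returns a parent array, the kind of output a Graph500-style BFS produces. It is not the IPU implementation evaluated in the paper. The function names `build_csr` and `bfs_parents`, the CSR layout, and the conventions that the root is its own parent and unreached vertices are marked with -1 are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def build_csr(num_vertices: int,
              edges: Sequence[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build an undirected graph in CSR form (hypothetical helper).

    Self-loops are skipped; duplicate edges are kept as given.
    """
    degree = [0] * num_vertices
    for u, v in edges:
        if u == v:
            continue
        degree[u] += 1
        degree[v] += 1

    # offsets[i] .. offsets[i + 1] delimits the neighbor list of vertex i.
    offsets = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        offsets[i + 1] = offsets[i] + degree[i]

    neighbors = [0] * offsets[-1]
    cursor = offsets[:-1]  # next free slot per vertex (slice is a copy)
    for u, v in edges:
        if u == v:
            continue
        neighbors[cursor[u]] = v
        cursor[u] += 1
        neighbors[cursor[v]] = u
        cursor[v] += 1
    return offsets, neighbors


def bfs_parents(offsets: Sequence[int],
                neighbors: Sequence[int],
                root: int) -> List[int]:
    """Level-synchronous top-down BFS returning a parent array.

    Assumed conventions: parent[root] == root, unreached vertices are -1.
    """
    num_vertices = len(offsets) - 1
    parent = [-1] * num_vertices
    parent[root] = root
    frontier = [root]
    while frontier:
        next_frontier = []
        for u in frontier:
            # The neighbor lookups below are the data-dependent, irregular
            # memory accesses that caches handle poorly.
            for v in neighbors[offsets[u]:offsets[u + 1]]:
                if parent[v] == -1:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent
```

On a six-vertex graph with edges (0,1), (0,2), (1,3), (2,3), (4,5) and root 0, `bfs_parents` returns `[0, 0, 0, 1, -1, -1]`: vertices 4 and 5 lie in a different component and stay unreached.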
Notes
- 1. Git commit: 5ee3df5, Online: https://github.com/gunrock/gunrock.
- 2. Git commit: 426846f, Online: https://github.com/iHeartGraph/Enterprise.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Burchard, L., Moe, J., Schroeder, D.T., Pogorelov, K., Langguth, J. (2021). iPUG: Accelerating Breadth-First Graph Traversals Using Manycore Graphcore IPUs. In: Chamberlain, B.L., Varbanescu, A.L., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12728. Springer, Cham. https://doi.org/10.1007/978-3-030-78713-4_16
DOI: https://doi.org/10.1007/978-3-030-78713-4_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78712-7
Online ISBN: 978-3-030-78713-4
eBook Packages: Computer Science (R0)