Abstract
Graph Neural Networks (GNNs), which learn representations of non-Euclidean data, are rapidly rising in popularity and are used in several computationally demanding scientific applications. As these deep learning models become more prevalent in practical applications, their inference performance becomes increasingly critical. GNNs have been shown to suffer from severe memory and computational bottlenecks on traditional hardware platforms (i.e., GPUs), due in part to their reliance on non-contiguous data structures. While the dataflow architectures used by emerging hardware accelerators offer a potential remedy, end-to-end GNN models are generally not yet supported on these platforms, so it is not currently possible to directly compare the performance of GNNs on traditional GPUs with that on these accelerators. In this work, we analyze the performance of operators relevant to modern GNNs on three platforms: the NVIDIA A100 GPU, the Groq GroqChip1, and the SambaNova Reconfigurable Dataflow Unit (RDU). Specifically, we first profile several modern GNN models on traditional GPUs to determine the operators, fused kernels, and message passing layers most relevant to these architectures. We then systematically benchmark and analyze performance at each of these levels of abstraction on each hardware platform. Our analysis shows that (1) due to their reliance on non-contiguous data, GNNs suffer from cache inefficiency on conventional GPUs, (2) dataflow architectures, due in part to their cache-less design, implicitly optimize the operators pertinent to GNNs, and (3) the RDU and GroqChip1 platforms enable significant inference speedups over a traditional GPU on pertinent subsets of end-to-end GNN networks. Our open-source code is available at https://github.com/ryienh/gnn-ops-benchmark.
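To make the cache-inefficiency point concrete, the following is a minimal PyTorch sketch (our own illustration, not code from the paper or its benchmark repository) of the scatter-style neighborhood aggregation at the heart of GNN message passing. The indexed gather over `edge_index` and the scatter-add into the output buffer are the kind of non-contiguous memory accesses that defeat GPU cache locality; all names here (`mean_aggregate`, the toy graph) are hypothetical.

```python
import torch

def mean_aggregate(x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """Mean-aggregate neighbor features, GraphSAGE-style (illustrative only).

    x          -- node features, shape (num_nodes, feat_dim)
    edge_index -- COO connectivity, shape (2, num_edges); row 0 = source, row 1 = target
    """
    src, dst = edge_index
    # Gather: one non-contiguous read of a row of x per edge.
    messages = x[src]
    # Scatter-add: non-contiguous writes into the per-node output buffer.
    out = torch.zeros_like(x)
    out.index_add_(0, dst, messages)
    # Normalize by in-degree (clamped to avoid division by zero for isolated nodes).
    deg = torch.zeros(x.size(0), dtype=x.dtype, device=x.device)
    deg.index_add_(0, dst, torch.ones_like(dst, dtype=x.dtype))
    return out / deg.clamp(min=1).unsqueeze(-1)

# Toy usage: 4 nodes with 8-dimensional features, 4 directed edges.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])
h = mean_aggregate(x, edge_index)
```

Because `src` and `dst` can address rows of `x` in arbitrary order, neither the gather nor the scatter enjoys the contiguous access patterns that GPU caches are built around; this is the access pattern the paper's operator-level benchmarks probe on each platform.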
Notes
- 1. GroqChip1 currently supports all operators in Table 1; the operators left untested are omitted due to the lack of multi-chip support, which renders benchmarking at scale impractical.
Funding
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hosseini, R. et al. (2023). Exploring the Use of Dataflow Architectures for Graph Neural Network Workloads. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science (R0)