Abstract
Graph Neural Networks (GNNs), which learn representations of non-Euclidean data, are rapidly rising in popularity and are used in several computationally demanding scientific applications. As these deep learning models become more prevalent in practical applications, their inference performance becomes increasingly critical. GNNs have been shown to suffer from severe memory and computational bottlenecks on traditional hardware platforms (i.e., GPUs), due in part to their reliance on non-contiguous data structures. While the dataflow architectures used by emerging hardware accelerators offer a potential remedy, end-to-end GNN models are generally not yet supported on these platforms, so it is not currently possible to directly compare the performance of GNNs on traditional GPUs with that on these accelerators. In this work, we analyze the performance of operators relevant to modern GNNs on three platforms: the NVIDIA A100 GPU, the Groq GroqChip1, and the SambaNova Reconfigurable Dataflow Unit (RDU). Specifically, we first profile several modern GNN models on traditional GPUs to determine the operators, fused kernels, and message passing layers most relevant to these architectures. We then systematically benchmark and analyze performance at each of these levels of abstraction on each hardware platform. Our analysis shows that (1) due to their reliance on non-contiguous data, GNNs suffer from cache inefficiency on conventional GPUs, (2) dataflow architectures, due in part to their cache-less design, implicitly optimize the operators pertinent to GNNs, and (3) the RDU and GroqChip1 platforms enable significant inference speedups over a traditional GPU on pertinent subsets of end-to-end GNN networks. Our open-source code is available at https://github.com/ryienh/gnn-ops-benchmark.
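To make the cache-inefficiency point concrete, the following is a minimal PyTorch sketch (our own illustration, not code from the paper or its benchmark repository) of the scatter-style neighborhood aggregation at the heart of GNN message passing. The indexed gather over `edge_index` and the scatter-add into the output buffer are the kind of non-contiguous memory accesses that defeat GPU cache locality; all names here (`mean_aggregate`, the toy graph) are hypothetical.

```python
import torch

def mean_aggregate(x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """Mean-aggregate neighbor features, GraphSAGE-style (illustrative only).

    x          -- node features, shape (num_nodes, feat_dim)
    edge_index -- COO connectivity, shape (2, num_edges); row 0 = source, row 1 = target
    """
    src, dst = edge_index
    # Gather: one non-contiguous read of a row of x per edge.
    messages = x[src]
    # Scatter-add: non-contiguous writes into the per-node output buffer.
    out = torch.zeros_like(x)
    out.index_add_(0, dst, messages)
    # Normalize by in-degree (clamped to avoid division by zero for isolated nodes).
    deg = torch.zeros(x.size(0), dtype=x.dtype, device=x.device)
    deg.index_add_(0, dst, torch.ones_like(dst, dtype=x.dtype))
    return out / deg.clamp(min=1).unsqueeze(-1)

# Toy usage: 4 nodes with 8-dimensional features, 4 directed edges.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])
h = mean_aggregate(x, edge_index)
```

Because `src` and `dst` can address rows of `x` in arbitrary order, neither the gather nor the scatter enjoys the contiguous access patterns that GPU caches are built around; this is the access pattern the paper's operator-level benchmarks probe on each platform.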
Notes
- 1. GroqChip1 currently supports all operators in Table 1; the operators left untested are omitted due to the lack of multi-chip support, which renders benchmarking at scale impractical.
Funding
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hosseini, R. et al. (2023). Exploring the Use of Dataflow Architectures for Graph Neural Network Workloads. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science (R0)