
Exploring the Use of Dataflow Architectures for Graph Neural Network Workloads

  • Conference paper
  • First Online:
High Performance Computing (ISC High Performance 2023)

Abstract

Graph Neural Networks (GNNs), which learn representations of non-Euclidean data, are rapidly rising in popularity and are used in several computationally demanding scientific applications. As these deep learning models become more prevalent in practical applications, their performance during inference becomes increasingly critical. GNNs have been shown to suffer from hard memory and computational bottlenecks on traditional hardware platforms (i.e., GPUs), due in part to their reliance on non-contiguous data structures. While dataflow architectures used by emerging hardware accelerators provide a potential solution to alleviate these issues, end-to-end GNN models are generally not yet supported by these platforms. Thus, it is not currently possible to directly compare the performance of GNNs on traditional GPUs with these hardware accelerators. In this work, we analyze the performance of operators relevant to modern GNNs on three platforms: the NVIDIA A100 GPU, the Groq GroqChip1, and the SambaNova Reconfigurable Dataflow Unit (RDU). Specifically, we first profile several modern GNN models on traditional GPUs to determine the operators, fused kernels, and message passing layers most relevant to these architectures. Then, we systematically benchmark and analyze the performance of each of these levels of abstraction on each hardware platform. Our analysis shows that (1) due to their reliance on non-contiguous data, GNNs suffer from cache inefficiency on conventional GPUs, (2) dataflow architectures, due in part to their cache-less design, are able to implicitly optimize for operators pertinent to GNNs, and (3) the RDU and GroqChip1 platforms enable significant inference speedups compared to traditional GPUs on pertinent subsets of end-to-end GNN networks. Our open source code is available at https://github.com/ryienh/gnn-ops-benchmark.
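To make the operator-level benchmarking idea concrete, the sketch below times a scatter-style aggregation, the non-contiguous operator at the heart of GNN message passing, on a GPU with plain PyTorch. This is an illustrative sketch rather than the paper's actual harness (the linked repository contains the real benchmarks); the tensor sizes, the choice of index_add_ as the aggregation operator, and the iteration counts are assumptions made only for the example.

```python
import time
import torch

# Minimal sketch (illustrative assumption, not the paper's benchmark code):
# time a scatter-add aggregation, whose data-dependent, non-contiguous memory
# access pattern drives the GPU cache behaviour discussed in the abstract.
num_nodes, num_edges, hidden = 10_000, 200_000, 128
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

messages = torch.randn(num_edges, hidden, device=device)        # per-edge messages
dst = torch.randint(0, num_nodes, (num_edges,), device=device)  # destination node of each edge
out = torch.zeros(num_nodes, hidden, device=device)             # aggregated node features

for _ in range(10):                      # warm-up iterations
    out.index_add_(0, dst, messages)

if device.type == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    out.index_add_(0, dst, messages)
if device.type == "cuda":
    torch.cuda.synchronize()
print(f"mean scatter-add time: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms")
```

Synchronizing before reading the timer matters on CUDA because kernel launches are asynchronous; without it the loop would measure launch overhead rather than the aggregation kernel itself.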


Notes

  1. GroqChip1 currently supports all operators in Table 1; operators not tested were omitted due to the lack of multi-chip support, which renders benchmarking at scale impractical.


Funding

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Author information


Correspondence to Ryien Hosseini.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hosseini, R. et al. (2023). Exploring the Use of Dataflow Architectures for Graph Neural Network Workloads. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_48


  • DOI: https://doi.org/10.1007/978-3-031-40843-4_48

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40842-7

  • Online ISBN: 978-3-031-40843-4

  • eBook Packages: Computer Science, Computer Science (R0)
