VGL: a high-performance graph processing framework for the NEC SX-Aurora TSUBASA vector architecture

Afanasyev, Ilya V.; Voevodin, Vladimir V.; Komatsu, Kazuhiko; Kobayashi, Hiroaki

doi:10.1007/s11227-020-03564-9

Published: 26 January 2021

Volume 77, pages 8694–8715, (2021)

The Journal of Supercomputing

Ilya V. Afanasyev ORCID: orcid.org/0000-0002-0202-1548¹,
Vladimir V. Voevodin²,
Kazuhiko Komatsu³ &
Hiroaki Kobayashi³

Abstract

Developing efficient implementations of graph algorithms is an important problem in modern computer science, since graphs are frequently used in real-world applications. Graph algorithms typically belong to the data-intensive class, so architectures with high-bandwidth memory can potentially solve many graph problems significantly faster than modern multicore CPUs. Among supercomputer architectures, vector systems, such as the SX family of NEC vector supercomputers, are equipped with high-bandwidth memory. However, the highly irregular structure of many real-world graphs makes it extremely challenging to implement graph algorithms on vector systems: such implementations are usually bulky and complicated, and they require a deep understanding of the hardware features of vector architectures. This paper presents the world's first attempt to develop a graph processing framework for modern vector systems that is both efficient and simple. Our vector graph library (VGL) framework targets NEC SX-Aurora TSUBASA as its primary vector architecture and provides relatively simple computational and data abstractions. These abstractions incorporate many vector-oriented optimization strategies into a high-level programming model, allowing new graph algorithms to be implemented quickly, with a small amount of code and minimal knowledge of vector system features. In this paper, we evaluate VGL performance on four widely used graph processing problems: breadth-first search, single-source shortest paths, connected components, and PageRank. The comparative performance analysis demonstrates that VGL-based implementations achieve significant acceleration over existing high-performance frameworks and libraries: up to a 14-fold speedup over multicore CPU implementations (Ligra, Galois, GAPBS) and up to a 3-fold speedup over NVIDIA GPU implementations (Gunrock, NVGRAPH).
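To make the idea of a "computational abstraction" more concrete, the following minimal sketch shows a level-synchronous breadth-first search written on top of a single frontier-expansion primitive. It is an illustration only and is not VGL's actual API: the names `CSRGraph`, `advance`, and `bfs_levels` are hypothetical, the compressed sparse row (CSR) layout is an assumed storage format, and the code is plain sequential Python with no vector-specific optimization.

```python
from typing import Callable, List, Tuple


class CSRGraph:
    """Hypothetical CSR graph: neighbors of v are col_idx[row_ptr[v]:row_ptr[v + 1]]."""

    def __init__(self, num_vertices: int, edges: List[Tuple[int, int]]):
        self.num_vertices = num_vertices
        # Count outgoing edges per vertex, then prefix-sum into row offsets.
        degree = [0] * num_vertices
        for src, _ in edges:
            degree[src] += 1
        self.row_ptr = [0] * (num_vertices + 1)
        for v in range(num_vertices):
            self.row_ptr[v + 1] = self.row_ptr[v] + degree[v]
        # Scatter destinations into their source vertex's slice.
        self.col_idx = [0] * len(edges)
        cursor = list(self.row_ptr[:-1])
        for src, dst in edges:
            self.col_idx[cursor[src]] = dst
            cursor[src] += 1


def advance(graph: CSRGraph, frontier: List[int],
            edge_op: Callable[[int, int], bool]) -> List[int]:
    """Hypothetical primitive: apply edge_op to every edge leaving the frontier.

    Destinations for which edge_op returns True form the next frontier.
    """
    next_frontier = []
    for src in frontier:
        for e in range(graph.row_ptr[src], graph.row_ptr[src + 1]):
            dst = graph.col_idx[e]
            if edge_op(src, dst):
                next_frontier.append(dst)
    return next_frontier


def bfs_levels(graph: CSRGraph, source: int) -> List[int]:
    """Return the BFS level of every vertex (-1 if unreachable from source)."""
    levels = [-1] * graph.num_vertices
    levels[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1

        def visit(src: int, dst: int, depth: int = depth) -> bool:
            # Claim an unvisited vertex for the current level.
            if levels[dst] == -1:
                levels[dst] = depth
                return True
            return False

        frontier = advance(graph, frontier, visit)
    return levels
```

In this sketch the algorithm itself is only the small `visit` callback plus a loop; the traversal of the graph structure is confined to `advance`. A framework of the kind the abstract describes could, under that assumption, place its architecture-specific optimizations inside such a primitive while leaving user code unchanged.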

Acknowledgements

The results described in Section 5 were obtained at Lomonosov Moscow State University with the financial support of the Russian Science Foundation (Agreement N 20-11-20194). The reported study was funded by RFBR, Project Number 19-37-90002.

Author information

Authors and Affiliations

Moscow Center of Fundamental and Applied Mathematics, Moscow, Russia, 119991
Ilya V. Afanasyev
Research Computing Center of Moscow State University, Moscow, Russia, 119234
Vladimir V. Voevodin
Tohoku University, Sendai, Miyagi, 980-8579, Japan
Kazuhiko Komatsu & Hiroaki Kobayashi

Corresponding author

Correspondence to Ilya V. Afanasyev.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Afanasyev, I.V., Voevodin, V.V., Komatsu, K. et al. VGL: a high-performance graph processing framework for the NEC SX-Aurora TSUBASA vector architecture. J Supercomput 77, 8694–8715 (2021). https://doi.org/10.1007/s11227-020-03564-9

Accepted: 10 December 2020
Published: 26 January 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s11227-020-03564-9
