ABSTRACT
Vectorization accelerates computation through data-level parallelism. It has been applied to graph processing, where the graph is traversed either in push style or in pull style. As it is not well understood which style will perform better, there is a need for both vectorized push-style and pull-style traversals. This paper is the first to present a general solution to vectorizing push-style traversal. It moreover presents an enhanced vectorized pull-style traversal.
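The push/pull distinction above can be illustrated with two scalar traversal loops over a CSR graph (a minimal sketch; the `Graph` type and function names are illustrative, not Graptor's API):

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR graph; names are illustrative, not Graptor's API.
struct Graph {
    std::vector<std::size_t> offsets; // per-vertex start index into edges
    std::vector<std::size_t> edges;   // concatenated neighbour lists
    std::size_t num_vertices() const { return offsets.size() - 1; }
};

// Push style: each vertex scatters its contribution to its out-neighbours.
// Concurrent pushes to the same destination race, so parallel or vectorized
// versions need atomics -- or a race-free partitioning of the edges.
void push_step(const Graph& out, const std::vector<double>& src,
               std::vector<double>& dst) {
    for (std::size_t v = 0; v < out.num_vertices(); ++v)
        for (std::size_t e = out.offsets[v]; e < out.offsets[v + 1]; ++e)
            dst[out.edges[e]] += src[v]; // scatter: write to neighbour
}

// Pull style: each vertex gathers from its in-neighbours. Each dst element
// is written only by its owning vertex, so there are no write races;
// the cost moves to gather-style reads instead.
void pull_step(const Graph& in, const std::vector<double>& src,
               std::vector<double>& dst) {
    for (std::size_t v = 0; v < in.num_vertices(); ++v)
        for (std::size_t e = in.offsets[v]; e < in.offsets[v + 1]; ++e)
            dst[v] += src[in.edges[e]]; // gather: read from neighbour
}
```

Running `push_step` over the out-edge CSR and `pull_step` over the in-edge CSR of the same graph computes the same result; the two styles differ in which memory accesses become scatters versus gathers when vectorized.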
Our solution consists of three components: CleanCut, a graph partitioning approach that rules out inter-thread race conditions; VectorFast, a compact graph representation that supports fast-forwarding through the edge stream; and Graptor, a domain-specific language and compiler for auto-vectorizing and optimizing graph processing codes.
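One way partitioning can rule out inter-thread races in push-style traversal is to give each thread a disjoint range of destination vertices, so no two threads ever write the same element. This is a simplified sketch of the idea only; CleanCut's actual partitioning criteria differ, and the `Edge` type and function names are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// A single directed edge in an edge-list representation.
struct Edge { std::size_t src, dst; };

// Each thread owns a disjoint contiguous range of destination vertices and
// applies only the edges whose destination falls in its range. No two
// threads write the same dst_val element, so no atomics are needed.
void partitioned_push(const std::vector<Edge>& edges,
                      const std::vector<double>& src_val,
                      std::vector<double>& dst_val,
                      std::size_t num_threads) {
    std::size_t n = dst_val.size();
    std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        std::size_t lo = t * chunk, hi = std::min(n, lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (const Edge& e : edges)
                if (e.dst >= lo && e.dst < hi)        // owned destination?
                    dst_val[e.dst] += src_val[e.src]; // race-free write
        });
    }
    for (auto& w : workers) w.join();
}
```

Each thread scans the full edge list here, which is wasteful; a practical scheme would pre-sort or bucket the edges per destination partition so each thread touches only its own edges.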
Experimental evaluation demonstrates average speedups of 2.72X over Ligra, 2.46X over GraphGrind, and 2.33X over GraphIt. Graptor outperforms Grazelle, which performs vectorized pull-style graph processing, by 4.05X.
- V. Agarwal, F. Petrini, D. Pasetto, and D. A. Bader. 2010. Scalable Graph Exploration on Multicore Processors. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10). IEEE Computer Society, Washington, DC, USA, 1--11.
- V. Balaji and B. Lucia. 2019. Combining Data Duplication and Graph Reordering to Accelerate Parallel Graph Processing. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '19). ACM, New York, NY, USA, 133--144.
- S. Beamer, K. Asanović, and D. Patterson. 2012. Direction-optimizing Breadth-first Search. In Proc. of the Intl. Conference on High Performance Computing, Networking, Storage and Analysis. 12:1--12:10.
- S. Beamer, K. Asanović, and D. Patterson. 2015. GAIL: The Graph Algorithm Iron Law. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3 '15). ACM, New York, NY, USA, Article 13, 4 pages.
- S. Beamer, K. Asanović, and D. Patterson. 2015. Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server. In 2015 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 56--65.
- M. Besta, F. Marending, E. Solomonik, and T. Hoefler. 2017. Slim-Sell: A Vectorizable Graph Representation for Breadth-First Search. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 32--41.
- M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. 2017. To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). ACM, New York, NY, USA, 93--104.
- G. E. Blelloch, J. T. Fineman, and J. Shun. 2012. Greedy Sequential Maximal Independent Set and Matching Are Parallel on Average. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). ACM, New York, NY, USA, 308--317.
- L. Chen, X. Huo, B. Ren, S. Jain, and G. Agrawal. 2015. Efficient and Simplified Parallel Graph Processing over CPU and MIC. In 2015 IEEE International Parallel and Distributed Processing Symposium. 819--828.
- T. Gao, Y. Lu, B. Zhang, and G. Suo. 2014. Using the Intel Many Integrated Core to Accelerate Graph Traversal. Int. J. High Perform. Comput. Appl. 28, 3 (Aug. 2014), 255--266.
- E. A. Golovina, A. S. Semenov, and A. S. Frolov. 2014. Performance Evaluation of Breadth-First Search on Intel Xeon Phi. Vychislitel'nye Metody i Programmirovanie 15, 1 (2014), 49--48.
- R. L. Graham. 1969. Bounds on Multiprocessing Timing Anomalies. SIAM J. Appl. Math. (1969), 416--429.
- O. Green, M. Dukhan, and R. Vuduc. 2015. Branch-Avoiding Graph Algorithms. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '15). ACM, New York, NY, USA, 212--223.
- S. Grossman, H. Litz, and C. Kozyrakis. 2018. Making Pull-based Graph Processing Performant. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 246--260.
- S. Hong, T. Oguntebi, and K. Olukotun. 2011. Efficient Parallel Graph Exploration on Multi-core CPU and GPU. In 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 78--88.
- Intel 2015. Intel Architecture Instruction Set Extensions Programming Reference. 319433-023.
- P. Jiang, L. Chen, and G. Agrawal. 2016. Reusing Data Reorganization for Efficient SIMD Parallelization of Adaptive Irregular Applications. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 16, 10 pages.
- U. Kang, C. E. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (Feb. 2011), 24 pages.
- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. Bishop. 2014. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SIAM Journal on Scientific Computing 36, 5 (2014), C401--C423. https://doi.org/10.1137/130930352
- J. Lin, Q. Wu, Y. Tan, J. Yu, Q. Zhang, X. Li, and L. Luo. 2017. MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 127--136.
- W. Liu and B. Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 339--350.
- A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. 2007. Challenges in Parallel Graph Processing. Parallel Processing Letters 17, 01 (2007), 5--20.
- J. Malicevic, B. Lepers, and W. Zwaenepoel. 2017. Everything You Always Wanted to Know About Multicore Graph Processing but Were Afraid to Ask. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 631--643. http://dl.acm.org/citation.cfm?id=3154690.3154750
- F. McSherry. 2005. A Uniform Approach to Accelerated PageRank Computation. In Proceedings of the 14th International Conference on World Wide Web (WWW '05). ACM, New York, NY, USA, 575--582.
- K. Meng, J. Li, G. Tan, and N. Sun. 2019. A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 201--213.
- L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
- M. Paredes, G. Riley, and M. Luján. 2016. Breadth First Search Vectorization on the Intel Xeon Phi. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 1--10.
- J.-S. Park, M. Penner, and V. K. Prasanna. 2004. Optimizing Graph Algorithms for Improved Cache Performance. IEEE Transactions on Parallel and Distributed Systems 15, 9 (Sep. 2004), 769--782.
- A. E. Sariyüce, E. Saulé, K. Kaya, and U. V. Çatalyürek. 2015. Regularizing Graph Centrality Computations. J. Parallel Distrib. Comput. 76, C (Feb. 2015), 106--119.
- E. Saulé and Ü. V. Çatalyürek. 2012. An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 1629--1639.
- J. Shun and G. E. Blelloch. 2013. Ligra: A Lightweight Graph Processing Framework for Shared Memory. In Proc. of the ACM Symp. on Principles and Practice of Parallel Programming. 135--146.
- A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu. 2016. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro 36, 2 (Mar 2016), 34--46.
- N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker. 2017. The ARM Scalable Vector Extension. IEEE Micro 37, 2 (Mar 2017), 26--39.
- J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2017. Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning. In 2017 46th International Conference on Parallel Processing (ICPP). 181--190.
- J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2017. GraphGrind: Addressing Load Imbalance of Graph Partitioning. In Proceedings of the International Conference on Supercomputing (ICS '17). ACM, New York, NY, USA, Article 16, 10 pages.
- J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2019. VEBO: A Vertex- and Edge-balanced Ordering Heuristic to Load Balance Parallel Graph Processing. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 391--392.
- K. Thomas. 2019. Using Cray Systems with Knights Landing Processors. https://www.nersc.gov/assets/Uploads/Using-KNL-Processors-Feb2019.pdf
- H. Wang, L. Geng, R. Lee, K. Hou, Y. Zhang, and X. Zhang. 2019. SEP-graph: Finding Shortest Execution Paths for Graph Processing Under a Hybrid Framework on GPU. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 38--52.
- B. Xie, J. Zhan, W. Liu, X. Gao, Z. Jia, X. He, and L. Zhang. 2018. CVR: Efficient Vectorization of SpMV on x86 Processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018). ACM, New York, NY, USA, 149--162.
- K. Zhang, R. Chen, and H. Chen. 2015. NUMA-aware Graph-structured Analytics. In Proc. of the ACM Symp. on Principles and Practice of Parallel Programming. 183--193.
- Y. Zhang, M. Yang, R. Baghdadi, S. Kamil, J. Shun, and S. Amarasinghe. 2018. GraphIt - A High-Performance DSL for Graph Analytics. eprint arXiv:1805.00923 (June 2018).