ABSTRACT
Vectorization accelerates computation through data-level parallelism. It has been applied to graph processing, where the graph is traversed either in push style or in pull style. As it is not well understood which style will perform better, there is a need for both vectorized push-style and pull-style traversals. This paper is the first to present a general solution to vectorizing push-style traversal. It moreover presents an enhanced vectorized pull-style traversal.
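The push/pull distinction above can be illustrated with two scalar traversal loops over a CSR graph (a minimal sketch; the `Graph` type and function names are illustrative, not Graptor's API):

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR graph; names are illustrative, not Graptor's API.
struct Graph {
    std::vector<std::size_t> offsets; // per-vertex start index into edges
    std::vector<std::size_t> edges;   // concatenated neighbour lists
    std::size_t num_vertices() const { return offsets.size() - 1; }
};

// Push style: each vertex scatters its contribution to its out-neighbours.
// Concurrent pushes to the same destination race, so parallel or vectorized
// versions need atomics -- or a race-free partitioning of the edges.
void push_step(const Graph& out, const std::vector<double>& src,
               std::vector<double>& dst) {
    for (std::size_t v = 0; v < out.num_vertices(); ++v)
        for (std::size_t e = out.offsets[v]; e < out.offsets[v + 1]; ++e)
            dst[out.edges[e]] += src[v]; // scatter: write to neighbour
}

// Pull style: each vertex gathers from its in-neighbours. Each dst element
// is written only by its owning vertex, so there are no write races;
// the cost moves to gather-style reads instead.
void pull_step(const Graph& in, const std::vector<double>& src,
               std::vector<double>& dst) {
    for (std::size_t v = 0; v < in.num_vertices(); ++v)
        for (std::size_t e = in.offsets[v]; e < in.offsets[v + 1]; ++e)
            dst[v] += src[in.edges[e]]; // gather: read from neighbour
}
```

Running `push_step` over the out-edge CSR and `pull_step` over the in-edge CSR of the same graph computes the same result; the two styles differ in which memory accesses become scatters versus gathers when vectorized.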
Our solution consists of three components: CleanCut, a graph partitioning approach that rules out inter-thread race conditions; VectorFast, a compact graph representation that supports fast-forwarding through the edge stream; and Graptor, a domain-specific language and compiler for auto-vectorizing and optimizing graph processing codes.
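One way partitioning can rule out inter-thread races in push-style traversal is to give each thread a disjoint range of destination vertices, so no two threads ever write the same element. This is a simplified sketch of the idea only; CleanCut's actual partitioning criteria differ, and the `Edge` type and function names are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// A single directed edge in an edge-list representation.
struct Edge { std::size_t src, dst; };

// Each thread owns a disjoint contiguous range of destination vertices and
// applies only the edges whose destination falls in its range. No two
// threads write the same dst_val element, so no atomics are needed.
void partitioned_push(const std::vector<Edge>& edges,
                      const std::vector<double>& src_val,
                      std::vector<double>& dst_val,
                      std::size_t num_threads) {
    std::size_t n = dst_val.size();
    std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        std::size_t lo = t * chunk, hi = std::min(n, lo + chunk);
        workers.emplace_back([&, lo, hi] {
            for (const Edge& e : edges)
                if (e.dst >= lo && e.dst < hi)        // owned destination?
                    dst_val[e.dst] += src_val[e.src]; // race-free write
        });
    }
    for (auto& w : workers) w.join();
}
```

Each thread scans the full edge list here, which is wasteful; a practical scheme would pre-sort or bucket the edges per destination partition so each thread touches only its own edges.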
Experimental evaluation demonstrates average speedups of 2.72X over Ligra, 2.46X over GraphGrind, and 2.33X over GraphIt. Graptor outperforms Grazelle, which performs vectorized pull-style graph processing, by 4.05X.
- V. Agarwal, F. Petrini, D. Pasetto, and D. A. Bader. 2010. Scalable Graph Exploration on Multicore Processors. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10). IEEE Computer Society, Washington, DC, USA, 1--11.
- V. Balaji and B. Lucia. 2019. Combining Data Duplication and Graph Reordering to Accelerate Parallel Graph Processing. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '19). ACM, New York, NY, USA, 133--144.
- S. Beamer, K. Asanović, and D. Patterson. 2012. Direction-optimizing Breadth-first Search. In Proc. of the Intl. Conference on High Performance Computing, Networking, Storage and Analysis. 12:1--12:10.
- S. Beamer, K. Asanović, and D. Patterson. 2015. GAIL: The Graph Algorithm Iron Law. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3 '15). ACM, New York, NY, USA, Article 13, 4 pages.
- S. Beamer, K. Asanović, and D. Patterson. 2015. Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server. In 2015 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 56--65.
- M. Besta, F. Marending, E. Solomonik, and T. Hoefler. 2017. Slim-Sell: A Vectorizable Graph Representation for Breadth-First Search. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 32--41.
- M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. 2017. To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). ACM, New York, NY, USA, 93--104.
- G. E. Blelloch, J. T. Fineman, and J. Shun. 2012. Greedy Sequential Maximal Independent Set and Matching Are Parallel on Average. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). ACM, New York, NY, USA, 308--317.
- L. Chen, X. Huo, B. Ren, S. Jain, and G. Agrawal. 2015. Efficient and Simplified Parallel Graph Processing over CPU and MIC. In 2015 IEEE International Parallel and Distributed Processing Symposium. 819--828.
- T. Gao, Y. Lu, B. Zhang, and G. Suo. 2014. Using the Intel Many Integrated Core to Accelerate Graph Traversal. Int. J. High Perform. Comput. Appl. 28, 3 (Aug. 2014), 255--266.
- E. A. Golovina, A. S. Semenov, and A. S. Frolov. 2014. Performance Evaluation of Breadth-First Search on Intel Xeon Phi. Vychislitel'nye Metody i Programmirovanie 15, 1 (2014), 49--48.
- R. L. Graham. 1969. Bounds on Multiprocessing Timing Anomalies. SIAM J. Appl. Math. (1969), 416--429.
- O. Green, M. Dukhan, and R. Vuduc. 2015. Branch-Avoiding Graph Algorithms. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '15). ACM, New York, NY, USA, 212--223.
- S. Grossman, H. Litz, and C. Kozyrakis. 2018. Making Pull-based Graph Processing Performant. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 246--260.
- S. Hong, T. Oguntebi, and K. Olukotun. 2011. Efficient Parallel Graph Exploration on Multi-core CPU and GPU. In 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 78--88.
- Intel 2015. Intel Architecture Instruction Set Extensions Programming Reference. 319433-023.
- P. Jiang, L. Chen, and G. Agrawal. 2016. Reusing Data Reorganization for Efficient SIMD Parallelization of Adaptive Irregular Applications. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 16, 10 pages.
- U. Kang, C. E. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (Feb. 2011), 24 pages.
- M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. Bishop. 2014. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SIAM Journal on Scientific Computing 36, 5 (2014), C401--C423. https://doi.org/10.1137/130930352
- J. Lin, Q. Wu, Y. Tan, J. Yu, Q. Zhang, X. Li, and L. Luo. 2017. MicRun: A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 127--136.
- W. Liu and B. Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 339--350.
- A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. 2007. Challenges in Parallel Graph Processing. Parallel Processing Letters 17, 01 (2007), 5--20.
- J. Malicevic, B. Lepers, and W. Zwaenepoel. 2017. Everything You Always Wanted to Know About Multicore Graph Processing but Were Afraid to Ask. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 631--643. http://dl.acm.org/citation.cfm?id=3154690.3154750
- F. McSherry. 2005. A Uniform Approach to Accelerated PageRank Computation. In Proceedings of the 14th International Conference on World Wide Web (WWW '05). ACM, New York, NY, USA, 575--582.
- K. Meng, J. Li, G. Tan, and N. Sun. 2019. A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 201--213.
- L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
- M. Paredes, G. Riley, and M. Luján. 2016. Breadth First Search Vectorization on the Intel Xeon Phi. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 1--10.
- J.-S. Park, M. Penner, and V. K. Prasanna. 2004. Optimizing Graph Algorithms for Improved Cache Performance. IEEE Transactions on Parallel and Distributed Systems 15, 9 (Sep. 2004), 769--782.
- A. E. Sariyüce, E. Saulé, K. Kaya, and U. V. Çatalyürek. 2015. Regularizing Graph Centrality Computations. J. Parallel Distrib. Comput. 76, C (Feb. 2015), 106--119.
- E. Saulé and Ü. V. Çatalyürek. 2012. An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 1629--1639.
- J. Shun and G. E. Blelloch. 2013. Ligra: A Lightweight Graph Processing Framework for Shared Memory. In Proc. of the ACM Symp. on Principles and Practice of Parallel Programming. 135--146.
- A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu. 2016. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro 36, 2 (Mar 2016), 34--46.
- N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker. 2017. The ARM Scalable Vector Extension. IEEE Micro 37, 2 (Mar 2017), 26--39.
- J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2017. Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning. In 2017 46th International Conference on Parallel Processing (ICPP). 181--190.
- J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2017. GraphGrind: Addressing Load Imbalance of Graph Partitioning. In Proceedings of the International Conference on Supercomputing (ICS '17). ACM, New York, NY, USA, Article 16, 10 pages.
- J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2019. VEBO: A Vertex- and Edge-balanced Ordering Heuristic to Load Balance Parallel Graph Processing. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 391--392.
- K. Thomas. 2019. Using Cray Systems with Knights Landing Processors. https://www.nersc.gov/assets/Uploads/Using-KNL-Processors-Feb2019.pdf
- H. Wang, L. Geng, R. Lee, K. Hou, Y. Zhang, and X. Zhang. 2019. SEP-graph: Finding Shortest Execution Paths for Graph Processing Under a Hybrid Framework on GPU. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 38--52.
- B. Xie, J. Zhan, W. Liu, X. Gao, Z. Jia, X. He, and L. Zhang. 2018. CVR: Efficient Vectorization of SpMV on x86 Processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018). ACM, New York, NY, USA, 149--162.
- K. Zhang, R. Chen, and H. Chen. 2015. NUMA-aware Graph-structured Analytics. In Proc. of the ACM Symp. on Principles and Practice of Parallel Programming. 183--193.
- Y. Zhang, M. Yang, R. Baghdadi, S. Kamil, J. Shun, and S. Amarasinghe. 2018. GraphIt - A High-Performance DSL for Graph Analytics. eprint arXiv:1805.00923 (June 2018).