DOI: 10.1145/3392717.3392753
research-article

Graptor: efficient pull and push style vectorized graph processing

Published: 29 June 2020

ABSTRACT

Vectorization seeks to accelerate computation through data-level parallelism. It has been applied to graph processing, where the graph is traversed in either a push style or a pull style. Because it is not well understood which style will perform better, both vectorized push and vectorized pull style traversals are needed. This paper is the first to present a general solution for vectorizing push style traversal. Moreover, it presents an enhanced vectorized pull style traversal.
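To make the push/pull distinction concrete, here is a minimal Python sketch (not Graptor's code) of one PageRank-like accumulation step, in which every vertex sums the contributions of its in-neighbours. The helpers `compress`, `push_step` and `pull_step`, the toy graph, and the contribution values are hypothetical. Push iterates over sources in a CSR layout and scatters updates to destinations; pull iterates over destinations in a CSC layout and gathers from sources. Both compute the same result, but only push writes to locations that several sources share.

```python
from typing import List, Tuple

Edge = Tuple[int, int]                    # (source, destination)
Compressed = Tuple[List[int], List[int]]  # (offsets, neighbours)


def compress(n: int, edges: List[Edge]) -> Compressed:
    """Group edges by their first endpoint (CSR layout for (src, dst) pairs)."""
    offsets = [0] * (n + 1)
    for u, _ in edges:
        offsets[u + 1] += 1
    for v in range(n):                    # prefix sum turns counts into start offsets
        offsets[v + 1] += offsets[v]
    neighbours = [0] * len(edges)
    fill = offsets[:n]                    # slicing copies: next free slot per vertex
    for u, w in edges:
        neighbours[fill[u]] = w
        fill[u] += 1
    return offsets, neighbours


def push_step(n: int, csr: Compressed, contrib: List[float]) -> List[float]:
    """Push: each source scatters its contribution to its out-neighbours."""
    offsets, targets = csr
    acc = [0.0] * n
    for src in range(n):
        for i in range(offsets[src], offsets[src + 1]):
            # Scattered write: if sources were split across threads or SIMD lanes,
            # two of them could update the same acc[dst] at the same time.
            acc[targets[i]] += contrib[src]
    return acc


def pull_step(n: int, csc: Compressed, contrib: List[float]) -> List[float]:
    """Pull: each destination gathers contributions from its in-neighbours."""
    offsets, sources = csc
    acc = [0.0] * n
    for dst in range(n):
        total = 0.0
        for i in range(offsets[dst], offsets[dst + 1]):
            total += contrib[sources[i]]  # reads only; acc[dst] has a single writer
        acc[dst] = total
    return acc


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2), (3, 0)]
n = 4
contrib = [1.0, 2.0, 3.0, 4.0]                  # e.g. rank[v] / out_degree[v]
csr = compress(n, edges)                        # out-edges per source
csc = compress(n, [(d, s) for s, d in edges])   # in-edges per destination
print(push_step(n, csr, contrib))  # [7.0, 1.0, 7.0, 0.0]
print(pull_step(n, csc, contrib))  # [7.0, 1.0, 7.0, 0.0]
```

The scattered write in `push_step` is the crux: in a vectorized version, two lanes handling different sources may target the same destination, whereas pull gives every destination exactly one writer.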

Our solution consists of three components: CleanCut, a graph partitioning approach that rules out inter-thread race conditions; VectorFast, a compact graph representation that supports fast-forwarding through the edge stream; and Graptor, a domain-specific language and compiler for auto-vectorizing and optimizing graph processing codes.
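The abstract does not describe how CleanCut partitions the graph. The sketch below only illustrates the property it claims, under the assumption that partitions are contiguous destination ranges: if every edge is assigned to the partition that owns its destination, a push-style pass over one partition writes only to that partition's vertices, so partitions can be processed by different threads without atomics. The helpers `destination_bounds`, `split_edges` and `push_partition`, and the greedy rule of balancing in-edge counts per range, are hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Edge = Tuple[int, int]  # (source, destination)


def destination_bounds(n: int, edges: List[Edge], parts: int) -> List[int]:
    """Cut vertices 0..n-1 into `parts` contiguous destination ranges holding
    roughly equal numbers of in-edges. Returns parts + 1 boundaries."""
    indeg = [0] * n
    for _, d in edges:
        indeg[d] += 1
    share = len(edges) / parts
    bounds = [0]
    seen = 0
    for v in range(n):
        seen += indeg[v]
        # Close the current range once it holds its share of the in-edges.
        if len(bounds) < parts and seen >= share * len(bounds):
            bounds.append(v + 1)
    while len(bounds) < parts + 1:  # final boundary, plus empty ranges if needed
        bounds.append(n)
    return bounds


def split_edges(edges: List[Edge], bounds: List[int]) -> List[List[Edge]]:
    """Assign each edge to the partition whose range contains its destination."""
    partitions: List[List[Edge]] = [[] for _ in range(len(bounds) - 1)]
    for s, d in edges:
        partitions[bisect_right(bounds, d) - 1].append((s, d))
    return partitions


def push_partition(part: List[Edge], contrib: List[float], acc: List[float]) -> None:
    """Push-style pass over one partition. Every destination it touches lies in
    this partition's own range, so no other partition writes the same acc slot."""
    for s, d in part:
        acc[d] += contrib[s]


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2), (3, 0)]
n, contrib = 4, [1.0, 2.0, 3.0, 4.0]
bounds = destination_bounds(n, edges, parts=2)  # [0, 2, 4]
partitions = split_edges(edges, bounds)         # 3 edges in each partition
acc = [0.0] * n
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    futures = [pool.submit(push_partition, p, contrib, acc) for p in partitions]
    for f in futures:
        f.result()                              # re-raise any worker exception
print(acc)  # [7.0, 1.0, 7.0, 0.0]
```

Balancing by in-edge count rather than vertex count is only one simple choice; it keeps per-thread work similar on skewed graphs, and any rule that yields disjoint destination ranges preserves the race-freedom property.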

Experimental evaluation demonstrates average speedups of 2.72X over Ligra, 2.46X over GraphGrind, and 2.33X over GraphIt. Graptor also outperforms Grazelle, which performs vectorized pull style graph processing, by 4.05X.

References

  1. V. Agarwal, F. Petrini, D. Pasetto, and D. A. Bader. 2010. Scalable Graph Exploration on Multicore Processors. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10). IEEE Computer Society, Washington, DC, USA, 1--11.
  2. V. Balaji and B. Lucia. 2019. Combining Data Duplication and Graph Reordering to Accelerate Parallel Graph Processing. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '19). ACM, New York, NY, USA, 133--144.
  3. S. Beamer, K. Asanović, and D. Patterson. 2012. Direction-optimizing Breadth-first Search. In Proc. of the Intl. Conference on High Performance Computing, Networking, Storage and Analysis. 12:1--12:10.
  4. S. Beamer, K. Asanović, and D. Patterson. 2015. GRAIL: The Graph Algorithm Iron Law. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3 '15). ACM, New York, NY, USA, Article 13, 4 pages.
  5. S. Beamer, K. Asanović, and D. Patterson. 2015. Locality exists in graph processing: Workload characterization on an Ivy Bridge server. In Workload Characterization (IISWC), 2015 IEEE International Symposium on. IEEE, 56--65.
  6. M. Besta, F. Marending, E. Solomonik, and T. Hoefler. 2017. Slim-Sell: A Vectorizable Graph Representation for Breadth-First Search. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 32--41.
  7. M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler. 2017. To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '17). ACM, New York, NY, USA, 93--104.
  8. G. E. Blelloch, J. T. Fineman, and J. Shun. 2012. Greedy Sequential Maximal Independent Set and Matching Are Parallel on Average. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). ACM, New York, NY, USA, 308--317.
  9. L. Chen, X. Huo, B. Ren, S. Jain, and G. Agrawal. 2015. Efficient and Simplified Parallel Graph Processing over CPU and MIC. In 2015 IEEE International Parallel and Distributed Processing Symposium. 819--828.
  10. T. Gao, Y. Lu, B. Zhang, and G. Suo. 2014. Using the Intel Many Integrated Core to Accelerate Graph Traversal. Int. J. High Perform. Comput. Appl. 28, 3 (Aug. 2014), 255--266.
  11. E. A. Golovina, A. S. Semenov, and A. S. Frolov. 2014. Performance Evaluation of Breadth-First Search on Intel Xeon Phi. Vychislitel'nye Metody i Programmirovanie 15, 1 (2014), 49--48.
  12. R. L. Graham. 1969. Bounds on Multiprocessing Timing Anomalies. SIAM J. Appl. Math. (1969), 416--429.
  13. O. Green, M. Dukhan, and R. Vuduc. 2015. Branch-Avoiding Graph Algorithms. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '15). ACM, New York, NY, USA, 212--223.
  14. S. Grossman, H. Litz, and C. Kozyrakis. 2018. Making Pull-based Graph Processing Performant. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18). ACM, New York, NY, USA, 246--260.
  15. S. Hong, T. Oguntebi, and K. Olukotun. 2011. Efficient parallel graph exploration on multi-core CPU and GPU. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on. IEEE, 78--88.
  16. Intel 2015. Intel Architecture Instruction Set Extensions Programming Reference. 319433--023.
  17. P. Jiang, L. Chen, and G. Agrawal. 2016. Reusing Data Reorganization for Efficient SIMD Parallelization of Adaptive Irregular Applications. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). ACM, New York, NY, USA, Article 16, 10 pages.
  18. U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (Feb. 2011), 24 pages.
  19. M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. Bishop. 2014. A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SIAM Journal on Scientific Computing 36, 5 (2014), C401--C423. https://doi.org/10.1137/130930352
  20. J. Lin, Q. Wu, Y. Tan, J. Yu, Q. Zhang, X. Li, and L. Luo. 2017. MicRun: A framework for scale-free graph algorithms on SIMD architecture of the Xeon Phi. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 127--136.
  21. W. Liu and B. Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 339--350.
  22. A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. 2007. Challenges in parallel graph processing. Parallel Processing Letters 17, 01 (2007), 5--20.
  23. J. Malicevic, B. Lepers, and W. Zwaenepoel. 2017. Everything You Always Wanted to Know About Multicore Graph Processing but Were Afraid to Ask. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 631--643. http://dl.acm.org/citation.cfm?id=3154690.3154750
  24. F. McSherry. 2005. A Uniform Approach to Accelerated PageRank Computation. In Proceedings of the 14th International Conference on World Wide Web (WWW '05). ACM, New York, NY, USA, 575--582.
  25. K. Meng, J. Li, G. Tan, and N. Sun. 2019. A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 201--213.
  26. L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
  27. M. Paredes, G. Riley, and M. Luján. 2016. Breadth First Search Vectorization on the Intel Xeon Phi. In Proceedings of the ACM International Conference on Computing Frontiers (CF '16). ACM, New York, NY, USA, 1--10.
  28. J.-S. Park, M. Penner, and V. K. Prasanna. 2004. Optimizing graph algorithms for improved cache performance. IEEE Transactions on Parallel and Distributed Systems 15, 9 (Sep. 2004), 769--782.
  29. A. E. Sariyüce, E. Saulé, K. Kaya, and U. V. Çatalyürek. 2015. Regularizing Graph Centrality Computations. J. Parallel Distrib. Comput. 76, C (Feb. 2015), 106--119.
  30. E. Saulé and Ü. V. Çatalyürek. 2012. An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops PhD Forum. 1629--1639.
  31. J. Shun and G. E. Blelloch. 2013. Ligra: A Lightweight Graph Processing Framework for Shared Memory. In Proc of ACM Symp. on Principles and Practice of Parallel Programming. 135--146.
  32. A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu. 2016. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro 36, 2 (Mar 2016), 34--46.
  33. N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker. 2017. The ARM Scalable Vector Extension. IEEE Micro 37, 2 (Mar 2017), 26--39.
  34. J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2017. Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning. In 2017 46th International Conference on Parallel Processing (ICPP). 181--190.
  35. J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2017. GraphGrind: Addressing Load Imbalance of Graph Partitioning. In Proceedings of the International Conference on Supercomputing (ICS '17). ACM, New York, NY, USA, Article 16, 10 pages.
  36. J. Sun, H. Vandierendonck, and D. S. Nikolopoulos. 2019. VEBO: A Vertex- and Edge-balanced Ordering Heuristic to Load Balance Parallel Graph Processing. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 391--392.
  37. K. Thomas. 2019. Using Cray Systems with Knights Landing Processors. https://www.nersc.gov/assets/Uploads/Using-KNL-Processors-Feb2019.pdf.
  38. H. Wang, L. Geng, R. Lee, K. Hou, Y. Zhang, and X. Zhang. 2019. SEP-graph: Finding Shortest Execution Paths for Graph Processing Under a Hybrid Framework on GPU. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (PPoPP '19). ACM, New York, NY, USA, 38--52.
  39. B. Xie, J. Zhan, W. Liu, X. Gao, Z. Jia, X. He, and L. Zhang. 2018. CVR: Efficient Vectorization of SpMV on x86 Processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018). ACM, New York, NY, USA, 149--162.
  40. K. Zhang, R. Chen, and H. Chen. 2015. NUMA-aware graph-structured analytics. In Proc. of ACM Symp. on Principles and Practice of Parallel Programming. 183--193.
  41. Y. Zhang, M. Yang, R. Baghdadi, S. Kamil, J. Shun, and S. Amarasinghe. 2018. GraphIt - A High-Performance DSL for Graph Analytics. eprint arXiv:1805.00923 (June 2018).

Published in ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, June 2020, 499 pages. ISBN: 9781450379830. DOI: 10.1145/3392717

General Chairs: Eduard Ayguadé, Wen-mei Hwu
Program Chairs: Rosa M. Badia, H. Peter Hofstee

          Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 584 of 2,055 submissions, 28%
