DOI: 10.1145/3307650.3322254
Research article

SCU: a GPU stream compaction unit for graph processing

Published: 22 June 2019

ABSTRACT

Graph processing algorithms are key to many emerging applications in areas such as machine learning and data analytics. Although the processing of large-scale graphs exhibits a high degree of parallelism, the memory access patterns tend to be highly irregular, leading to poor GPGPU efficiency due to memory divergence. To ameliorate this issue, GPGPU applications perform a stream compaction operation in each iteration of the algorithm to extract the subset of active nodes/edges, so that subsequent steps operate on a compacted dataset.
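To make the operation concrete, the following is a minimal functional sketch of stream compaction: given a flag per element marking it active or inactive, an exclusive prefix sum over the flags assigns each active element its slot in a dense output array. This is the classic GPU formulation (written serially here for clarity); the function name `compact` is illustrative, not an API from the paper.

```cpp
#include <cstddef>
#include <vector>

// Stream compaction: gather the indices of active elements into a dense
// array. An exclusive prefix sum of the flags gives each active element
// its destination slot; a scatter then writes the compacted output.
std::vector<std::size_t> compact(const std::vector<int>& flags) {
    std::vector<std::size_t> pos(flags.size() + 1, 0);
    for (std::size_t i = 0; i < flags.size(); ++i)
        pos[i + 1] = pos[i] + (flags[i] ? 1 : 0);   // exclusive prefix sum

    std::vector<std::size_t> out(pos.back());
    for (std::size_t i = 0; i < flags.size(); ++i)
        if (flags[i]) out[pos[i]] = i;              // scatter to computed slot
    return out;
}
```

Subsequent kernels then iterate over the dense `out` array instead of scanning the full, sparsely active dataset, which is what restores memory coalescing.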

We show that GPGPU architectures are inefficient for stream compaction and propose to offload this task to a programmable Stream Compaction Unit (SCU) tailored to the requirements of this kernel. The SCU is a small unit, tightly integrated into the GPU, that efficiently gathers the active nodes/edges into a compacted array in memory. Applications can use it through a simple API. The remaining steps of the graph-based algorithm are executed on the GPU cores, benefiting from the GPU's large amount of parallelism, but they operate on the SCU-prepared data and thus achieve better memory coalescing and, hence, much higher efficiency. In addition, the SCU filters out repeated and already-visited nodes during the compaction process, significantly reducing the GPGPU workload, and writes the compacted nodes/edges in an order that reduces memory divergence and improves coalescing.
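The filtering step can be sketched functionally as follows: while gathering the next frontier, drop any node that is duplicated within the candidate list or that has already been visited, so the GPU kernels never process it. This is a behavioral sketch only; the names (`build_frontier`, `visited`) are hypothetical and do not correspond to the paper's actual SCU API or hardware implementation.

```cpp
#include <vector>

// Sketch of compaction with filtering: candidates may contain duplicates
// and already-visited nodes; both are dropped while building the frontier.
// The frontier is written densely, which favors coalesced GPU accesses.
std::vector<int> build_frontier(const std::vector<int>& candidates,
                                std::vector<bool>& visited) {
    std::vector<int> frontier;
    for (int v : candidates) {
        if (visited[v]) continue;   // skip visited nodes and repeats
        visited[v] = true;
        frontier.push_back(v);
    }
    return frontier;
}
```

Because filtered nodes never reach the compacted array, the GPU kernels that consume it do strictly less work than with an unfiltered frontier.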

We evaluate the performance of a state-of-the-art GPGPU architecture extended with our SCU for a wide variety of applications. Results show that, for high-performance and low-power GPU systems respectively, the SCU achieves speedups of 1.37x and 2.32x and energy savings of 84.7% and 69%, with an area increase of only 3.3% and 4.1%.


Published in

ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
June 2019, 849 pages
ISBN: 9781450366694
DOI: 10.1145/3307650
Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ISCA '19 paper acceptance rate: 62 of 365 submissions (17%). Overall acceptance rate: 543 of 3,203 submissions (17%).
