ABSTRACT
Graph processing algorithms are key to many emerging applications in areas such as machine learning and data analytics. Although the processing of large-scale graphs exhibits a high degree of parallelism, the memory access patterns tend to be highly irregular, leading to poor GPGPU efficiency due to memory divergence. To ameliorate this issue, GPGPU applications perform a stream compaction operation in each iteration of the algorithm to extract the subset of active nodes/edges, so that subsequent steps operate on a compacted dataset.
We show that GPGPU architectures are inefficient for stream compaction and propose to offload this task to a programmable Stream Compaction Unit (SCU) tailored to the requirements of this kernel. The SCU is a small unit tightly integrated into the GPU that efficiently gathers the active nodes/edges into a compacted array in memory. Applications can make use of it through a simple API. The remaining steps of the graph-based algorithm are executed on the GPU cores, benefiting from the GPU's large amount of parallelism, but they operate on the SCU-prepared data and hence achieve better memory coalescing and much higher efficiency. In addition, the SCU filters out repeated and already-visited nodes during the compaction process, significantly reducing the GPGPU workload, and writes the compacted nodes/edges in an order that reduces memory divergence and further improves coalescing.
We evaluate the performance of a state-of-the-art GPGPU architecture extended with our SCU for a wide variety of applications. Results show that, for high-performance and low-power GPU systems, the SCU achieves speedups of 1.37x and 2.32x, energy savings of 84.7% and 69%, and an area increase of 3.3% and 4.1%, respectively.
Index Terms
- SCU: a GPU stream compaction unit for graph processing