ABSTRACT
This paper uses betweenness centrality as a case study of efficient work stealing in a heterogeneous system. Betweenness centrality is an important graph-processing algorithm. It exposes multiple levels of parallelism, which makes it a useful target for a range of optimizations. We investigate queue-based work stealing to distribute its tasks both across GPU compute units (CUs) and between the CPU and the GPU, which has not been done by prior work. In particular, we demonstrate how to use the new platform-atomic operations on AMD Accelerated Processing Units (APUs) to operate cross-device queues in shared virtual memory in a lock-free manner. To make the work-stealing runtime and the application more efficient, we apply new architectural features, including atomic operations with different memory scopes and orderings for different synchronization scenarios. We implement our solution using the Heterogeneous System Architecture (HSA). Our results show that betweenness centrality with CPU-GPU work stealing achieves an average performance improvement of 15% (up to 30%) over GPU-only execution across diverse graph inputs. Our work-stealing solution is also applicable to other applications. Finally, we analyze the parameters that are critical for queuing and stealing.
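The queue-based stealing described above can be illustrated with a small host-only sketch. It is not the paper's HSA implementation: `TaskQueue`, `claim`, and `run_work_stealing` are hypothetical names, CPU threads stand in for the CPU and GPU agents, and `std::atomic` stands in for platform atomics (standard C++ has no memory scopes, so that part of the design is not represented). Each worker owns a queue holding a contiguous range of task indices, for example source vertices; owners and thieves claim chunks with the same compare-and-swap on the queue head, so no lock is needed.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical queue over the task indices [head, tail).
// tail is fixed before the workers start; only head changes afterwards.
struct TaskQueue {
    std::atomic<std::size_t> head{0};
    std::size_t tail = 0;

    // Lock-free claim of up to `chunk` tasks. Used by the owner and by thieves.
    bool claim(std::size_t chunk, std::size_t& first, std::size_t& count) {
        std::size_t h = head.load(std::memory_order_relaxed);
        while (h < tail) {
            const std::size_t n = std::min(chunk, tail - h);
            // acq_rel is a conservative choice for this sketch; on failure
            // compare_exchange_weak reloads h and the loop retries.
            if (head.compare_exchange_weak(h, h + n,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed)) {
                first = h;
                count = n;
                return true;
            }
        }
        return false;  // queue is empty
    }
};

// Runs num_tasks dummy tasks on num_workers threads and returns how many
// times each task was executed (every entry should be 1).
std::vector<int> run_work_stealing(std::size_t num_tasks,
                                   std::size_t num_workers,
                                   std::size_t chunk) {
    std::vector<int> result(num_tasks, 0);
    if (num_workers == 0) return result;
    if (chunk == 0) chunk = 1;

    std::vector<std::atomic<int>> executed(num_tasks);
    for (auto& e : executed) e.store(0, std::memory_order_relaxed);

    // Give each worker a contiguous slice of the task range.
    std::vector<TaskQueue> queues(num_workers);
    for (std::size_t w = 0; w < num_workers; ++w) {
        queues[w].head.store(w * num_tasks / num_workers,
                             std::memory_order_relaxed);
        queues[w].tail = (w + 1) * num_tasks / num_workers;
    }

    auto worker = [&](std::size_t self) {
        // Offset 0 drains the worker's own queue; later offsets steal.
        for (std::size_t off = 0; off < num_workers; ++off) {
            TaskQueue& q = queues[(self + off) % num_workers];
            std::size_t first = 0, count = 0;
            while (q.claim(chunk, first, count)) {
                for (std::size_t t = first; t < first + count; ++t) {
                    // Placeholder for real work, e.g. one source vertex.
                    executed[t].fetch_add(1, std::memory_order_relaxed);
                }
            }
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < num_workers; ++w) threads.emplace_back(worker, w);
    for (auto& t : threads) t.join();

    for (std::size_t t = 0; t < num_tasks; ++t)
        result[t] = executed[t].load(std::memory_order_relaxed);
    return result;
}
```

Calling `run_work_stealing(1000, 4, 8)` should return a vector of 1000 ones, whichever worker ends up executing each task. The chunk size in this sketch plays the role of one of the queuing parameters the abstract mentions: larger chunks mean fewer atomic operations but coarser load balancing.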
Index Terms
- Work Stealing in a Shared Virtual-Memory Heterogeneous Environment: A Case Study with Betweenness Centrality