DOI: 10.1145/2016604.2016612

Scaling scientific applications on clusters of hybrid multicore/GPU nodes

Published: 03 May 2011

Abstract

Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains. However, the increased complexity resulting from architectural heterogeneity and imbalances in hardware resources poses significant programming challenges in harnessing the performance advantages of GPU-accelerated parallel systems. Moreover, the speedup delivered by GPUs is often offset by longer communication latencies and inefficient task scheduling. To achieve the best possible performance, a suitable parallel programming model is therefore essential.
In this paper, we explore a new hybrid parallel programming model that incorporates GPU acceleration into the Partitioned Global Address Space (PGAS) programming paradigm. As we demonstrate by combining Unified Parallel C (UPC) and CUDA as a case study, this hybrid model offers programmers both enhanced programmability and powerful heterogeneous execution. Two application benchmarks from the NAS Parallel Benchmarks (NPB), FT and MG, are used to show the effectiveness of the proposed hybrid approach. Experimental results indicate that both implementations achieve significantly better performance due to optimization opportunities offered by the hybrid model, such as the funneled execution mode and fine-grained overlapping of communication and computation.
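To make the overlap mentioned above concrete, the listing below is a minimal, hypothetical sketch (not the authors' code) of the kind of UPC+CUDA pattern the abstract refers to: each UPC thread drives one GPU, stages its halo into a pinned host buffer, launches the interior update asynchronously on a CUDA stream, and pushes the halo into a neighbour's shared partition with a one-sided upc_memput while the kernel is still running. The kernel name scale_kernel, the buffer sizes, and the single-file layout are illustrative assumptions; a real build would compile the device code with nvcc and the UPC host code with a UPC compiler (e.g. Berkeley UPC), linking the two.

/* Illustrative sketch only: names, sizes, and layout are assumptions. */
#include <upc.h>
#include <cuda_runtime.h>

#define N    (1 << 20)   /* elements owned by each UPC thread (assumption) */
#define HALO 1024        /* halo elements exchanged per step (assumption)  */

/* Shared halo buffer: block i (HALO doubles) has affinity to UPC thread i. */
shared [HALO] double halo[HALO * THREADS];

/* Hypothetical device kernel: scales n doubles in place. */
__global__ void scale_kernel(double *d, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main(void) {
    double *d_data;          /* device-resident partition of this thread */
    double *h_halo;          /* pinned host staging buffer for the halo  */
    cudaMalloc((void **)&d_data, N * sizeof(double));
    cudaMallocHost((void **)&h_halo, HALO * sizeof(double));
    cudaMemset(d_data, 0, N * sizeof(double));  /* stand-in for real initialization */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* 1. Stage the halo (first HALO elements, assumed computed in an
     *    earlier step) from the device into the pinned host buffer.     */
    cudaMemcpy(h_halo, d_data, HALO * sizeof(double), cudaMemcpyDeviceToHost);

    /* 2. Launch the interior update asynchronously; it touches only
     *    elements [HALO, N), so it does not conflict with the halo.     */
    int interior = N - HALO;
    scale_kernel<<<(interior + 255) / 256, 256, 0, stream>>>(d_data + HALO, interior, 2.0);

    /* 3. While the kernel runs on the GPU, the CPU pushes the halo into
     *    the right neighbour's shared block with a one-sided put, so the
     *    communication is overlapped with the computation.              */
    int neighbour = (MYTHREAD + 1) % THREADS;
    upc_memput(&halo[neighbour * HALO], h_halo, HALO * sizeof(double));

    /* 4. Wait for the local kernel, then for all threads' puts.         */
    cudaStreamSynchronize(stream);
    upc_barrier;

    cudaStreamDestroy(stream);
    cudaFreeHost(h_halo);
    cudaFree(d_data);
    return 0;
}

The essential point of the sketch is that the one-sided put issued by the CPU needs no matching receive on the neighbouring thread, so it can proceed concurrently with the asynchronous kernel; the barrier at the end provides the only synchronization.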




Published In

CF '11: Proceedings of the 8th ACM International Conference on Computing Frontiers
May 2011
268 pages
ISBN:9781450306980
DOI:10.1145/2016604


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. GPU
  2. UPC
  3. hybrid parallel programming
  4. multicore

Qualifiers

  • Research-article

Conference

CF'11: Computing Frontiers Conference
May 3 - 5, 2011
Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

