DOI: 10.1145/2016604.2016612

Scaling scientific applications on clusters of hybrid multicore/GPU nodes

Published: 03 May 2011

Abstract

Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains. However, the increased complexity resulting from architectural heterogeneity and imbalances in hardware resources poses significant programming challenges in harnessing the performance advantages of GPU-accelerated parallel systems. Moreover, the speedup delivered by GPUs is often offset by longer communication latencies and inefficient task scheduling. To achieve the best possible performance, a suitable parallel programming model is therefore essential.
In this paper, we explore a new hybrid parallel programming model that incorporates GPU acceleration into the Partitioned Global Address Space (PGAS) programming paradigm. As we demonstrate by combining Unified Parallel C (UPC) and CUDA as a case study, this hybrid model offers programmers both enhanced programmability and powerful heterogeneous execution. Two application benchmarks from the NAS Parallel Benchmarks (NPB), FT and MG, are used to show the effectiveness of the proposed hybrid approach. Experimental results indicate that both implementations achieve significantly better performance due to optimization opportunities offered by the hybrid model, such as the funneled execution mode and fine-grained overlapping of communication and computation.
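To make the overlap mentioned above concrete, the listing below is a minimal, hypothetical sketch (not the authors' code) of the kind of UPC+CUDA pattern the abstract refers to: each UPC thread drives one GPU, stages its halo into a pinned host buffer, launches the interior update asynchronously on a CUDA stream, and pushes the halo into a neighbour's shared partition with a one-sided upc_memput while the kernel is still running. The kernel name scale_kernel, the buffer sizes, and the single-file layout are illustrative assumptions; a real build would compile the device code with nvcc and the UPC host code with a UPC compiler (e.g. Berkeley UPC), linking the two.

/* Illustrative sketch only: names, sizes, and layout are assumptions. */
#include <upc.h>
#include <cuda_runtime.h>

#define N    (1 << 20)   /* elements owned by each UPC thread (assumption) */
#define HALO 1024        /* halo elements exchanged per step (assumption)  */

/* Shared halo buffer: block i (HALO doubles) has affinity to UPC thread i. */
shared [HALO] double halo[HALO * THREADS];

/* Hypothetical device kernel: scales n doubles in place. */
__global__ void scale_kernel(double *d, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main(void) {
    double *d_data;          /* device-resident partition of this thread */
    double *h_halo;          /* pinned host staging buffer for the halo  */
    cudaMalloc((void **)&d_data, N * sizeof(double));
    cudaMallocHost((void **)&h_halo, HALO * sizeof(double));
    cudaMemset(d_data, 0, N * sizeof(double));  /* stand-in for real initialization */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* 1. Stage the halo (first HALO elements, assumed computed in an
     *    earlier step) from the device into the pinned host buffer.     */
    cudaMemcpy(h_halo, d_data, HALO * sizeof(double), cudaMemcpyDeviceToHost);

    /* 2. Launch the interior update asynchronously; it touches only
     *    elements [HALO, N), so it does not conflict with the halo.     */
    int interior = N - HALO;
    scale_kernel<<<(interior + 255) / 256, 256, 0, stream>>>(d_data + HALO, interior, 2.0);

    /* 3. While the kernel runs on the GPU, the CPU pushes the halo into
     *    the right neighbour's shared block with a one-sided put, so the
     *    communication is overlapped with the computation.              */
    int neighbour = (MYTHREAD + 1) % THREADS;
    upc_memput(&halo[neighbour * HALO], h_halo, HALO * sizeof(double));

    /* 4. Wait for the local kernel, then for all threads' puts.         */
    cudaStreamSynchronize(stream);
    upc_barrier;

    cudaStreamDestroy(stream);
    cudaFreeHost(h_halo);
    cudaFree(d_data);
    return 0;
}

The essential point of the sketch is that the one-sided put issued by the CPU needs no matching receive on the neighbouring thread, so it can proceed concurrently with the asynchronous kernel; the barrier at the end provides the only synchronization.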




Published In

CF '11: Proceedings of the 8th ACM International Conference on Computing Frontiers
May 2011
268 pages
ISBN:9781450306980
DOI:10.1145/2016604


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. GPU
  2. UPC
  3. hybrid parallel programming
  4. multicore

Qualifiers

  • Research-article

Conference

CF'11: Computing Frontiers Conference
May 3 - 5, 2011
Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

