FLAT: a GPU programming framework to provide embedded MPI

Authors:

Tsutomu Yoshinaga

GPGPU-5: Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units

Pages 20 - 29

https://doi.org/10.1145/2159430.2159433

Published: 03 March 2012

Abstract

To leverage multiple GPUs in a cluster system, application tasks must be assigned to the GPUs and executed with appropriate use of communication primitives to handle data transfer among GPUs. In current GPU programming models, communication primitives such as MPI functions cannot be used within GPU kernels; they must instead be called from CPU code. The programmer must therefore handle both the GPU kernel and the CPU code for data communication, which makes GPU programming and its optimization very difficult.

In this paper, we propose a programming framework named FLAT, which enables programmers to use MPI functions within GPU kernels. Our framework automatically transforms MPI functions written in a GPU kernel into runtime routines executed on the CPU. We describe the execution model and the implementation of FLAT, and discuss its applicability in terms of scalability and programmability. We also evaluate the performance of FLAT. The results show that FLAT achieves good scalability for the intended applications.
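The general idea of moving a communication call out of a kernel and onto the host can be illustrated with a small stand-in. The sketch below is not FLAT's actual transformation, which the abstract does not detail; it is plain C++ with no CUDA or MPI, and every name in it (`Device`, `kernel_before_comm`, `host_exchange`, `kernel_after_comm`, `run_split_kernel`) is hypothetical. It assumes a kernel that computes, exchanges one value with a neighbouring rank, and computes again, and shows that kernel split at the communication point so that the exchange runs as a host-side routine.

```cpp
// Sketch only: a plain C++ stand-in for splitting a kernel at an embedded
// communication call. No CUDA or MPI is used; all names are hypothetical.
#include <cassert>
#include <vector>

// Stand-in for the device memory of one GPU, owned by one rank.
struct Device {
    std::vector<int> data;  // local elements
    int halo;               // value received from the neighbouring rank
};

// Kernel part before the embedded communication call: pure computation.
void kernel_before_comm(Device& d) {
    for (int& x : d.data) x *= 2;
}

// Host-side routine standing in for the communication call that the
// programmer wrote inside the kernel. Each of the two ranks sends its
// last element to the other rank.
void host_exchange(Device& a, Device& b) {
    assert(!a.data.empty() && !b.data.empty());
    const int from_a = a.data.back();
    const int from_b = b.data.back();
    a.halo = from_b;
    b.halo = from_a;
}

// Kernel part after the communication call: consumes the received value.
void kernel_after_comm(Device& d) {
    for (int& x : d.data) x += d.halo;
}

// Plays the role of generated host code: run the first kernel part on every
// device, communicate on the host, then run the second kernel part.
void run_split_kernel(Device& a, Device& b) {
    kernel_before_comm(a);
    kernel_before_comm(b);
    host_exchange(a, b);
    kernel_after_comm(a);
    kernel_after_comm(b);
}
```

With two devices holding {1, 2} and {3, 4}, the first kernel part doubles the data to {2, 4} and {6, 8}, the host exchange delivers 8 to the first device and 4 to the second, and the second kernel part leaves both devices holding {10, 12}. In a real system the two kernel parts would be separate GPU kernel launches and the exchange would be MPI calls issued by the CPU, which is the burden the abstract says the programmer otherwise carries by hand.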

Index Terms

FLAT: a GPU programming framework to provide embedded MPI
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
      2. Source code generation

Published In

GPGPU-5: Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units

March 2012

122 pages

ISBN:9781450312332

DOI:10.1145/2159430

Editors: David Kaeli (Northeastern University, Boston, MA), John Cavazos (University of Delaware, Newark, DE), Enqiang Sun

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

GPGPU-5

GPGPU-5: The 5th Annual Workshop on General Purpose Processing with Graphics Processing Units

March 3, 2012

London, United Kingdom
