DOI: 10.1145/2159430.2159433
research-article

FLAT: a GPU programming framework to provide embedded MPI

Published: 03 March 2012

Abstract

To leverage multiple GPUs in a cluster system, application tasks must be assigned to the GPUs and executed with appropriate use of communication primitives to handle data transfer among them. In current GPU programming models, communication primitives such as MPI functions cannot be used within GPU kernels; they must instead be called from CPU code. Programmers must therefore handle both the GPU kernel and the CPU code for data communication, which makes GPU programming and its optimization very difficult.
In this paper, we propose a programming framework named FLAT, which enables programmers to use MPI functions within GPU kernels. Our framework automatically transforms MPI functions written in a GPU kernel into runtime routines executed on the CPU. We describe the execution model and implementation of FLAT and discuss its applicability in terms of scalability and programmability. We also evaluate the performance of FLAT. The results show that FLAT achieves good scalability for the intended applications.
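
The structure described in the abstract, in which communication is carried out by CPU code between segments of GPU computation, can be illustrated with a small host-only simulation. This is only a sketch under stated assumptions, not FLAT's implementation: it uses no real GPU or MPI library, and the names `compute_phase`, `exchange`, and `run`, as well as the ring-shaped exchange pattern, are hypothetical and chosen purely for illustration.

```python
# Host-only simulation of "compute on the device, communicate on the host".
# No real GPU or MPI is involved; every name here is a hypothetical stand-in.

def compute_phase(local, halo):
    """Stand-in for one GPU kernel segment: add the received value to each element."""
    return [x + halo for x in local]


def exchange(boundaries):
    """Stand-in for CPU-side communication between ranks.

    Assumes a ring pattern: rank r receives the boundary value of rank r - 1.
    """
    n = len(boundaries)
    return [boundaries[(r - 1) % n] for r in range(n)]


def run(ranks_data, steps):
    """Alternate simulated device compute phases with host-side communication."""
    data = [list(d) for d in ranks_data]
    for _ in range(steps):
        # 1. "Copy" each rank's last element from device to host.
        boundaries = [d[-1] for d in data]
        # 2. The host performs the communication on behalf of the kernel.
        halos = exchange(boundaries)
        # 3. "Launch" the next kernel segment with the received value.
        data = [compute_phase(d, h) for d, h in zip(data, halos)]
    return data
```

For example, `run([[1, 2], [3, 4]], 1)` treats the two inner lists as the data of two ranks and performs one exchange followed by one compute phase. In a hand-written MPI plus GPU program, the programmer writes this alternation explicitly; the abstract's claim is that FLAT generates the CPU-side routines automatically from MPI calls placed in the kernel.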

Published In

GPGPU-5: Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
March 2012
122 pages
ISBN:9781450312332
DOI:10.1145/2159430
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPGPU
  2. MPI
  3. cluster system
  4. programming framework

Qualifiers

  • Research-article

Conference

GPGPU-5
Cited By

  • (2024) Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators. 2024 IEEE 37th International System-on-Chip Conference (SOCC), pp. 1-6. DOI: 10.1109/SOCC62300.2024.10737844. Online publication date: 16-Sep-2024
  • (2024) Design and Implementation of MPI-Native GPU-Initiated MPI Partitioned Communication. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 436-447. DOI: 10.1109/SCW63240.2024.00065. Online publication date: 17-Nov-2024
  • (2018) Efficient Breadth First Search on Multi-GPU Systems Using GPU-Centric OpenSHMEM. OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, pp. 82-96. DOI: 10.1007/978-3-319-73814-7_6. Online publication date: 10-Jan-2018
  • (2017) GPU triggered networking for intra-kernel communications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/3126908.3126950. Online publication date: 12-Nov-2017
  • (2017) GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM. 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp. 253-262. DOI: 10.1109/HiPC.2017.00037. Online publication date: Dec-2017
  • (2016) dCUDA. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.5555/3014904.3014974. Online publication date: 13-Nov-2016
  • (2016) Designing high performance communication runtime for GPU managed memory. Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, pp. 82-91. DOI: 10.1145/2884045.2884050. Online publication date: 12-Mar-2016
  • (2016) dCUDA: Hardware Supported Overlap of Computation and Communication. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 609-620. DOI: 10.1109/SC.2016.51. Online publication date: Nov-2016
  • (2016) Video processing using GPU-accelerator under desktop virtualization environment. 2016 International Conference on Audio, Language and Image Processing (ICALIP), pp. 766-770. DOI: 10.1109/ICALIP.2016.7846638. Online publication date: Jul-2016
  • (2015) PLB-HeC. Proceedings of the 2015 IEEE International Conference on Cluster Computing, pp. 96-105. DOI: 10.1109/CLUSTER.2015.24. Online publication date: 8-Sep-2015
