FLAT: A GPU Programming Framework to Provide Embedded MPI

ABSTRACT
To leverage multiple GPUs in a cluster system, application tasks must be assigned to the GPUs and executed with appropriate use of communication primitives to transfer data among them. In current GPU programming models, communication primitives such as MPI functions cannot be called from within GPU kernels; they must instead be invoked from CPU code. The programmer therefore has to coordinate both the GPU kernel and the CPU-side communication code, which makes GPU programming and its optimization very difficult.
In this paper, we propose a programming framework named FLAT, which enables programmers to use MPI functions within GPU kernels. The framework automatically transforms MPI calls written in a GPU kernel into runtime routines executed on the CPU. We describe the execution model and implementation of FLAT, discuss its applicability in terms of scalability and programmability, and evaluate its performance. The results show that FLAT achieves good scalability for the intended class of applications.
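The abstract's core idea can be illustrated with a minimal sketch. The kernel below is hypothetical (FLAT's actual syntax and runtime API are not given in the abstract); it only shows the style of code the framework is said to enable: an MPI call written directly inside device code, which the framework's source-to-source transformation would rewrite into a CPU-executed runtime routine.

```cuda
#include <mpi.h>

// Hypothetical halo-exchange kernel written in the FLAT style.
// All names here are illustrative assumptions, not FLAT's real API.
__global__ void stencil_step(float *local, float *halo, int n, int rank)
{
    // ... compute on this rank's portion of the grid ...

    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // In plain CUDA+MPI this call is illegal inside a kernel.
        // Per the abstract, FLAT would transform it into a runtime
        // routine on the CPU that handles the device-to-host transfer,
        // the actual MPI_Send, and the associated synchronization.
        MPI_Send(halo, n, MPI_FLOAT, rank + 1, /* tag */ 0, MPI_COMM_WORLD);
    }
}
```

Without such a framework, the same exchange is split across two code bases: the kernel writes the halo to device memory, then host code issues a `cudaMemcpy` followed by `MPI_Send`, which is exactly the dual-maintenance burden the abstract describes.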