Design of a Hybrid MPI-CUDA Benchmark Suite for CPU-GPU Clusters

ABSTRACT
In the last few years, GPUs have become an integral part of HPC clusters. To test these heterogeneous CPU-GPU systems, we designed a hybrid CUDA-MPI benchmark suite consisting of three communication- and compute-intensive applications: Matrix Multiplication (MM), Needleman-Wunsch (NW), and the A-DFA compression algorithm [1]. The main goal of this work is to characterize these workloads on CPU-GPU clusters. Our benchmark applications are designed to allow cluster administrators to identify bottlenecks in the cluster, to decide whether scaling applications to multiple nodes would increase or decrease overall throughput, and to design effective scheduling policies. Our experiments show that inter-node communication can significantly degrade the throughput of communication-intensive applications. We conclude that the scalability of the applications depends primarily on two factors: the cluster configuration and the applications' characteristics.
REFERENCES
[1] M. Becchi and P. Crowley, "A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation," ACM TACO, vol. 10, no. 1, pp. 1-26, 2013.
[2] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970.
[3] "How to Optimize Data Transfers in CUDA C/C++," http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc.
[4] "An Introduction to CUDA-Aware MPI," http://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi.