Abstract
The CPU–GPU communication bottleneck limits the performance gains of GPU applications in heterogeneous GPGPU systems and is usually addressed through data reuse optimization. This paper analyzes data reuse through a DAG abstraction and derives rules showing that run-time data reuse optimization can effectively relieve the bottleneck. Based on these rules, the paper proposes a run-time optimization framework for data reuse, called R-Tracker. R-Tracker uses a locality-aware searching approach to handle reuses. It implements data reuse optimization at low cost and performs the searching, the data transfers, and the GPU computation concurrently. R-Tracker relaxes the constraints required by compiler-based approaches and thus achieves a better reuse effect. Experimental results show that R-Tracker improves performance by 1.77–16.42% over the compiler-based approach OpenMPC and by 1.40–8.39% over CGCM in single-node execution, and by 48.78–60% over CGCM in multi-node execution.
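To make the idea of run-time data reuse concrete, the following is a minimal sketch in Python, not the R-Tracker implementation. It assumes a simplified model in which each buffer is identified by name and copied as a whole; the class `ReuseTracker` and its methods are hypothetical names introduced only for illustration. The tracker records which buffers already have a valid copy on the device, so that a later kernel reading the same data triggers no new host-to-device transfer.

```python
class ReuseTracker:
    """Toy run-time tracker of which copy of each buffer is up to date.

    Illustrative assumption: buffers are whole units identified by name.
    This is not the locality-aware search described in the paper.
    """

    def __init__(self):
        self.on_device = set()   # buffers whose device copy is valid
        self.host_stale = set()  # buffers whose host copy is out of date
        self.h2d = 0             # host-to-device transfers issued
        self.d2h = 0             # device-to-host transfers issued

    def launch_kernel(self, reads, writes):
        """Return the buffers that must be copied to the device first."""
        needed = []
        for buf in reads:
            if buf not in self.on_device and buf not in needed:
                needed.append(buf)
        self.h2d += len(needed)
        self.on_device.update(needed)
        # Kernel outputs live on the device; the host copies become stale.
        self.on_device.update(writes)
        self.host_stale.update(writes)
        return needed

    def host_read(self, buf):
        """Return True if a device-to-host copy is needed before the CPU reads buf."""
        if buf in self.host_stale:
            self.host_stale.discard(buf)
            self.d2h += 1
            return True
        return False

    def host_write(self, buf):
        """CPU overwrote the whole buffer: the device copy is no longer valid."""
        self.on_device.discard(buf)
        self.host_stale.discard(buf)


tracker = ReuseTracker()
first = tracker.launch_kernel(reads=["a", "b"], writes=["c"])
second = tracker.launch_kernel(reads=["a", "c"], writes=["d"])  # "a" and "c" are reused
fetch_d = tracker.host_read("d")
fetch_d_again = tracker.host_read("d")
tracker.host_write("a")
third = tracker.launch_kernel(reads=["a"], writes=[])
```

In this sketch the second kernel launch needs no transfers because both of its inputs are already resident on the device, which is the kind of redundant data movement that a run-time approach can remove without compile-time knowledge of the access pattern.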
Acknowledgments
The authors thank the National Supercomputer Center in Tianjin for providing the platforms used to carry out the experiments. This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61173039, and the National High Technology Research and Development Program (863 Program) of China under Grant Nos. 2012AA010904, 2008AA01A202, and 2012AA01A306.
Cite this article
Li, L., Wang, E., Zhang, X. et al. A run-time optimization approach for reducing data movements using locality-aware searching. J Supercomput 69, 864–886 (2014). https://doi.org/10.1007/s11227-014-1186-x
DOI: https://doi.org/10.1007/s11227-014-1186-x