A run-time optimization approach for reducing data movements using locality-aware searching

The Journal of Supercomputing

Abstract

The CPU–GPU communication bottleneck limits the performance of GPU applications in heterogeneous GPGPU systems and is usually addressed through data reuse optimization. This paper analyzes data reuse using a DAG abstraction and derives rules showing that run-time data reuse optimization can effectively relieve the bottleneck. Based on these rules, the paper proposes a run-time optimization framework for data reuse called R-Tracker. R-Tracker uses a locality-aware searching approach to handle reuses. It implements the data reuse optimization at low cost and performs the searching, the data transfers, and the GPU computation concurrently. R-Tracker relaxes the constraints required by compiler-based approaches and thus achieves better reuse. The experimental results show that R-Tracker improves performance by 1.77–16.42 % over the compiler-based approach OpenMPC and by 1.40–8.39 % over CGCM in single-node execution, and by 48.78–60 % over CGCM in multi-node execution.
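The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of run-time data reuse tracking, not R-Tracker itself. The class `ReuseTracker`, its methods, and the most-recently-used search order are all hypothetical names and assumptions chosen for illustration; no real GPU transfers are issued, only counted.

```python
from collections import OrderedDict


class ReuseTracker:
    """Toy run-time tracker of which buffers already have a valid device copy.

    Hypothetical illustration only; this is not the paper's implementation.
    """

    def __init__(self):
        # Maps a buffer name to True when its device copy is up to date.
        # The most recently used entries are kept at the end.
        self._resident = OrderedDict()
        self.transfers = 0  # host-to-device copies that would be issued

    def _lookup(self, name):
        # Search the most recently used entries first, an assumed
        # stand-in for a locality-aware search order.
        for key in reversed(self._resident):
            if key == name:
                return self._resident[key]
        return None

    def before_kernel(self, reads):
        """Count a transfer only for buffers that are missing or stale."""
        for name in reads:
            if self._lookup(name) is not True:
                self.transfers += 1
                self._resident[name] = True
            self._resident.move_to_end(name)

    def host_write(self, name):
        """Mark the device copy stale after the CPU modifies the buffer."""
        if name in self._resident:
            self._resident[name] = False


tracker = ReuseTracker()
tracker.before_kernel(["a", "b"])  # both buffers are copied
tracker.before_kernel(["a", "b"])  # both are reused, nothing is copied
tracker.host_write("a")            # the CPU updates "a"
tracker.before_kernel(["a", "b"])  # only "a" is copied again
print(tracker.transfers)
```

Because the decision is made at run time from the observed state of each buffer, the sketch needs no static proof that a buffer is unchanged between kernels, which is the kind of constraint the abstract says compiler-based approaches must satisfy.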

Acknowledgments

The authors thank the National Supercomputer Center in Tianjin for providing platforms to carry out the experiments. This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61173039, and the National High Technology Research and Development Program (863 Program) of China under Grant No. 2012AA010904, No. 2008AA01A202, and No. 2012AA01A306.

Corresponding author

Correspondence to Xiaoshe Dong.

Cite this article

Li, L., Wang, E., Zhang, X. et al. A run-time optimization approach for reducing data movements using locality-aware searching. J Supercomput 69, 864–886 (2014). https://doi.org/10.1007/s11227-014-1186-x

  • DOI: https://doi.org/10.1007/s11227-014-1186-x