A run-time optimization approach for reducing data movements using locality-aware searching

Li, Liang; Wang, Endong; Zhang, Xingjun; Yan, Kang; Ju, Tao; Dong, Xiaoshe

doi:10.1007/s11227-014-1186-x

Published: 20 April 2014

Volume 69, pages 864–886, (2014)

The Journal of Supercomputing

Liang Li¹, Endong Wang², Xingjun Zhang¹, Kang Yan¹, Tao Ju¹ & Xiaoshe Dong¹

Abstract

The CPU–GPU communication bottleneck limits the performance gains of GPU applications in heterogeneous GPGPU systems and is usually addressed by data reuse optimization. This paper analyzes data reuse through a DAG abstraction and derives rules showing that run-time data reuse optimization can effectively relieve the bottleneck. Based on these rules, the paper proposes a run-time optimization framework for data reuse, called R-Tracker. R-Tracker uses a locality-aware searching approach to handle reuses. It not only implements the data reuse optimization at low cost but also performs the searching, the data transfers, and the GPU computation concurrently. R-Tracker relaxes the constraints required by compiler-based approaches and thus achieves a better reuse effect. The experimental results show that R-Tracker improves performance by 1.77–16.42% over the compiler-based approach OpenMPC and by 1.40–8.39% over CGCM in single-node execution, and by 48.78–60% over CGCM in multi-node execution.
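The abstract does not describe how R-Tracker works internally, so the following is only a minimal sketch of the general idea of run-time data reuse tracking, not the authors' design. It simulates the device in plain Python: a tracker remembers which host buffers already have an up-to-date device copy and skips the transfer when one exists. The class `ReuseTracker`, its methods, and the buffer names are all hypothetical, and the sketch does not model the locality-aware search or the overlap of searching, transfers, and computation.

```python
class ReuseTracker:
    """Toy run-time tracker (hypothetical, not R-Tracker itself).

    Remembers which host buffers already have a valid copy on a
    simulated device so that repeated transfers can be skipped.
    """

    def __init__(self):
        self._host_version = {}   # buffer name -> current host version
        self._device_version = {} # buffer name -> version last copied
        self.transfers = 0        # simulated host-to-device copies
        self.reuses = 0           # copies avoided through reuse

    def host_write(self, name):
        """Record that the host modified `name`; the device copy is now stale."""
        self._host_version[name] = self._host_version.get(name, 0) + 1

    def ensure_on_device(self, name):
        """Return True if a simulated copy was needed, False if reused."""
        version = self._host_version.setdefault(name, 0)
        if self._device_version.get(name) == version:
            self.reuses += 1
            return False
        self._device_version[name] = version
        self.transfers += 1
        return True

    def launch(self, inputs):
        """Simulate a kernel launch; return how many inputs were copied."""
        return sum(self.ensure_on_device(name) for name in inputs)


tracker = ReuseTracker()
tracker.launch(["a", "b"])  # first use: both buffers are copied
tracker.launch(["a", "b"])  # unchanged: both copies are reused
tracker.host_write("a")     # host updates "a"
tracker.launch(["a", "b"])  # only "a" is copied again
print(tracker.transfers, tracker.reuses)  # prints "3 3"
```

Because the decision is made at launch time from the actual state of each buffer, a tracker of this kind needs no static proof that a buffer is unmodified between kernels, which is one plausible reading of how a run-time approach can relax constraints that compiler-based analysis must respect.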

Acknowledgments

The authors thank the National Supercomputer Center in Tianjin for providing platforms to carry out the experiments. This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61173039, and the National High Technology Research and Development Program (863 Program) of China under Grant No. 2012AA010904, No. 2008AA01A202, and No. 2012AA01A306.

Author information

Authors and Affiliations

Xi’an Jiaotong University, Xi’an, China
Liang Li, Xingjun Zhang, Kang Yan, Tao Ju & Xiaoshe Dong
The State Key Laboratory of High-end Server and Storage Technology, Jinan, China
Endong Wang

Corresponding author

Correspondence to Xiaoshe Dong.

About this article

Cite this article

Li, L., Wang, E., Zhang, X. et al. A run-time optimization approach for reducing data movements using locality-aware searching. J Supercomput 69, 864–886 (2014). https://doi.org/10.1007/s11227-014-1186-x

Issue Date: August 2014
DOI: https://doi.org/10.1007/s11227-014-1186-x
