Abstract
The article proposes a cost model for two implementations of the all-pairs-shortest-path (APSP) problem on graphics processing units (GPUs), with a particular focus on correctly reproducing the shape of the runtime curve across different input problem sizes and thread-block configurations. The model categorizes thread blocks by their number of active warps and defines four approaches to estimating how thread blocks are mapped onto the hardware execution resources. Each approach yields a runtime prediction, and together these predictions span an interval for the expected execution time. An experimental evaluation shows that most characteristics of the runtime curve are predicted correctly, which is especially important for small graphs, where a tiny increase in input size may cause a significant increase in execution time. For large graphs, the cost prediction deviates by less than 1 % from the measured times in many cases.
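For context, the classic sequential baseline for APSP is the Floyd–Warshall algorithm, whose two inner loops are independent for a fixed pivot and are therefore the natural unit of work to distribute over GPU thread blocks. The sketch below is an illustrative CPU reference only; it is not the authors' GPU kernels, whose blocking and configuration the article's cost model addresses.

```python
INF = float("inf")

def floyd_warshall(dist):
    """In-place APSP on an n x n matrix of edge weights.

    dist[i][j] holds the weight of edge i -> j, INF if the edge is
    absent, and 0 on the diagonal. After the call, dist[i][j] is the
    length of the shortest path from i to j.
    """
    n = len(dist)
    for k in range(n):              # pivot vertex (sequential dependency)
        for i in range(n):          # rows are independent for fixed k;
            dik = dist[i][k]        # on a GPU, one thread per (i, j) pair
            if dik == INF:
                continue            # no path through k from this row
            for j in range(n):
                via_k = dik + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist
```

The outer loop over the pivot `k` must run sequentially, so a GPU implementation launches one kernel (or one grid) per pivot; the number and shape of the thread blocks in that grid is exactly the configuration parameter whose effect on runtime the article's interval model predicts.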
Dümmler, J., Egerland, S. Interval-based performance modeling for the all-pairs-shortest-path problem on GPUs. J Supercomput 71, 4192–4214 (2015). https://doi.org/10.1007/s11227-015-1514-9