Abstract
Graphics processing units (GPUs) are widely used for scientific and engineering applications with a high degree of parallelism, and their computing power continues to grow through architectural enhancements. CUDA streams, backed by the Hyper-Q capability of NVIDIA GPUs, are a well-known means of improving performance: without explicit synchronization, and depending on the architectural capabilities of the device, streams allow independent operations to run concurrently. Experimental results show that the way programs are grouped into streams affects execution time; the stream set that yields the largest performance improvement is therefore called the efficient stream set. This article proposes a framework that predicts the efficient stream set for two streams without trying all combinations, which would be a very time-consuming process. The framework comprises a performance model and a scheduler: the model estimates the duration of the concurrently executed portions of streamed programs, and the scheduler uses these estimates to predict the efficient stream set. The prediction relies only on features measured from non-streamed executions of the programs. The results show that, even with an average performance-model error of 33%, the scheduler predicts the optimized sets with 100% precision.
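As context for the stream mechanism described above, the following is a minimal CUDA sketch of launching two independent kernels into separate streams so that Hyper-Q hardware may overlap their execution. The kernel names, sizes, and launch parameters are illustrative assumptions, not taken from the paper's benchmarks.

```cuda
#include <cuda_runtime.h>

// Two illustrative kernels; with enough free SMs, Hyper-Q may overlap them.
__global__ void scaleKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

__global__ void addKernel(float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Each kernel is issued to its own stream; with no synchronization
    // between the streams, the runtime is free to run them concurrently.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    scaleKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    addKernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();  // wait for both streams to finish

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Whether the two kernels actually overlap depends on resource availability (SMs, registers, shared memory), which is precisely why pairing programs into streams well or badly changes the overall execution time.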













Notes
A set of threads that execute the same instruction on different data elements.
Cite this article
Beheshti Roui, M., Shekofteh, S.K., Noori, H. et al. Efficient scheduling of streams on GPGPUs. J Supercomput 76, 9270–9302 (2020). https://doi.org/10.1007/s11227-020-03209-x