
Efficient scheduling of streams on GPGPUs

The Journal of Supercomputing

Abstract

Graphics processing units (GPUs) are widely used for scientific and engineering applications with a high level of parallelism. The computing power of GPUs keeps improving through enhancements to their architecture. NVIDIA's compute unified device architecture (CUDA) streams, backed by Hyper-Q on NVIDIA graphics cards, are a well-established capability for improving performance. Exploiting these architectural capabilities, CUDA streams allow several processes to run concurrently without any explicit synchronization. Experimental results show that the way programs are grouped into streams affects the execution time; the stream set with the greatest performance improvement is therefore called the efficient stream set. This article proposes a framework that predicts the efficient stream set for two streams without trying all combinations, which would be a very time-consuming process. The proposed framework employs a performance model and a scheduler: the performance model estimates the duration of the concurrent portions of streamed programs, and the scheduler uses this estimate to predict the efficient stream set. The prediction method relies only on non-stream features of the programs. The results show that even with an average performance-model error of 33%, the scheduler predicts the optimized sets with 100% precision.
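As context for the streaming mechanism discussed in the abstract, the following is a minimal sketch (not the authors' code) of launching two independent kernels in separate CUDA streams so that, on Hyper-Q-capable hardware, they may execute concurrently. The kernel bodies, names, and launch sizes are illustrative placeholders, not the benchmarked programs from the article.

```cuda
#include <cuda_runtime.h>

// Illustrative placeholder kernels; the actual benchmarked programs
// in the article are not shown here.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    // Work issued to different streams has no implied ordering,
    // so the hardware may overlap the two kernels.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    kernelA<<<(n + 255) / 256, 256, 0, s0>>>(dx, n);
    kernelB<<<(n + 255) / 256, 256, 0, s1>>>(dy, n);

    // Synchronize only at the end; no cross-stream synchronization.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Whether the two kernels actually overlap depends on available resources, which is precisely why the choice of which programs to pair in a stream set affects execution time.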



Notes

  1. A set of threads that execute the same instruction on different data elements.
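The note above describes a warp. For reference, the warp size of the installed device can be queried through the CUDA runtime; this sketch assumes device 0 is present.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    // Query device 0; warpSize is 32 on current NVIDIA GPUs.
    cudaGetDeviceProperties(&prop, 0);
    printf("warp size: %d threads\n", prop.warpSize);
    return 0;
}
```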


Author information


Corresponding author

Correspondence to Hamid Noori.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 17 and 18.

Table 17 List of selected metrics as input data for SPSS
Table 18 The values of trained coefficients in Eq. (4) and Eq. (6)


About this article


Cite this article

Beheshti Roui, M., Shekofteh, S.K., Noori, H. et al. Efficient scheduling of streams on GPGPUs. J Supercomput 76, 9270–9302 (2020). https://doi.org/10.1007/s11227-020-03209-x

