A memory-driven scheduling scheme and optimization for concurrent execution in GPU

Published in Cluster Computing.

Abstract

Concurrent execution of GPU tasks is available on modern GPU devices. However, limited device memory is an obvious bottleneck when executing many GPU tasks, and task priority and system performance are often ignored. To address these issues, a real-time GPU scheduling scheme is proposed in this paper. A reservation algorithm based on device memory (RBDM) is adopted to give high-priority tasks more opportunity to execute. High priority first wake (HPFW) and small memory HPFW (SM-HPFW) are employed in the scheduling of waiting tasks to improve priority response time and system performance. A CPU-based monitor is developed to check GPU task execution. Experiments show that RBDM works effectively. Compared with FIFO, HPFW decreases overall priority response time significantly, and overall task completion time can be reduced by 20 % using SM-HPFW when the distribution of device memory requirements of GPU tasks is even.
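The two wake-up policies named in the abstract can be sketched as ordering rules over a waiting queue. The following is a minimal, hypothetical illustration, not the authors' implementation: the names `Task`, `hpfw_order`, `sm_hpfw_order`, and `admit`, and the greedy admission step, are assumptions made here to show why preferring small-memory tasks among equal priorities can wake more tasks under a fixed device-memory budget.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int   # larger value = higher priority
    mem_mb: int     # device-memory requirement in MB

def hpfw_order(waiting):
    """High priority first wake: order strictly by priority."""
    return sorted(waiting, key=lambda t: -t.priority)

def sm_hpfw_order(waiting):
    """Small memory HPFW: among equal priorities, prefer tasks with a
    smaller device-memory footprint so more of them fit concurrently."""
    return sorted(waiting, key=lambda t: (-t.priority, t.mem_mb))

def admit(ordered, free_mb):
    """Greedy admission: wake tasks in the given order while the free
    device memory can still hold each task's requirement."""
    woken = []
    for t in ordered:
        if t.mem_mb <= free_mb:
            woken.append(t.name)
            free_mb -= t.mem_mb
    return woken
```

For example, with waiting tasks A (priority 2, 600 MB), B (priority 2, 300 MB), C (priority 1, 200 MB) and 700 MB free, HPFW wakes only A, whereas SM-HPFW wakes B and then C, keeping more tasks running concurrently under the same memory budget.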



Acknowledgments

This research is supported by NSFC and the Shanghai Municipal Education Commission. I would like to extend my sincere gratitude to my friends at the Illinois Institute of Technology (IIT), who provided selfless help for my work and life abroad during my time as a visiting scholar. I gratefully acknowledge IIT, which offered me a cosy work environment, and my colleagues at HPCC, Shanghai University.

Author information

Corresponding author

Correspondence to Bao-yu Xu.

About this article


Cite this article

Xu, By., Zhang, W., Sun, Xh. et al. A memory-driven scheduling scheme and optimization for concurrent execution in GPU. Cluster Comput 19, 2241–2250 (2016). https://doi.org/10.1007/s10586-016-0656-8
