Abstract
High-performance GPU clusters are widely used for massive concurrent dataflow processing and therefore have stringent requirements for real-time performance, reliability, and flexibility. However, high computational intensity and resource utilization lead to excessive system temperature and power consumption, and can even cause instantaneous failures. In this paper, we present a real-time and efficient dynamic taskflow migration approach (DTMA) for a GPU computing cluster. First, we propose our basic theoretical models; among them, the cluster communication model describes all communication paths and calculates the communication overhead of the different migration modes. Second, on the basis of these models and the analysis of multiple instances, we summarize taskflow migration rules that help balance cluster resource utilization and improve the overall performance of the GPUs. Third, DTMA adjusts the cluster task allocation with a performance- and power-consumption-aware migration approach, reducing single-node power consumption and enhancing system reliability by shifting the current GPU load to other available GPU(s). Moreover, DTMA uses a circular queue to store the resource information of available GPUs for better task scheduling. We evaluate DTMA by analyzing power consumption, temperature, fan speed, and migration cost in different experiments. The results demonstrate that DTMA improves the performance and reliability of our cluster computing system and reduces instantaneous failures.
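The circular queue of available-GPU resource information mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the record fields (`power_w`, `temp_c`, `util_pct`), the thresholds, and the least-utilized selection policy are all assumptions introduced here for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class GPUStatus:
    """One resource record for an available GPU (fields are illustrative)."""
    node_id: int
    power_w: float   # current power draw in watts
    temp_c: float    # current temperature in degrees Celsius
    util_pct: float  # current utilization percentage


class GPUResourceQueue:
    """Bounded circular queue of available-GPU status records."""

    def __init__(self, capacity: int):
        # deque with maxlen gives circular-queue behavior: when full,
        # appending a new record evicts the oldest one.
        self.q = deque(maxlen=capacity)

    def update(self, status: GPUStatus) -> None:
        # Drop any stale record for the same node, then append the fresh one.
        self.q = deque(
            (s for s in self.q if s.node_id != status.node_id),
            maxlen=self.q.maxlen,
        )
        self.q.append(status)

    def pick_target(self, power_limit: float, temp_limit: float):
        # Hypothetical migration policy: among GPUs below both thresholds,
        # choose the least-utilized one; return None if none qualifies.
        candidates = [
            s for s in self.q
            if s.power_w < power_limit and s.temp_c < temp_limit
        ]
        return min(candidates, key=lambda s: s.util_pct, default=None)
```

In use, a monitor would periodically `update()` each node's record and call `pick_target()` when an overheated or overloaded GPU needs its taskflow shifted elsewhere.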
Acknowledgements
The authors gratefully acknowledge the support of the National Natural Science Foundation of China (61572325 and 60970012); Ph.D. Programs Foundation of Ministry of Education of China (Grant No. 20113120110008); Shanghai Key Programs of Science and Technology (14511107902 and 16DZ1203603); Shanghai Leading Academic Discipline Project (No. XTKX2012); Shanghai Engineering Research Center Project (GCZX14014 and C14001).
About this article
Cite this article
Fang, Y., Chen, Q. A real-time and reliable dynamic migration model for concurrent taskflow in a GPU cluster. Cluster Comput 22, 585–599 (2019). https://doi.org/10.1007/s10586-018-2866-8