Abstract
Power control on reliability-aware GPU clusters with dynamically variable voltage and speed is investigated as two combinatorial optimization problems: minimizing task execution time under an energy consumption constraint, and minimizing energy consumption under a system reliability constraint. Both problems have applications in general multiprocessor computing and real-time multiprocessing systems, where energy consumption and system reliability are both important. These problems, which emphasize the trade-off among performance, power, and reliability, have not been well studied before. In this research, a novel power control model is built on Model Predictive Control theory. The Maximum Entropy Method is used to determine the partial ordering relation of the control variables and to assess the quality of solutions. Our controller caps redundant energy consumption by dynamically switching the energy states of the nodes in a GPU cluster. We compare our controller with a control scheme that does not consider system reliability. The experimental results demonstrate that the proposed controller is more reliable and valuable.
Acknowledgements
The authors gratefully acknowledge the support of the National Natural Science Foundation of China (No. 60970012), the Innovation Program of Shanghai Science and Technology Commission (Nos. 09511501000, 09220502800), and the Shanghai Leading Academic Discipline Project (XTKX2012).
Cite this article
Wang, H., Chen, Q. Optimization power consumption model of reliability-aware GPU clusters. J Supercomput 67, 153–174 (2014). https://doi.org/10.1007/s11227-013-0993-9
DOI: https://doi.org/10.1007/s11227-013-0993-9