Abstract
High-Performance Computing (HPC) systems collect vast amounts of operational data through monitoring frameworks, often augmented with additional information from schedulers and runtime systems. This data can serve operational needs directly, rather than remaining a data pool for post-mortem analysis. This work focuses on deriving a supervised learning model that enables optimal selection of CPU frequency during the execution of a job, with the objective of minimizing the energy consumption of an HPC system. Our model is trained on sensor data and performance metrics collected with two distinct open-source frameworks for monitoring and runtime optimization. Our results show good prediction of CPU power draw and number of instructions retired under realistic dynamic runtime settings, within a relatively low error margin.
Acknowledgements.
This work originated from the TUM Data Innovation Lab, and was further supported by Intel Deutschland GmbH and LRZ.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ozer, G. et al. (2020). Towards a Predictive Energy Model for HPC Runtime Systems Using Supervised Learning. In: Schwardmann, U., et al. Euro-Par 2019: Parallel Processing Workshops. Euro-Par 2019. Lecture Notes in Computer Science, vol 11997. Springer, Cham. https://doi.org/10.1007/978-3-030-48340-1_48
DOI: https://doi.org/10.1007/978-3-030-48340-1_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-48339-5
Online ISBN: 978-3-030-48340-1
eBook Packages: Computer Science (R0)