Abstract
Today, significant advances in science and technology cannot be envisioned without high computing capacity. To solve large problems in science, engineering, and business, datacenters provide High-Performance Computing (HPC) systems that aggregate the computing capacity of thousands of computing nodes at a cost of millions of euros per year [12]. In a datacenter, an anomaly is a suspicious or abnormal pattern in the monitoring signals. Anomalies vary in severity, and in extreme conditions they can cause an outage of the datacenter. By defining complex statistical rule-based anomaly detection methods, this paper investigates the thermal anomaly detection task in one of the most powerful HPC systems in the world, namely Marconi100, hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.
References
The 53rd, 55th, 58th editions of the top500 list, June 2022. https://www.top500.org/
ACM: Getting started with HPC (2021). https://selects.acm.org/selections/getting-started-with-hpc
Agelastos, A., et al.: Toward rapid understanding of production HPC applications and systems. In: 2015 IEEE International Conference on Cluster Computing, pp. 464–473. IEEE (2015)
Ahad, R., Chan, E., Santos, A.: Toward autonomic cloud: Automatic anomaly detection and resolution. In: 2015 International Conference on Cloud and Autonomic Computing, pp. 200–203. IEEE (2015)
Aksar, B., et al.: E2EWatch: an end-to-end anomaly diagnosis framework for production HPC systems. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 70–85. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_5
Arzani, B., Ciraci, S., Loo, B.T., Schuster, A., Outhred, G.: Taking the blame game out of data centers operations with Net Poirot. In: Proceedings of the 2016 ACM SIGCOMM Conference, pp. 440–453. SIGCOMM 2016. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2934872.2934884
Bartolini, A., et al.: Paving the way toward energy-aware and automated datacentre. In: Proceedings of the 48th International Conference on Parallel Processing: Workshops, pp. 8:1–8:8. ICPP 2019. ACM, New York, NY, USA (2019). https://doi.org/10.1145/3339186.3339215
Bhatele, A., Mohror, K., Langer, S.H., Isaacs, K.E.: There goes the neighborhood: performance degradation due to nearby jobs. In: SC 2013: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–12. IEEE (2013)
Bhatele, A., et al.: The case of performance variability on dragonfly-based systems. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 896–905. IEEE (2020)
Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Anomaly detection using autoencoders in high performance computing systems. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9428–9433 (2019)
Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., Benini, L.: A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems. Eng. Appl. Artif. Intell. 85, 634–644 (2019)
Borghesi, A., Bartolini, A., Milano, M., Benini, L.: Pricing schemes for energy-efficient HPC systems: design and exploration. Int. J. High Perform. Comput. Appl. 33(4), 716–734 (2019). https://doi.org/10.1177/1094342018814593
Borghesi, A., Molan, M., Milano, M., Bartolini, A.: Anomaly detection and anticipation in high performance computing systems. IEEE Trans. Parallel Distrib. Syst. 33(4), 739–750 (2021)
Brandt, J.M., et al.: Enabling advanced operational analysis through multi-subsystem data integration on trinity. Technical report, Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National ... (2015)
Conficoni, C., Bartolini, A., Tilli, A., Cavazzoni, C., Benini, L.: Integrated energy-aware management of supercomputer hybrid cooling systems. IEEE Trans. Industr. Inf. 12(4), 1299–1311 (2016)
Conficoni, C., Bartolini, A., Tilli, A., Cavazzoni, C., Benini, L.: HPC cooling: a flexible modeling tool for effective design and management. IEEE Trans. Sustain. Comput. 6(3), 441–455 (2018). https://doi.org/10.1109/TSUSC.2018.2809574
Dalmazo, B.L., Vilela, J.P., Simoes, P., Curado, M.: Expedite feature extraction for enhanced cloud anomaly detection. In: NOMS 2016–2016 IEEE/IFIP Network Operations and Management Symposium, pp. 1215–1220. IEEE (2016)
Das, A., Mueller, F., Rountree, B.: Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1092–1101. IEEE (2020)
Dorier, M., Antoniu, G., Ross, R., Kimpe, D., Ibrahim, S.: CALCioM: mitigating I/O interference in HPC systems through cross-application coordination. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 155–164. IEEE (2014)
ECP: Exascale computing project. https://www.exascaleproject.org/what-is-exascale/
Intel server board S2600IP and workstation board W2600CR technical product specification, October 2013
Jayathilaka, H., Krintz, C., Wolski, R.: Performance monitoring and root cause analysis for cloud-hosted web applications. In: Proceedings of the 26th International Conference on World Wide Web, pp. 469–478 (2017)
Marathe, A., Zhang, Y., Blanks, G., Kumbhare, N., Abdulla, G., Rountree, B.: An empirical survey of performance and energy efficiency variation on Intel processors. In: Proceedings of the 5th International Workshop on Energy Efficient Supercomputing, pp. 1–8 (2017)
Netti, A., Kiziltan, Z., Babaoglu, O., Sîrbu, A., Bartolini, A., Borghesi, A.: A machine learning approach to online fault classification in HPC systems. Future Gener. Comput. Syst. 110, 1009–1022 (2020)
Netti, A., Ott, M., Guillen, C., Tafani, D., Schulz, M.: Operational data analytics in practice: experiences from design to deployment in production HPC environments. arXiv preprint arXiv:2106.14423 (2021)
Seyedkazemi Ardebili, M., Cavazzoni, C., Benini, L., Bartolini, A.: Thermal characterization of a Tier0 datacenter room in normal and thermal emergency conditions. In: Proceedings of High Performance Computing in Science and Engineering 2019 (2019)
Seyedkazemi Ardebili, M., et al.: Prediction of thermal hazards in a real datacenter room using temporal convolutional networks. In: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1256–1259. IEEE (2021)
Shaykhislamov, D., Voevodin, V.: An approach for dynamic detection of inefficient supercomputer applications. Procedia Comput. Sci. 136, 35–43 (2018)
Acknowledgments
The study has been conducted in the context of EU H2020-JTI-EuroHPC-2019-1 project REGALE (g.n. 956560), EuroHPC EU PILOT project (g.a. 101034126), EU Pilot for exascale EuroHPC EUPEX (g.a. 101033975), European Processor Initiative (EPI) SGA2 (g.a. 101036168), and CINECA.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Ardebili, M.S., Bartolini, A., Acquaviva, A., Benini, L. (2022). Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. In: Anzt, H., Bienz, A., Luszczek, P., Baboulin, M. (eds) High Performance Computing. ISC High Performance 2022 International Workshops. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13387. Springer, Cham. https://doi.org/10.1007/978-3-031-23220-6_18
DOI: https://doi.org/10.1007/978-3-031-23220-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23219-0
Online ISBN: 978-3-031-23220-6
eBook Packages: Computer Science, Computer Science (R0)