Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems

Ardebili, Mohsen Seyedkazemi; Bartolini, Andrea; Acquaviva, Andrea; Benini, Luca

doi:10.1007/978-3-031-23220-6_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13387)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Today, significant advances in science and technology cannot be envisioned without high computing capacity. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems that aggregate the computing capacity of thousands of computing nodes, at a cost of millions of euros per year [12]. In a datacenter, an anomaly is a suspicious or abnormal pattern in the monitoring signals. Anomalies vary in severity, and in extreme conditions one can cause an outage of the datacenter. By defining complex statistical rule-based anomaly detection methods, this paper investigates the thermal anomaly detection task in one of the most powerful HPC systems in the world, namely Marconi100, hosted at CINECA. The suggested anomaly detection method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.

References

The 53rd, 55th, and 58th editions of the TOP500 list, June 2022. https://www.top500.org/
ACM: Getting started with HPC (2021). https://selects.acm.org/selections/getting-started-with-hpc
Agelastos, A., et al.: Toward rapid understanding of production HPC applications and systems. In: 2015 IEEE International Conference on Cluster Computing, pp. 464–473. IEEE (2015)
Ahad, R., Chan, E., Santos, A.: Toward autonomic cloud: Automatic anomaly detection and resolution. In: 2015 International Conference on Cloud and Autonomic Computing, pp. 200–203. IEEE (2015)
Aksar, B., et al.: E2EWatch: an end-to-end anomaly diagnosis framework for production HPC systems. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 70–85. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_5
Arzani, B., Ciraci, S., Loo, B.T., Schuster, A., Outhred, G.: Taking the blame game out of data centers operations with NetPoirot. In: Proceedings of the 2016 ACM SIGCOMM Conference, pp. 440–453. SIGCOMM 2016. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2934872.2934884
Bartolini, A., et al.: Paving the way toward energy-aware and automated datacentre. In: Proceedings of the 48th International Conference on Parallel Processing: Workshops, pp. 8:1–8:8. ICPP 2019. ACM, New York, NY, USA (2019). https://doi.org/10.1145/3339186.3339215
Bhatele, A., Mohror, K., Langer, S.H., Isaacs, K.E.: There goes the neighborhood: performance degradation due to nearby jobs. In: SC 2013: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–12. IEEE (2013)
Bhatele, A., et al.: The case of performance variability on dragonfly-based systems. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 896–905. IEEE (2020)
Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Anomaly detection using autoencoders in high performance computing systems. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9428–9433 (2019)
Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., Benini, L.: A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems. Eng. Appl. Artif. Intell. 85, 634–644 (2019)
Borghesi, A., Bartolini, A., Milano, M., Benini, L.: Pricing schemes for energy-efficient HPC systems: design and exploration. Int. J. High Perform. Comput. Appl. 33(4), 716–734 (2019). https://doi.org/10.1177/1094342018814593
Borghesi, A., Molan, M., Milano, M., Bartolini, A.: Anomaly detection and anticipation in high performance computing systems. IEEE Trans. Parallel Distrib. Syst. 33(4), 739–750 (2021)
Brandt, J.M., et al.: Enabling advanced operational analysis through multi-subsystem data integration on trinity. Technical report, Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National ... (2015)
Conficoni, C., Bartolini, A., Tilli, A., Cavazzoni, C., Benini, L.: Integrated energy-aware management of supercomputer hybrid cooling systems. IEEE Trans. Industr. Inf. 12(4), 1299–1311 (2016)
Conficoni, C., Bartolini, A., Tilli, A., Cavazzoni, C., Benini, L.: HPC cooling: a flexible modeling tool for effective design and management. IEEE Trans. Sustain. Comput. 6(3), 441–455 (2018). https://doi.org/10.1109/TSUSC.2018.2809574
Dalmazo, B.L., Vilela, J.P., Simoes, P., Curado, M.: Expedite feature extraction for enhanced cloud anomaly detection. In: NOMS 2016–2016 IEEE/IFIP Network Operations and Management Symposium, pp. 1215–1220. IEEE (2016)
Das, A., Mueller, F., Rountree, B.: Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1092–1101. IEEE (2020)
Dorier, M., Antoniu, G., Ross, R., Kimpe, D., Ibrahim, S.: CALCioM: mitigating I/O interference in HPC systems through cross-application coordination. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 155–164. IEEE (2014)
ECP: Exascale computing project. https://www.exascaleproject.org/what-is-exascale/
Intel server board S2600IP and workstation board W2600CR technical product specification, October 2013
Jayathilaka, H., Krintz, C., Wolski, R.: Performance monitoring and root cause analysis for cloud-hosted web applications. In: Proceedings of the 26th International Conference on World Wide Web, pp. 469–478 (2017)
Marathe, A., Zhang, Y., Blanks, G., Kumbhare, N., Abdulla, G., Rountree, B.: An empirical survey of performance and energy efficiency variation on Intel processors. In: Proceedings of the 5th International Workshop on Energy Efficient Supercomputing, pp. 1–8 (2017)
Netti, A., Kiziltan, Z., Babaoglu, O., Sîrbu, A., Bartolini, A., Borghesi, A.: A machine learning approach to online fault classification in HPC systems. Future Gener. Comput. Syst. 110, 1009–1022 (2020)
Netti, A., Ott, M., Guillen, C., Tafani, D., Schulz, M.: Operational data analytics in practice: experiences from design to deployment in production HPC environments. arXiv preprint arXiv:2106.14423 (2021)
Seyedkazemi Ardebili, M., Cavazzoni, C., Benini, L., Bartolini, A.: Thermal characterization of a Tier0 datacenter room in normal and thermal emergency conditions. In: Proceedings of High Performance Computing in Science and Engineering 2019 (2019)
Seyedkazemi Ardebili, M., et al.: Prediction of thermal hazards in a real datacenter room using temporal convolutional networks. In: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1256–1259. IEEE (2021)
Shaykhislamov, D., Voevodin, V.: An approach for dynamic detection of inefficient supercomputer applications. Procedia Comput. Sci. 136, 35–43 (2018)

Acknowledgments

The study has been conducted in the context of EU H2020-JTI-EuroHPC-2019-1 project REGALE (g.n. 956560), EuroHPC EU PILOT project (g.a. 101034126), EU Pilot for exascale EuroHPC EUPEX (g.a. 101033975), European Processor Initiative (EPI) SGA2 (g.a. 101036168), and CINECA.

Author information

Authors and Affiliations

Università degli Studi di Bologna, Viale Risorgimento, 2, 40136, Bologna, Italy
Mohsen Seyedkazemi Ardebili, Andrea Bartolini, Andrea Acquaviva & Luca Benini
Eidgenössische Technische Hochschule Zürich, Gloriastrasse 35, 8092, Zürich, Switzerland
Luca Benini

Corresponding author

Correspondence to Mohsen Seyedkazemi Ardebili.

Editor information

Editors and Affiliations

University of Tennessee, Knoxville, TN, USA
Hartwig Anzt
University of New Mexico, Albuquerque, NM, USA
Amanda Bienz
University of Tennessee, Knoxville, TN, USA
Piotr Luszczek
Université Paris-Saclay, Orsay, France
Marc Baboulin

About this paper

Cite this paper

Ardebili, M.S., Bartolini, A., Acquaviva, A., Benini, L. (2022). Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. In: Anzt, H., Bienz, A., Luszczek, P., Baboulin, M. (eds) High Performance Computing. ISC High Performance 2022 International Workshops. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13387. Springer, Cham. https://doi.org/10.1007/978-3-031-23220-6_18

DOI: https://doi.org/10.1007/978-3-031-23220-6_18
Published: 04 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23219-0
Online ISBN: 978-3-031-23220-6
eBook Packages: Computer Science, Computer Science (R0)
