Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems

  • Conference paper
High Performance Computing. ISC High Performance 2022 International Workshops (ISC High Performance 2022)

Abstract

Today, significant advances in science and technology cannot be envisioned without high computing capacity. To solve large problems in science, engineering, and business, datacenters provide High-Performance Computing (HPC) systems that aggregate the computing capacity of thousands of computing nodes, at a cost of millions of euros per year [12]. In a datacenter, an anomaly is a suspicious or abnormal pattern in the monitoring signals. Anomalies vary in severity, and in extreme conditions they can cause an outage of the datacenter. By defining complex statistical rule-based anomaly detection methods, this paper investigates thermal anomaly detection in one of the most powerful HPC systems in the world, Marconi100, hosted at CINECA. The proposed method is successfully validated against real thermal hazard events reported for the studied HPC cluster while in production.

References

  1. The 53rd, 55th, 58th editions of the top500 list, June 2022. https://www.top500.org/

  2. ACM: Getting started with HPC (2021). https://selects.acm.org/selections/getting-started-with-hpc

  3. Agelastos, A., et al.: Toward rapid understanding of production HPC applications and systems. In: 2015 IEEE International Conference on Cluster Computing, pp. 464–473. IEEE (2015)

  4. Ahad, R., Chan, E., Santos, A.: Toward autonomic cloud: Automatic anomaly detection and resolution. In: 2015 International Conference on Cloud and Autonomic Computing, pp. 200–203. IEEE (2015)

  5. Aksar, B., et al.: E2EWatch: an end-to-end anomaly diagnosis framework for production HPC systems. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 70–85. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_5

  6. Arzani, B., Ciraci, S., Loo, B.T., Schuster, A., Outhred, G.: Taking the blame game out of data centers operations with Net Poirot. In: Proceedings of the 2016 ACM SIGCOMM Conference, pp. 440–453. SIGCOMM 2016. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2934872.2934884

  7. Bartolini, A., et al.: Paving the way toward energy-aware and automated datacentre. In: Proceedings of the 48th International Conference on Parallel Processing: Workshops, pp. 8:1–8:8. ICPP 2019. ACM, New York, NY, USA (2019). https://doi.org/10.1145/3339186.3339215, http://doi.acm.org/10.1145/3339186.3339215

  8. Bhatele, A., Mohror, K., Langer, S.H., Isaacs, K.E.: There goes the neighborhood: performance degradation due to nearby jobs. In: SC 2013: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–12. IEEE (2013)

  9. Bhatele, A., et al.: The case of performance variability on dragonfly-based systems. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 896–905. IEEE (2020)

  10. Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Anomaly detection using autoencoders in high performance computing systems. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9428–9433 (2019)

  11. Borghesi, A., Bartolini, A., Lombardi, M., Milano, M., Benini, L.: A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems. Eng. Appl. Artif. Intell. 85, 634–644 (2019)

  12. Borghesi, A., Bartolini, A., Milano, M., Benini, L.: Pricing schemes for energy-efficient HPC systems: design and exploration. Int. J. High Perform. Comput. Appl. 33(4), 716–734 (2019). https://doi.org/10.1177/1094342018814593

  13. Borghesi, A., Molan, M., Milano, M., Bartolini, A.: Anomaly detection and anticipation in high performance computing systems. IEEE Trans. Parallel Distrib. Syst. 33(4), 739–750 (2021)

  14. Brandt, J.M., et al.: Enabling advanced operational analysis through multi-subsystem data integration on trinity. Technical report, Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National ... (2015)

  15. Conficoni, C., Bartolini, A., Tilli, A., Cavazzoni, C., Benini, L.: Integrated energy-aware management of supercomputer hybrid cooling systems. IEEE Trans. Industr. Inf. 12(4), 1299–1311 (2016)

  16. Conficoni, C., Bartolini, A., Tilli, A., Cavazzoni, C., Benini, L.: HPC cooling: a flexible modeling tool for effective design and management. IEEE Trans. Sustain. Comput. 6(3), 441–455 (2018). https://doi.org/10.1109/TSUSC.2018.2809574

  17. Dalmazo, B.L., Vilela, J.P., Simoes, P., Curado, M.: Expedite feature extraction for enhanced cloud anomaly detection. In: NOMS 2016–2016 IEEE/IFIP Network Operations and Management Symposium, pp. 1215–1220. IEEE (2016)

  18. Das, A., Mueller, F., Rountree, B.: Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1092–1101. IEEE (2020)

  19. Dorier, M., Antoniu, G., Ross, R., Kimpe, D., Ibrahim, S.: CALCioM: mitigating I/O interference in HPC systems through cross-application coordination. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 155–164. IEEE (2014)

  20. ECP: Exascale computing project. https://www.exascaleproject.org/what-is-exascale/

  21. Intel server board S2600IP and workstation board W2600CR technical product specification, October 2013

  22. Jayathilaka, H., Krintz, C., Wolski, R.: Performance monitoring and root cause analysis for cloud-hosted web applications. In: Proceedings of the 26th International Conference on World Wide Web, pp. 469–478 (2017)

  23. Marathe, A., Zhang, Y., Blanks, G., Kumbhare, N., Abdulla, G., Rountree, B.: An empirical survey of performance and energy efficiency variation on intel processors. In: Proceedings of the 5th International Workshop on Energy Efficient Supercomputing, pp. 1–8 (2017)

  24. Netti, A., Kiziltan, Z., Babaoglu, O., Sîrbu, A., Bartolini, A., Borghesi, A.: A machine learning approach to online fault classification in HPC systems. Future Gener. Comput. Syst. 110, 1009–1022 (2020)

  25. Netti, A., Ott, M., Guillen, C., Tafani, D., Schulz, M.: Operational data analytics in practice: experiences from design to deployment in production HPC environments. arXiv preprint arXiv:2106.14423 (2021)

  26. Seyedkazemi Ardebili, M., Cavazzoni, C., Benini, L., Bartolini, A.: Thermal characterization of a Tier0 datacenter room in normal and thermal emergency conditions. In: Proceedings of High Performance Computing in Science and Engineering 2019 (2019)

  27. Seyedkazemi Ardebili, M., et al.: Prediction of thermal hazards in a real datacenter room using temporal convolutional networks. In: 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1256–1259. IEEE (2021)

  28. Shaykhislamov, D., Voevodin, V.: An approach for dynamic detection of inefficient supercomputer applications. Procedia Comput. Sci. 136, 35–43 (2018)

Acknowledgments

This study was conducted in the context of the EU H2020-JTI-EuroHPC-2019-1 project REGALE (g.n. 956560), EuroHPC EU PILOT project (g.a. 101034126), EU Pilot for exascale EuroHPC EUPEX (g.a. 101033975), European Processor Initiative (EPI) SGA2 (g.a. 101036168), and CINECA.

Author information

Corresponding author

Correspondence to Mohsen Seyedkazemi Ardebili.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ardebili, M.S., Bartolini, A., Acquaviva, A., Benini, L. (2022). Rule-Based Thermal Anomaly Detection for Tier-0 HPC Systems. In: Anzt, H., Bienz, A., Luszczek, P., Baboulin, M. (eds) High Performance Computing. ISC High Performance 2022 International Workshops. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13387. Springer, Cham. https://doi.org/10.1007/978-3-031-23220-6_18

  • DOI: https://doi.org/10.1007/978-3-031-23220-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23219-0

  • Online ISBN: 978-3-031-23220-6

  • eBook Packages: Computer Science, Computer Science (R0)
