Abstract
The U.S. Department of Energy’s National Renewable Energy Laboratory (NREL) hosts one of the world’s most energy-efficient HPC data centers. The system uses component-level warm-water liquid cooling to remove heat from the data center efficiently and capture it for reuse in the building or rejection to the atmosphere. Given the complexity of this system, building data-driven tools for holistically monitoring and operating the entire data center is a priority for ensuring maximal efficiency and resiliency. In this advanced smart facility, over one million metrics are recorded per minute using a state-of-the-art streaming data architecture and software that capture and process the state of the system in real time. Here we detail two efforts to effectively analyze, visualize, and interpret this large-volume streaming data. First, we have developed a novel, flexible system for identifying and visualizing individual metric anomalies and component performance across the data center through automatic metadata extraction and physically motivated visualization for quick interpretation. Second, to directly connect system maintenance to data stream processing, we explore a physics-informed, multi-metric drift and anomaly detection application to detect scale buildup in heat exchangers.
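To make the idea of physics-informed, multi-metric drift detection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that scale buildup shows up as a slow decline in a heat exchanger's hot-side temperature effectiveness, computed from three temperature streams, and that flow rates stay roughly constant. The function names, the sample layout `(t_hot_in, t_hot_out, t_cold_in)`, and the thresholds are all hypothetical choices made for illustration.

```python
from statistics import mean


def effectiveness(t_hot_in, t_hot_out, t_cold_in):
    """Hot-side temperature effectiveness (dimensionless).

    Assumption: flow rates are roughly constant, so a falling value
    suggests degraded heat transfer (for example, scale buildup).
    """
    span = t_hot_in - t_cold_in
    if span <= 0:
        raise ValueError("hot inlet must be warmer than cold inlet")
    return (t_hot_in - t_hot_out) / span


def detect_drift(samples, baseline_n, window_n, max_drop):
    """Return the index of the first sample at which the trailing-window
    mean effectiveness is more than max_drop below the baseline mean,
    or None if no such drift is seen.

    samples: sequence of (t_hot_in, t_hot_out, t_cold_in) tuples
             (hypothetical layout combining three metric streams).
    """
    eff = [effectiveness(*s) for s in samples]
    if len(eff) < baseline_n + window_n:
        return None
    baseline = mean(eff[:baseline_n])
    for end in range(baseline_n + window_n, len(eff) + 1):
        recent = mean(eff[end - window_n:end])
        if baseline - recent > max_drop:
            return end - 1
    return None


# Synthetic example: ten clean readings, then a slowly warming hot outlet.
clean = [(40.0, 25.0, 20.0)] * 10
fouled = [(40.0, 25.0 + 0.5 * k, 20.0) for k in range(1, 11)]
first_alarm = detect_drift(clean + fouled, baseline_n=10, window_n=3, max_drop=0.04)
```

Combining several raw temperatures into one physically meaningful quantity is what distinguishes this from per-metric anomaly detection: each temperature can stay within its normal range while the derived effectiveness steadily degrades.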
Acknowledgements
This work was authored in part by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding was provided by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy and Hewlett-Packard Enterprise.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Egan, H., Purkayastha, A., Sickinger, D. (2022). Data Center Facility Monitoring with Physics Aware Approach. In: Anzt, H., Bienz, A., Luszczek, P., Baboulin, M. (eds) High Performance Computing. ISC High Performance 2022 International Workshops. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13387. Springer, Cham. https://doi.org/10.1007/978-3-031-23220-6_17
DOI: https://doi.org/10.1007/978-3-031-23220-6_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23219-0
Online ISBN: 978-3-031-23220-6
eBook Packages: Computer Science, Computer Science (R0)