Holistic thermal-aware workload management and infrastructure control for heterogeneous data centers using machine learning

doi:10.1016/j.future.2021.01.007

Future Generation Computer Systems

Volume 118, May 2021, Pages 208-218

https://doi.org/10.1016/j.future.2021.01.007

Highlights

• We model thermal heterogeneity using a novel thermal model.
• We incorporate low complexity data-driven thermal models for workload assignment and cooling control.
• We jointly optimize the assignment of workload and the operational parameters of the cooling unit(s).

Abstract

Two key contributors to the energy expenditure in data centers are information technology (IT) equipment and cooling infrastructures. The standard practice of data centers lacks a tight correlation between these two entities, resulting in considerable power wastage. Considering the cooling cost of different locations inside a data center (cooling heterogeneity) and various cooling capabilities of servers (server heterogeneity) has significant potential for saving power, yet has not been studied thoroughly in the literature. There is a necessity for state-of-the-art approaches to integrate the control of IT and cooling units. Moreover, the literature still lacks an accurate and fast thermal model for temperature prediction inside a data center. In this paper, innovative approaches to quantify data center thermal heterogeneities are presented. Using data center thermal models the cost of providing cold air at the front of servers can be (indirectly) calculated, and the capability of servers to be cooled is formulated. Our approach assigns jobs to locations that are efficient to cool (from the perspectives of both servers and cooling units) and tunes cooling unit parameters. The method, called holistic data center infrastructure control (HDIC), has the potential to save a considerable amount of power by exploiting synergies between the workload scheduler and operational parameters of cooling units.

Introduction

Two percent of power consumption in the United States in 2014 was due to data centers, equivalent to approximately 70 billion kWh [1]. In contrast, the power consumption of data centers in 2000 was 30 billion kWh [2]. It has been estimated that from 2015 to 2020, the incoming load to data centers will double [3]. The increasing number of online and mobile applications, public interest in accessing cyber entertainment, and cloud services for both personal and business users have played a significant role in this jump [4]. Anticipation of this increase, together with power usage constraints, has led large data center operators and vendors to invest more in the efficient use of power [1].

There are several methods and techniques to reduce power consumption at different levels of a data center. At the device level, some electronic devices support low power states to save energy, if the performance of the device is not impacted [5], [6]. For example, dynamic voltage and frequency scaling (DVFS) is a method that provides different levels of power consumption and performance for processors [7], [8]. At the server level, dynamic suspension of unneeded servers, server consolidation, and the ability to choose different levels of power and performance are vital approaches for energy efficiency. For instance, server consolidation aims to save power by turning unneeded servers off during low workload periods [9], [10], [11]. At the facility level, power efficiency of the cooling system itself is also a significant concern [12], [13], [14].

Different servers and locations in data centers are not cooled equally, resulting in what we call data center thermal heterogeneity. In other words, servers differ in their cooling requirements (server heterogeneity), and locations differ in their cooling cost (cooling heterogeneity). Cooling heterogeneity refers to the fact that not all locations in a data center benefit to the same degree from a particular cooling unit. Related works in the literature have either simplified or ignored the heterogeneity that exists in the data center environment when studying workload assignment or cooling control. We have studied the cost-saving opportunities that exist due to server heterogeneity during workload assignment [15], and also due to cooling heterogeneity [16]; however, no study has considered all aspects of data center thermal heterogeneity to control cooling unit parameters and assign workload.

In this paper, a holistic data center infrastructure control (HDIC) framework is presented. HDIC is a novel method that exploits all aspects of data center thermal heterogeneity as an opportunity to save power during data center control. The proposed framework employs neural networks to construct thermal models for the data center and individual servers. Server thermal models are used to estimate the core temperature of servers, and a data center thermal model is used to predict the inlet temperatures of servers. These models are attractive because they are data-driven: building accurate physical models for data center thermal dynamics is notoriously difficult.
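As a rough illustration of what a data-driven thermal model involves, the sketch below fits a small neural network to synthetic samples and checks that training reduces the prediction error. It is not the authors' model: the input features (utilization, fan speed, and supply temperature, all scaled to [0, 1]), the synthetic target function, the network size, and every name in the code are assumptions made for this example.

```python
import math
import random


def make_synthetic_samples(n, rng):
    """Hypothetical data: the scaled inlet temperature rises with utilization
    and supply temperature and falls with fan speed (invented relationship)."""
    samples = []
    for _ in range(n):
        util, fan, supply = rng.random(), rng.random(), rng.random()
        inlet = 0.5 * supply + 0.3 * util * (1.0 - fan) + 0.1
        samples.append(([util, fan, supply], inlet))
    return samples


class TinyMLP:
    """One hidden tanh layer and a linear output, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        h = self._hidden(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def train_step(self, x, y, lr):
        """One SGD step on the squared error 0.5 * (prediction - y)**2."""
        h = self._hidden(x)
        err = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2 - y
        for j, hj in enumerate(h):
            # Backpropagate through tanh using the output weight before its update.
            grad_h = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
        self.b2 -= lr * err


def mean_squared_error(model, samples):
    return sum((model.predict(x) - y) ** 2 for x, y in samples) / len(samples)


rng = random.Random(0)
train = make_synthetic_samples(200, rng)
model = TinyMLP(n_in=3, n_hidden=6, rng=rng)
mse_before = mean_squared_error(model, train)
for _ in range(200):  # epochs
    for x, y in train:
        model.train_step(x, y, lr=0.05)
mse_after = mean_squared_error(model, train)
print(f"MSE before training: {mse_before:.4f}, after: {mse_after:.6f}")
```

A real model would be trained on measured data and validated on held-out samples; the point here is only that a compact learned function can stand in for a physical simulation once it has been fitted.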

The generated thermal models incorporate both cooling and server heterogeneity. These models can then be used by an optimizer to control the system in a power-efficient manner. We demonstrate that the solutions to the underlying optimization problem lead to considerable power savings while maintaining IT performance. Our contributions in this paper can be summarized as follows:

• We incorporate low complexity data-driven thermal models to take thermal heterogeneity in data centers into account during workload assignment and cooling control.
• We present an optimization framework that can jointly optimize the assignment of workload and the operational parameters of the cooling unit(s), while respecting the expected performance of IT equipment.
• We show advantages of using thermal differences between servers and locations in a data center using thermal models.

In the next section, related work is classified and reviewed. In Section 3, the architecture of the system under study is illustrated, and the models required to formulate the problem are explained in Section 3.2. In Section 4, the methodology for cooling control and workload assignment is discussed and techniques to optimize the data center control parameters are explained. The solution of the developed optimization problems is discussed in Section 5, where HDIC is also compared with other representative methods. Finally, concluding remarks are given in Section 6. A summary of the notation used in this paper is listed in Table 1.

Section snippets

Literature review

There is a significant body of literature on this topic, studying various control methods, workload assignment frameworks, and thermal models for data centers. In this section, a number of previous works related to our contributions are reviewed: data center thermal models, thermal-aware workload assignment frameworks, and thermal-aware control methods.

There are various methods of temperature prediction for data centers (data center thermal models). Computational fluid dynamics (CFD) is a traditional

System architecture and models

In this section, the architecture of the data center under study is provided. The steps to acquire data and then to build data center and server thermal models are explained and the power consumption model is formulated.

Thermal-aware cooling control and workload assignment

Exploring data center thermal heterogeneity is possible through thermal models. In this section, two approaches are presented; they are compared later to demonstrate the efficiency of HDIC.

In the first approach, only cooling heterogeneity is considered, via the data center thermal model. This approach is called cooling heterogeneity-aware infrastructure control, or CHIC. The second approach is HDIC, which uses both the data center and server thermal models for control decisions.
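The difference in available information between the two approaches can be sketched as follows. This is an illustration under stated assumptions, not the formulation in the paper: it assumes that a CHIC-style check bounds predicted inlet temperatures, that an HDIC-style check bounds predicted core temperatures, and it replaces both neural-network models with invented linear stand-ins. All function names, coefficients, and limits are hypothetical.

```python
def predict_inlet_temps(utilizations, fan_speed, supply_temp):
    """Hypothetical stand-in for the data center thermal model.

    Each location gets an invented recirculation coefficient, so the same
    load heats some inlets more than others (cooling heterogeneity)."""
    recirc = [0.5 + 0.5 * i for i in range(len(utilizations))]
    return [supply_temp + r * u * 10.0 / fan_speed
            for r, u in zip(recirc, utilizations)]


def predict_core_temp(inlet_temp, utilization, thermal_resistance):
    """Hypothetical stand-in for a server thermal model.

    A larger thermal_resistance means the server is harder to cool
    (server heterogeneity)."""
    return inlet_temp + thermal_resistance * utilization


def chic_feasible(utils, fan_speed, supply_temp, inlet_limit):
    """Assumed CHIC-style check: only the data center model is consulted."""
    inlets = predict_inlet_temps(utils, fan_speed, supply_temp)
    return all(t <= inlet_limit for t in inlets)


def hdic_feasible(utils, fan_speed, supply_temp, resistances, core_limit):
    """Assumed HDIC-style check: data center and server models are chained."""
    inlets = predict_inlet_temps(utils, fan_speed, supply_temp)
    return all(predict_core_temp(t, u, r) <= core_limit
               for t, u, r in zip(inlets, utils, resistances))


# Two servers: server 0 sits in a well-cooled spot but is hard to cool
# internally; server 1 sits in a poorly cooled spot but cools easily.
resistances = [60.0, 30.0]
load_on_first = [0.9, 0.3]
load_on_second = [0.3, 0.9]

chic_first = chic_feasible(load_on_first, 1.0, 20.0, inlet_limit=27.0)
hdic_first = hdic_feasible(load_on_first, 1.0, 20.0, resistances, core_limit=75.0)
chic_second = chic_feasible(load_on_second, 1.0, 20.0, inlet_limit=27.0)
hdic_second = hdic_feasible(load_on_second, 1.0, 20.0, resistances, core_limit=75.0)
print(chic_first, hdic_first, chic_second, hdic_second)
```

With these invented numbers the two checks disagree on both assignments, which shows how adding server models can change which workload placements a controller considers acceptable.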

Results and comparison

Both optimization problems must be solved by nonlinear solution methods, as both the cost function and the thermal models are nonlinear. We used the MATLAB fmincon tool with the interior-point option to solve them. A complete description of the data center configuration is given in Section 3.1. Briefly, for this data center configuration, the decision variables are the utilizations of 40 servers, the speeds of five fans, and one inlet water temperature. Due to the
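The decision variables listed above (40 utilizations, five fan speeds, one inlet water temperature) form a 46-element vector. The sketch below shows one way such a vector could be packed, unpacked, and checked against box bounds before being handed to a nonlinear solver. The variable ordering, the bound values, and all names are assumptions for illustration; the paper's actual formulation and limits are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple

N_SERVERS = 40  # number of server utilizations, from the text
N_FANS = 5      # number of fan speeds, from the text


@dataclass
class ControlPoint:
    """One candidate operating point of the data center (hypothetical type)."""
    utilizations: List[float]
    fan_speeds: List[float]
    inlet_water_temp: float


def pack(point: ControlPoint) -> List[float]:
    """Flatten a control point into the vector a solver would operate on."""
    return list(point.utilizations) + list(point.fan_speeds) + [point.inlet_water_temp]


def unpack(x: List[float]) -> ControlPoint:
    """Inverse of pack; rejects vectors of the wrong length."""
    if len(x) != N_SERVERS + N_FANS + 1:
        raise ValueError("expected %d decision variables" % (N_SERVERS + N_FANS + 1))
    return ControlPoint(
        utilizations=list(x[:N_SERVERS]),
        fan_speeds=list(x[N_SERVERS:N_SERVERS + N_FANS]),
        inlet_water_temp=x[-1],
    )


def bounds(fan_range: Tuple[float, float] = (0.2, 1.0),
           water_range: Tuple[float, float] = (10.0, 20.0)) -> List[Tuple[float, float]]:
    """Box bounds per variable. The fan and water ranges are placeholders."""
    return [(0.0, 1.0)] * N_SERVERS + [fan_range] * N_FANS + [water_range]


def within_bounds(x: List[float], box: List[Tuple[float, float]]) -> bool:
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, box))


point = ControlPoint([0.5] * N_SERVERS, [0.6] * N_FANS, 15.0)
x = pack(point)
print(len(x), within_bounds(x, bounds()))
```

In Python, a vector and bounds list of this shape could be passed to a constrained nonlinear routine such as `scipy.optimize.minimize`; that would be a substitute for, not an equivalent of, the MATLAB fmincon setup described in the text.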

Conclusion

Considering all aspects of data center thermal heterogeneity for workload assignment and cooling control can result in a considerable amount of savings in cooling power consumption. Data center heterogeneity can be obtained by means of data center and server thermal models. The data center thermal model predicts the temperature of different locations as a function of IT and cooling parameters. This thermal model is used to indirectly calculate the cost of providing cool air for a specific

CRediT authorship contribution statement

SeyedMorteza MirhoseiniNejad: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Ghada Badawy: Methodology, Writing - review & editing, Supervision, Funding acquisition. Douglas G. Down: Methodology, Writing - review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was supported by a Collaborative Research and Development grant CRDPI506142-16 from the Natural Sciences and Engineering Research Council of Canada (NSERC). We would like to acknowledge the useful comments of the anonymous referees.

SeyedMorteza MirhoseiniNejad is a Ph.D. student in Computer Science at McMaster University. He received his M.Sc. degree from Iran University of Science and Technology and his B.Sc. degree from Bahonar University of Kerman, both in Computer Engineering. His research interests are machine learning, optimization, queueing theory, data analysis, resource management, and predictive control and maintenance.

References (40)

MirhoseiniNejad S. et al. Joint data center cooling and workload management: A thermal-aware approach. Future Gener. Comput. Syst. (2020)
Moazamigoodarzi H. et al. Real-time temperature predictions in IT server enclosures. Int. J. Heat Mass Transfer (2018)
Gill S.S. et al. Thermosim: Deep learning based framework for modeling and simulation of thermal-aware resource management for cloud computing environments. J. Syst. Softw. (2020)
Zhao X. et al. A smart coordinated temperature feedback controller for energy-efficient data centers. Future Gener. Comput. Syst. (2019)
Fang Q. et al. Optimization based resource and cooling management for a high performance computing data center. ISA Trans. (2019)
Mukherjee T. et al. Spatio-temporal thermal-aware job scheduling to minimize energy consumption in virtualized heterogeneous data centers. Comput. Netw. (2009)
Mellit A. et al. Artificial intelligence techniques for sizing photovoltaic systems: A review. Renew. Sustain. Energy Rev. (2009)
Shehabi A. et al. United States Data Center Energy Usage Report. Tech. rep. (2016)
Brown R. et al. Report to Congress on Server and Data Center Energy Efficiency Public Law 109-431. Tech. rep. (2007)
Li Y. et al. Thermal-aware hybrid workload management in a green datacenter towards renewable energy utilization. Energies (2019)
Klemick H. et al. Data Center Energy Efficiency Investments: Qualitative Evidence from Focus Groups and Interviews. Tech. rep. (2017)
Gupta M. et al. Using low-power modes for energy conservation in ethernet LANs
Yadava N. et al. Design of one-transistor SRAM cell for low power consumption
Aldahari E. Dynamic voltage and frequency scaling enhanced task scheduling technologies toward green cloud computing
Ge R. et al. Performance-constrained distributed DVS scheduling for scientific applications on power-aware clusters
Meisner D. et al. Power management of online data-intensive services. ACM SIGARCH Comput. Archit. News (2011)
Lin M. et al. Dynamic right-sizing for power-proportional data centers. IEEE/ACM Trans. Netw. (2013)
Krioukov A. et al. NapSAC: Design and implementation of a power-proportional web cluster. SIGCOMM Comput. Commun. Rev. (2011)
Tang Q. et al. Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: A cyber-physical approach. IEEE Trans. Parallel Distrib. Syst. (2008)
Bash C. et al. Cool job allocation: Measuring the power savings of placing jobs at cooling-efficient locations in the data center


Cited by (24)

Design and performance analysis of modern computational storage devices: A systematic review
2024, Expert Systems with Applications
Computational Storage Devices, also known as In-Storage Computing or In-Situ Processing, offer higher computing power than traditional storage devices. Innovation in computational storage devices is more important than ever because of the exponential growth of digital data produced by both individuals and organizations. With the advent of non-volatile storage devices in mainstream computing (e.g., hard disk drives and solid-state drives), the idea has gained wide appeal, with several academic and commercial prototypes becoming available. This survey aims to provide a systematic overview of the work being done in the area of computational storage and to indicate future directions. This overview considers and examines a number of research questions to comprehensively summarize, analyse and discuss various storage devices such as Hard Disk Drives (HDD), Solid-State Drives (SSD) and Computational Storage Devices (CSDs) in terms of design, programming model, acceleration and energy efficiency. Also, the Load Balancing algorithm has been reviewed from different domains such as cloud computing, data centers and the Internet of Things (IoT). This review not only focuses on the existing literature problems but also requires some effort to explore the future in this area.
The sustainability benefits of economization in data centers containing chilled water systems
2023, Resources, Conservation and Recycling
This study provides the first known conservative estimates of reductions in carbon and water scarcity footprints when employing economization technologies in chilled water-based data centers (DCs) in the U.S. Specifically, a generic DC that employs a Computer Room Air Handling (CRAH)-based cooling system with a total IT load of 400 kW is computationally modeled at 925 locations in the U.S. with airside and/or water-side economization. The energy savings and environmental benefits are presented in terms of key DC performance metric reductions, including Carbon Usage Effectiveness (CUE) and Water Scarcity Usage Effectiveness (WSUE). The results suggest that implementing both water-side and airside economization schemes under conditions within the ASHRAE A1 Allowable envelope reduces the mean state-averaged PUE, WUE, CUE, and WSUE by 11%, 22%, 11%, and 15%, respectively. These results highlight the environmental benefits of economization technologies and provide guidance on economization strategies.
A comprehensive review on deep learning algorithms: Security and privacy issues
2023, Computers and Security
Machine Learning (ML) algorithms are used to train machines to perform various complicated tasks, which they modify and improve with experience. ML has become widely used for automated decisions, in particular in applications that have a profound impact on society and rely on Deep Learning (DL) for autonomous decisions, such as Patient Health Records (PHR) and Unmanned Aerial Vehicles (UAVs). Such impacts raise serious concerns about the potential vulnerabilities introduced by DL. Traditional attackers have powerful motives to alter and modify DL algorithms to subvert the outcomes. In poisoning attacks, an attacker can consciously change the training dataset, which is used to operate the outcomes of a decision-based model. In privacy and evasion attacks, an adversary can also misclassify new datasets to infer private information. Therefore, in this paper, we have provided a review of security and privacy issues of DL algorithms and analyzed their applications and challenges based on state-of-the-art literature. We have classified attacks, devised a taxonomy, and provided a comprehensive analysis of defense techniques for the most common attacks such as poisoning, evasion, model extraction, and model inversion. We have also presented various privacy preserving techniques to ensure the privacy of datasets. We have proposed a secure cryptographic framework for datasets based on hash functions and a Homomorphic Encryption (HE) scheme. Finally, we have provided recent research challenges and future studies concerning security and privacy issues. We believe that the highlighted limitations and weaknesses provide possible research questions and open matters for designing efficient future DL algorithms.
A time-varying state-space model for real-time temperature predictions in rack-based cooling data centers
2023, Applied Thermal Engineering
Fast-growing data centers (DCs) require efficient cooling systems (such as rack-based cooling architectures) and control strategies to reduce operating costs and guarantee desired indoor conditions. Thus, this study proposed a novel real-time temperature prediction model for rack-based cooling DCs, in order to facilitate advanced control regarding cooling management and workload assignment. Specifically, a data-driven technology was introduced to estimate time-invariant model parameters, in order to avoid the time-consuming physics-based parameters extracting process. The mass conservation relationships were employed to update time-varying flow parameters in real-time to capture the nonlinear behaviors in DCs. Moreover, the proposed control-oriented thermal modeling method can model hot air recirculation and cold air bypass occurring simultaneously for the first time. The performance of the developed time-varying state-space model was validated by CFD simulation data. Additionally, the timeliness of modeling and temperature prediction was also investigated. The results show that the developed model achieves sufficient accuracy with a mean absolute error (MAE) equal to 0.28 °C, even for long prediction horizons and dynamic IT workloads. Also, the developed model has outstanding timeliness for advanced control techniques, in terms of less than 30 min for parameter identification and less than 10 s for temperature prediction.
A decentralized adaptation of model-free Q-learning for thermal-aware energy-efficient virtual machine placement in cloud data centers
2023, Computer Networks
The traditional method of saving energy in Virtual Machine Placement (VMP) is based on consolidating more virtual machines (VMs) in fewer servers and putting the rest in sleep mode, which may lead to the overheating of servers, resulting in performance degradation and cooling cost. The lack of an accurate and computationally efficient model to describe the thermal condition of the data center environment makes it challenging to develop an effective and adaptive VMP mechanism. Although data-driven approaches have recently been successful in model construction, the shortage of clean, adequate, and sufficient amounts of data limits their generalizability. Moreover, any change in the data center configuration during operation makes these models prone to error and forces them to repeat the learning process. Thus, researchers turn to applying model-free paradigms such as reinforcement learning. Due to the vast action-state space of real-world applications, scalability is one of the significant challenges in this area. In addition, the delayed feedback of environmental variables such as temperature gives rise to exploration costs. In this paper, we present a decentralized implementation of reinforcement learning along with a novel state-action representation to perform the VMP in the data centers to optimize energy consumption and keep the host temperature as low as possible while satisfying Service Level Agreements (SLA). Our experimental results show more than 17% improvement in energy consumption and 12% in CPU temperature reduction compared to baseline algorithms. We also succeeded in accelerating optimal policy convergence after the occurrence of a configuration change.
Server temperature prediction using deep neural networks to assist thermal-aware scheduling
2022, Sustainable Computing: Informatics and Systems
Citation Excerpt:
Finally, DVFS is also employed to save energy wastage. Recently, MirhoseiniNejad et al. [28] study the relationship between heterogeneity in terms of cooling and server capacity. The proposed system integrates a neural network model to forecast the inlet temperatures during the workload distribution.
Thermal-Aware (TA) scheduling is a primitive thermal-management tool to avoid hotspots and attain a better thermal profile inside data centers. However, TA scheduling depends on the accuracy of server temperature calculation for making reliable scheduling decisions. Existing TA simulation frameworks rely on the CRAC, RC, and thermodynamics model to calculate the temperature of a server. However, these models are not very efficient computationally while calculating the temperature of computing nodes. Hence, there is a need for an efficient and lightweight temperature prediction model to fill this gap. Moreover, existing TA allocation and migration schemes neglect the workload heterogeneity regarding execution time (length) of user tasks in batch workloads. Ignoring task heterogeneity may lead to higher ambient temperature on its neighboring servers—assigning a hotter job with a longer duration to a server with higher ambient effect results in significantly higher cooling expenses.
Our contribution in this paper is twofold: (1) firstly, we design and train a deep neural network to predict the temperature of servers; our proposed DNN model achieves 96.11% prediction accuracy, and (2) secondly, we propose a TA algorithm for job allocation and migration. Specifically, we consider the length and thermal profile of user tasks and servers during allocation and migration. We compare the proposed strategy against existing TA scheduling designs such as TA Scheduling Algorithm (TASA) and TA Control Strategy (TACS) using simulations. Results demonstrate that the proposed TA approach significantly reduces the overall energy consumption by up to 12.03% and 8.28% compared to the TASA and TACS, respectively.

Dr. Ghada Badawy is an Adjunct Assistant Professor at the Computing and Software department and a Principal Research Engineer at the Computing Infrastructure Research Center (CIRC) at McMaster University. Before joining CIRC she worked at BlackBerry as an Advanced networks connectivity researcher where she has led multiple video over Wi-Fi and peer to peer research projects and authored multiple patents. She has also worked as a Postdoctoral fellow at McMaster University and Ryerson University and as a senior software engineer at IBM. Ghada received her Ph.D. degree in Computer Engineering from McMaster University.

Douglas G. Down received his B.A.Sc. and M.A.Sc. degrees from the University of Toronto (1986 and 1990) and his Ph.D. from the University of Illinois at Urbana-Champaign (1994). His interests lie in performance evaluation and resource allocation in distributed computer systems. He is currently the Academic Director of the Computing Infrastructure Research Centre at McMaster University, Canada.