Holistic thermal-aware workload management and infrastructure control for heterogeneous data centers using machine learning

https://doi.org/10.1016/j.future.2021.01.007

Highlights

  • We model thermal heterogeneity using a novel thermal model.

  • We incorporate low complexity data-driven thermal models for workload assignment and cooling control.

  • We jointly optimize the assignment of workload and the operational parameters of the cooling unit(s).

Abstract

Two key contributors to the energy expenditure in data centers are information technology (IT) equipment and cooling infrastructure. Standard data center practice does not tightly coordinate these two entities, resulting in considerable power wastage. Considering the cooling cost of different locations inside a data center (cooling heterogeneity) and the differing capabilities of servers to be cooled (server heterogeneity) has significant potential for saving power, yet has not been studied thoroughly in the literature. There is a need for approaches that integrate the control of IT and cooling units. Moreover, the literature still lacks an accurate and fast thermal model for temperature prediction inside a data center. In this paper, innovative approaches to quantify data center thermal heterogeneities are presented. Using data center thermal models, the cost of providing cold air at the front of servers can be (indirectly) calculated, and the capability of servers to be cooled is formulated. Our approach assigns jobs to locations that are efficient to cool (from the perspectives of both servers and cooling units) and tunes cooling unit parameters. The method, called holistic data center infrastructure control (HDIC), has the potential to save a considerable amount of power by exploiting synergies between the workload scheduler and the operational parameters of cooling units.

Introduction

Two percent of power consumption in the United States in 2014 was due to data centers, equivalent to approximately 70 billion kWh [1]. In contrast, the power consumption of data centers in 2000 was 30 billion kWh [2]. It has been estimated that from 2015 to 2020, the incoming load to data centers will double [3]. The increasing number of online and mobile applications, public interest in accessing online entertainment, and cloud services for both personal and business users play a significant role in this jump [4]. Anticipation of this increase, together with power usage constraints, has led large data center operators and vendors to invest more in the efficient use of power [1].

There are several methods and techniques to reduce power consumption at different levels of a data center. At the device level, some electronic devices support low power states to save energy, if the performance of the device is not impacted [5], [6]. For example, dynamic voltage and frequency scaling (DVFS) is a method that provides different levels of power consumption and performance for processors [7], [8]. At the server level, dynamic suspension of unneeded servers, server consolidation, and the ability to choose different levels of power and performance are vital approaches for energy efficiency. For instance, server consolidation aims to save power by turning unneeded servers off during low workload periods [9], [10], [11]. At the facility level, power efficiency of the cooling system itself is also a significant concern [12], [13], [14].

Different servers and locations in data centers are not cooled equally, resulting in what we call data center thermal heterogeneity. In other words, servers differ in their cooling requirements (server heterogeneity), and locations differ in their cooling cost (cooling heterogeneity). Cooling heterogeneity refers to the fact that not all locations in a data center benefit to the same degree from a particular cooling unit. Related works in the literature have either simplified or ignored the heterogeneity that exists in the data center environment when studying workload assignment or cooling control. We have studied the cost-saving opportunities that exist due to server heterogeneity during workload assignment [15], and also due to cooling heterogeneity [16]; however, no study has considered all aspects of data center thermal heterogeneity to control cooling unit parameters and assign workload.

In this paper, a holistic data center infrastructure control (HDIC) framework is presented. HDIC is a novel method to exploit all aspects of data center thermal heterogeneity and uses them as an opportunity to save power during data center control. The proposed framework employs neural networks to construct thermal models for the data center and individual servers. Server thermal models are used to estimate the core temperature of servers, and a data center thermal model is used to predict the inlet temperatures of servers. These have the attraction of being data-driven models, as building accurate physical models for data center thermal dynamics is notoriously tricky.
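
As a rough illustration of what a data-driven thermal model can look like, the sketch below trains a tiny one-hidden-layer neural network in plain Python to map a utilization, a fan speed, and a water temperature (all normalized to [0, 1]) to a normalized inlet temperature. The network size, the synthetic data generator, and every name in the code are assumptions made for illustration; the model architecture, inputs, and training data actually used for HDIC are not given in this excerpt.

```python
import math
import random


def make_synthetic_samples(n, rng):
    """Hypothetical stand-in for logged sensor data (not from the paper)."""
    samples = []
    for _ in range(n):
        util, fan, water = rng.random(), rng.random(), rng.random()
        # Invented ground truth, used only to generate training targets.
        inlet = 0.5 * util - 0.3 * fan + 0.4 * water + 0.2 * util * (1.0 - fan)
        samples.append(((util, fan, water), inlet))
    return samples


class TinyMLP:
    """One hidden tanh layer and a linear output, trained by plain SGD."""

    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, y, lr):
        hidden, y_hat = self.forward(x)
        err = y_hat - y  # gradient of 0.5 * (y_hat - y)^2 w.r.t. y_hat
        for j, h in enumerate(hidden):
            # Backpropagate through tanh before touching w2[j].
            grad_pre = err * self.w2[j] * (1.0 - h * h)
            self.w2[j] -= lr * err * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_pre * xi
            self.b1[j] -= lr * grad_pre
        self.b2 -= lr * err


def mse(model, samples):
    return sum((model.predict(x) - y) ** 2 for x, y in samples) / len(samples)


rng = random.Random(0)
train = make_synthetic_samples(200, rng)
model = TinyMLP(n_in=3, n_hidden=6, rng=rng)

mse_before = mse(model, train)
for _ in range(100):          # epochs
    for x, y in train:
        model.train_step(x, y, lr=0.05)
mse_after = mse(model, train)

print(f"training MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
```

Once fitted, such a model is cheap to evaluate, which is what makes it usable inside an optimization loop that must query predicted temperatures many times.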

The generated thermal models incorporate both cooling and server heterogeneity. These models can then be used by an optimizer to control the system in a power-efficient manner. We demonstrate that the solutions to the underlying optimization problem lead to considerable power savings while maintaining IT performance. Our contributions in this paper can be summarized as follows:

  • We incorporate low complexity data-driven thermal models to take thermal heterogeneity in data centers into account during workload assignment and cooling control.

  • We present an optimization framework that can jointly optimize the assignment of workload and the operational parameters of the cooling unit(s), while respecting the expected performance of IT equipment.

  • We show advantages of using thermal differences between servers and locations in a data center using thermal models.

In the next section, related work is classified and reviewed. In Section 3, the architecture of the system under study is illustrated and the required models to formulate the problem are explained in Section 3.2. In Section 4, the methodology for cooling control and workload assignment is discussed and techniques to optimize the data center control parameters are explained. The solution of the developed optimization problems is discussed in Section 5, and HDIC is compared with other representative methods. Finally, concluding remarks are in Section 6. A summary of the notation used in this paper is listed in Table 1.

Literature review

There is a significant body of literature on this topic, studying various control methods, workload assignment frameworks, and thermal models for data centers. In this section, previous works related to our contributions are reviewed in three groups: data center thermal models, thermal-aware workload assignment frameworks, and thermal-aware control methods.

There are various methods of temperature prediction for data centers (data center thermal models). Computational fluid dynamics (CFD) is a traditional

System architecture and models

In this section, the architecture of the data center under study is provided. The steps to acquire data and then to build data center and server thermal models are explained and the power consumption model is formulated.

Thermal-aware cooling control and workload assignment

Thermal models make it possible to explore data center thermal heterogeneity. In this section, two approaches are described; they are compared later to demonstrate the efficiency of HDIC.

In the first approach, only cooling heterogeneity is considered, via the data center thermal model. This approach is called cooling heterogeneity-aware infrastructure control, or CHIC. The second approach is HDIC, which uses both the data center and server thermal models for control decisions.
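
To make the idea of a joint decision concrete, the toy sketch below searches over workload splits across three servers together with a single fan speed, using a stand-in data center thermal model (inlet temperature) and a stand-in server thermal model (CPU temperature). It is not the HDIC or CHIC formulation: the linear models, all coefficients, the CPU limit, the exhaustive grid search, and the cubic fan power relation (the usual fan affinity approximation) are assumptions chosen only to show why heterogeneity-aware placement can allow cheaper cooling.

```python
from itertools import product

# All numbers are invented for illustration; none come from the paper.
WATER_TEMP_C = 18.0                 # fixed supply temperature
LOCATION_COEFF = [2.0, 4.0, 6.0]    # cooling heterogeneity: larger = harder to cool
SERVER_COEFF = [30.0, 40.0, 50.0]   # server heterogeneity: CPU rise at 100% load
CPU_LIMIT_C = 60.0
TOTAL_WORK_QUARTERS = 6             # total load of 1.5 servers, in steps of 0.25
FAN_LEVELS = [f / 10 for f in range(2, 11)]  # 0.2 ... 1.0, ascending


def inlet_temp(i, fan):
    """Toy data center thermal model: inlet temperature drops as fan speed rises."""
    return WATER_TEMP_C + LOCATION_COEFF[i] / fan


def cpu_temp(i, util, fan):
    """Toy server thermal model: CPU temperature from inlet temperature and load."""
    return inlet_temp(i, fan) + SERVER_COEFF[i] * util


def fan_power(fan):
    """Fan power grows roughly with the cube of fan speed."""
    return fan ** 3


def cheapest_fan(utils):
    """Lowest fan level keeping every CPU under the limit, or None if none does."""
    for fan in FAN_LEVELS:
        if all(cpu_temp(i, u, fan) <= CPU_LIMIT_C for i, u in enumerate(utils)):
            return fan
    return None


def joint_search():
    """Exhaustively pick the workload split and fan speed with least fan power."""
    best = None
    for quarters in product(range(5), repeat=3):
        if sum(quarters) != TOTAL_WORK_QUARTERS:
            continue
        utils = tuple(q / 4 for q in quarters)
        fan = cheapest_fan(utils)
        if fan is not None and (best is None or fan_power(fan) < best[0]):
            best = (fan_power(fan), utils, fan)
    return best


uniform_fan = cheapest_fan((0.5, 0.5, 0.5))
best_power, best_utils, best_fan = joint_search()
print("uniform split needs fan level", uniform_fan)
print("joint search picks", best_utils, "at fan level", best_fan)
```

With these made-up coefficients, the joint search loads the servers that are cheapest to cool and can therefore run the fan more slowly than an even split allows. A real problem of this kind has continuous variables and nonlinear learned models, so it calls for a nonlinear solver rather than enumeration.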

Results and comparison

Both optimization problems must be solved by nonlinear solution methods, as both the cost function and the thermal models are nonlinear. We used the MATLAB fmincon tool with the interior-point option to solve these optimization problems. A complete description of the data center configuration is given in Section 3.1. Briefly, for this data center configuration, the decision variables are the utilizations of 40 servers, the speeds of five fans, and one inlet water temperature. Due to the

Conclusion

Considering all aspects of data center thermal heterogeneity for workload assignment and cooling control can result in considerable savings in cooling power consumption. Data center heterogeneity can be captured by means of data center and server thermal models. The data center thermal model predicts the temperature of different locations as a function of IT and cooling parameters. This thermal model is used to indirectly calculate the cost of providing cool air for a specific

CRediT authorship contribution statement

SeyedMorteza MirhoseiniNejad: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Ghada Badawy: Methodology, Writing - review & editing, Supervision, Funding acquisition. Douglas G. Down: Methodology, Writing - review & editing, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was supported by a Collaborative Research and Development grant CRDPI506142-16 from the Natural Sciences and Engineering Research Council of Canada (NSERC). We would like to acknowledge the useful comments of the anonymous referees.

SeyedMorteza MirhoseiniNejad is a Ph.D. student in Computer Science at McMaster University. He received his M.Sc. degree from Iran University of Science and Technology and his B.Sc. degree from Bahonar University of Kerman, both in Computer Engineering. His research interests are machine learning, optimization, queueing theory, data analysis, resource management, and predictive control and maintenance.

References (40)

  • Klemick H. et al. Data Center Energy Efficiency Investments: Qualitative Evidence from Focus Groups and Interviews. Tech. rep. (2017)

  • Gupta M. et al. Using low-power modes for energy conservation in Ethernet LANs

  • Yadava N. et al. Design of one-transistor SRAM cell for low power consumption

  • Aldahari E. Dynamic voltage and frequency scaling enhanced task scheduling technologies toward green cloud computing

  • Ge R. et al. Performance-constrained distributed DVS scheduling for scientific applications on power-aware clusters

  • Meisner D. et al. Power management of online data-intensive services. ACM SIGARCH Comput. Archit. News (2011)

  • Lin M. et al. Dynamic right-sizing for power-proportional data centers. IEEE/ACM Trans. Netw. (2013)

  • Krioukov A. et al. NapSAC: Design and implementation of a power-proportional web cluster. SIGCOMM Comput. Commun. Rev. (2011)

  • Tang Q. et al. Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: A cyber-physical approach. IEEE Trans. Parallel Distrib. Syst. (2008)

  • Bash C. et al. Cool job allocation: Measuring the power savings of placing jobs at cooling-efficient locations in the data center

    Dr. Ghada Badawy is an Adjunct Assistant Professor at the Computing and Software department and a Principal Research Engineer at the Computing Infrastructure Research Center (CIRC) at McMaster University. Before joining CIRC she worked at BlackBerry as an Advanced networks connectivity researcher where she has led multiple video over Wi-Fi and peer to peer research projects and authored multiple patents. She has also worked as a Postdoctoral fellow at McMaster University and Ryerson University and as a senior software engineer at IBM. Ghada received her Ph.D. degree in Computer Engineering from McMaster University.

    Douglas G. Down received his B.A.Sc. and M.A.Sc. degrees from the University of Toronto (1986 and 1990) and his Ph.D. from the University of Illinois at Urbana-Champaign (1994). His interests lie in performance evaluation and resource allocation in distributed computer systems. He is currently the Academic Director of the Computing Infrastructure Research Centre at McMaster University, Canada.
